An Order-Aware Dataflow Model for Parallel Unix Pipelines
Shivam Handa, Konstantinos Kallas, Nikos Vasilakis, Martin Rinard

Abstract: We present a dataflow model for modelling parallel Unix shell pipelines. To accurately capture the semantics of complex Unix pipelines, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implement our model and transformations as the compiler and optimization passes of a system parallelizing shell pipelines, and use it to evaluate the speedup achieved on 47 pipelines.

Unix pipelines are an attractive choice for specifying succinct and simple programs for data processing, system orchestration, and other automation tasks [McIlroy et al. 1978]. Consider, for example, the following program based on the original spell written by Johnson [Bentley 1985], lightly modified for modern environments:

    1  cat f1.md f2.md | tr A-Z a-z | tr -cs A-Za-z '\n' | sort | uniq |    # (Spell)
    2  grep -vx -f dict.txt - > out ; cat out | wc -l | sed 's/$/ mispelled words!/'

The first command streams two markdown files into a pipeline that converts characters in the stream into lower case, removes punctuation, sorts the stream in alphabetical order, removes duplicate words, and filters out words from a dictionary file (lines 1 and 2, up to ";"). A second pipeline (line 2, after ";") counts the resulting lines to report the number of misspelled words to the user.

As this example illustrates, the Unix shell offers a programming model that facilitates the composition of commands using unidirectional communication channels that feed the output of one command as an input to another. These channels are either ephemeral, unnamed pipes expressed using the | character and lasting for the duration of the producer and consumer, or persistent, named pipes (Unix FIFOs) created with mkfifo and lasting until explicitly deleted. Each command executes sequentially, with pipelined parallelism available between commands executing in the same pipeline. Unfortunately, this model leaves substantial data parallelism, i.e., parallelism achieved by splitting inputs into pieces and feeding the pieces to parallel instances of the script, unexploited. This fact is known in the Unix community and has motivated the development of a variety of tools that attempt to exploit latent data parallelism in Unix scripts [Raghavan et al. 2020; Tange 2011; Vasilakis et al. 2021]. On the one hand, tools such as GNU Parallel [Tange 2011] can be used by experienced users to achieve parallelism, but can also easily lead to incorrect results. On the other hand, two recent systems, PaSh [Vasilakis et al. 2021] and POSH [Raghavan et al. 2020], focus respectively on extracting data parallelism latent in Unix pipelines to improve the execution time of (i) CPU-intensive shell scripts (PaSh) and (ii) networked IO-intensive shell scripts (POSH).
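To make the risk of naive manual parallelization concrete, here is a minimal sketch, assuming GNU Parallel is installed, of how it can silently change a pipeline's output; the file names and block size are illustrative:

    # Sequential execution: output lines appear in input order.
    seq 100000 | grep 9 > seq.out
    # Naive parallelization: stdin is split into ~64KB chunks, each handled by a
    # separate grep; without -k, chunks are emitted in completion order.
    seq 100000 | parallel --pipe --block 64K grep 9 > par.out
    diff seq.out par.out   # may report differences caused by chunk reordering

Here the -k flag restores the sequential order, but, as the case study in §7 shows, -k is not sufficient for every command. Systems like PaSh and POSH aim to perform such parallelization automatically and correctly.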
These systems achieve order-of-magnitude performance improvements on sequential pipelines, but their semantics and associated transformations are not clearly defined, making it difficult to ensure that the optimized parallel scripts are sound with respect to the sequential ones. To support the ability to reason about and correctly transform Unix shell pipelines, we present a new dataflow model. In contrast to standard dataflow models [Kahn 1974; Kahn and MacQueen 1977; Karp and Miller 1966; Lee and Messerschmitt 1987a,b], our dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. This model is different from models that allow multiplexing different chunks of data in a single channel, such as sharding or tagging, or ones that are oblivious to ordering, such as shuffling, and is a direct byproduct of the ordered semantics of the shell and the opacity of Unix commands. In the Spell script shown earlier, for example, while all commands consume elements from an input stream in order (a property of Unix streams, e.g., pipes and FIFOs), they differ in how they consume across streams: cat reads input streams in the order of its arguments, sort -m reads input streams in an interleaved fashion, and grep -vx -f first reads dict.txt before reading from its standard input.

We use this order-aware dataflow model (ODFM) to express the semantics of transformations that exploit data parallelism available in Unix shell computations. These transformations capture the parallelizing optimizations performed by both PaSh [Vasilakis et al. 2021] and POSH [Raghavan et al. 2020]. We also use our model to prove that these transformations are correct, i.e., that they do not affect the program behavior with respect to the sequential output. Finally, we formalize the bidirectional translations between the shell and ODFM, namely from the shell language to ODFM and vice versa, closing the loop for a complete shell-to-shell parallelizing compiler.

To illustrate the applicability of our model, we extend PaSh by reimplementing its compilation and optimization phases with ODFM as the centerpiece. The new implementation translates fragments of a shell script to our dataflow model, applies a series of parallelizing transformations, and then translates the resulting parallel dataflow graph back to a parallel shell script that is executed instead of the original fragment. Our new implementation improves modularity and facilitates the development of different transformations independently on top of ODFM. We use the new implementation to evaluate the benefit of a specific transformation by parallelizing 47 unmodified shell scripts with and without this transformation and measuring their execution times. Finally, we present a case study in which we parallelize two scripts using GNU Parallel [Tange 2011]; our experience indicates that while it is easy to parallelize shell scripts using such a tool, it is also easy to introduce bugs, leading to incorrect results.

In summary, this paper makes the following contributions:

• Order-Aware Dataflow Model: It introduces the order-aware dataflow model (ODFM), a dataflow model tailored to the Unix shell that captures information about the order in which nodes consume inputs from different input edges (§4).
• Translations: It formalizes the bidirectional translations between the shell and ODFM required to translate shell scripts into dataflow graphs and dataflow graphs back to parallel scripts (§5).
• Transformations and Proofs of Correctness: It presents a series of ODFM transformations for extracting data parallelism. It also presents proofs of correctness for these transformations (§6).
• Results: It reimplements the optimizing compiler of PaSh and presents experimental results that evaluate the speedup afforded by a specific dataflow transformation (§7).

The paper starts with an informal development building the necessary background (§2) and expounding on Spell (§3). It then presents the three main contributions outlined above (§4-7), compares with prior work (§8), and offers a discussion (§9), before closing the paper (§10).

This section reviews background on commands and abstractions in the Unix shell. A key Unix abstraction is the data stream, operated upon by executing commands or processes. Streams are sequences of bytes, but most commands process them as higher-level sequences of line elements, with the newline character delimiting each element and the EOF condition representing the end of a stream. Streams are often referenced using a filename, i.e., an identifier in a global name-space made available by the Unix file system, such as /home/user/x. Streams and files in Unix are isomorphic: some streams can persist as files beyond the execution of a process, whereas other streams are ephemeral in that they only exist to connect the output of one process to the input of another process during their execution. The sequence order is maintained when changing between persistent files and ephemeral streams.

Commands: Each command is an independent computation unit that reads one or more input streams, performs a computation, and produces one or more output streams. Contrary to languages with a closed set of primitives, there is an unlimited number of Unix commands, each one of which may have arbitrary behaviors, with the command's side-effects potentially affecting the entire environment in which it is executing. These commands may be written in any language or exist only in binary form, and thus Unix is not easily amenable to a single parallelizability analysis. Parallelization tools such as GNU Parallel leave such analysis to developers, who have to ensure that the script behavior will not be affected by parallelization, whereas transformation-based tools such as PaSh and POSH identify key invariants that hold for entire classes of commands and then resort to annotation libraries to infer whether each invariant is satisfied by each command. For example, an invariant that is used in both PaSh and POSH is whether a command is stateless, i.e., whether it maintains no state while processing its inputs, processing each input line independently. Commands that satisfy this invariant can be parallelized by splitting their inputs in lines and then combining their outputs.

Command flags: Unix commands are often configurable, customizing their behavior based on the task at hand. This is usually achieved via environment variables and command flags. Tools like PaSh and POSH address command behavior variability due to flags by including flags and command arguments in their annotation frameworks. Given command annotations, PaSh and POSH can abstract specific command invocations in a pipeline to black boxes for which certain assumptions hold.
This makes them applicable in the context of the shell, where the space of possible command and flag combinations is exceedingly large.

Order of input consumption: In Unix, all streams are ordered, and all commands can safely assume that they can consume elements from their streams in the order they were produced. Additionally, most commands have the ability to operate on multiple files or streams. The order in which commands access these streams is important. In some cases, they read streams in the order of the stream identifiers provided. In other cases, the order is different; for example, an input stream may configure a command, and thus must be read before all the others. Consider for example grep -f words.txt input.txt, which first reads words.txt to determine the keywords for which it needs to search, and then reads input.txt line by line, emitting all lines that contain one of the words in words.txt. In yet other cases, reads from multiple streams are interleaved according to some command-specific semantics.

Composition Operators: Unix provides several primitives for program composition, each of which imposes different scheduling constraints on the program execution. Central among them is the pipe (|), a primitive that passes the output of one process as input to the next. The two processes form a pipeline, producing output and consuming input concurrently and possibly at different rates. The Unix kernel facilitates program scheduling, communication, and synchronization behind the scenes. For example, Spell's first tr transforms each character in the input stream to lower case, passing the stream to the second tr: the two trs form a parallel producer-consumer pair of processes. Apart from pipes, the language of the Unix shell provides several other forms of program composition, e.g., the sequential composition operator (;) for executing one process after another has completed, and control structures such as if and while. All of these constructs enforce execution ordering between their components. To preserve such ordering and thus ensure correctness, systems such as PaSh and POSH do not "push" parallelization beyond these constructs. Instead, they focus on exploiting parallelism in script regions that do not face ordering constraints, which, as they demonstrate, is enough to significantly improve the performance of scripts found in the wild [Raghavan et al. 2020; Vasilakis et al. 2021].

This section provides intuition for the order-aware dataflow model proposed in this paper by following the different phases of a shell-to-shell parallelizing compiler (inspired by PaSh and POSH), formalized in the later sections. Given a script such as Spell (§1), the compiler identifies its dataflow regions, translates them to DFGs (Shell→ODFM), applies graph transformations that expose data parallelism on these DFGs, and replaces the original dataflow regions with the now-parallel regions (ODFM→Shell).

Shell→ODFM: Provided a shell script, the compiler starts by identifying subexpressions that are potentially parallelizable. The first step is to parse the script, creating an abstract syntax tree like the one presented on the right. Here we omit any non-stream flags and refer to all the stages between (and including) tr and sort as a dotted edge ending with cat. The compiler then identifies parallelism barriers within the shell script: these barriers are operators that enforce synchronization constraints, such as the sequential composition operator (";").
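In Spell, for instance, the ";" acts as such a barrier, separating the script into the two command sequences below (reproduced from §1), which the compiler treats independently:

    # First region: the spell-checking pipeline (up to the ';')
    cat f1.md f2.md | tr A-Z a-z | tr -cs A-Za-z '\n' | sort | uniq |
      grep -vx -f dict.txt - > out
    # Second region: the counting/reporting pipeline (after the ';')
    cat out | wc -l | sed 's/$/ mispelled words!/'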
We call any set of commands that does not include a dataflow barrier a dataflow region. Dataflow regions are then transformed to dataflow graphs (DFGs), i.e., instances of our order-aware dataflow model. In our example, there are two dataflow regions, corresponding to two dataflow graphs: DFG1 for the spell-checking pipeline and DFG2 for the counting pipeline. As mentioned earlier (§2), the compiler exposes parallelism in each DFG separately to preserve the ordering requirements imposed to ensure correctness. For the rest of this section we focus on the parallelization of DFG1.

    Command         Aggregation function
    cat             cat $*
    tr A-Z a-z      cat $*
    tr -d a         cat $*
    sort            sort -m $*
    uniq            uniq $*
    grep -f a -     cat $*
    wc -l           paste -d+ $* | bc
    sed 's/a/b/'    cat $*

Parallelizable Commands: Individual nodes of the dataflow graphs are shell commands. Systems like PaSh and POSH assume key information for individual commands, e.g., whether they are amenable to divide-and-conquer data parallelism. Such data parallelism is achieved by splitting the input into pieces (at stream element boundaries), processing partial inputs in parallel, and finally applying an aggregation function to the partial outputs to produce the final output. This decomposition breaks a command into two components: a data-parallel function, which is often the command itself, and an aggregation function. The table above presents aggregation functions for the shell commands in our example (all of which are parallelizable). For example, consider the decomposition of the tr command. Applying tr over the entire input produces the same result as splitting the input into two, applying tr to the two partial inputs, and then merging the partial results with a cat aggregation function. Note that both split and cat are order-aware: split sends the first half of its input to the first tr and the rest to the second, while cat concatenates its inputs in order. This guarantees that the output of the DFG is the same as the one before the transformation.

Parallelization Transformations: Given the decomposition of individual commands, the compiler's next step is to apply graph transformations to exploit the parallelism present in the computation represented by the DFG. As each parallelizable Unix command comes with a corresponding aggregation function, the compiler's transformations first convert the DFG into one that exploits parallelism at each stage. After applying the transformation to the two tr stages, each tr node has been replaced by a split feeding two parallel copies of tr, whose outputs are merged by a cat. After these transformations are applied to all DFG nodes, the next transformation pass is applied to pairs of cat and split nodes: whenever a cat is followed by a split of the same width, the transformation removes the pair and connects the parallel streams directly to each other. The goal is to push data-parallelism transformations as far down the pipeline as possible to expose the maximal amount of parallelism. Applied between the two parallelized tr stages, this pass removes the intermediate cat-split pair, connecting the parallel tr streams directly. The next node to parallelize is sort. To merge the partial outputs of parallel sorts, we need to apply a sorted merge. (In GNU systems, this is available as sort -m, so we use this as the label of the merging node.) The transformation then removes the cat, replicates sort, and merges the outputs of the parallel sorts with sort -m. As mentioned earlier, a similar pass of iterative transformations is applied to DFG2, but the two DFGs are not merged, preserving the synchronization constraint of the dataflow barrier ";".

Order Awareness: Data-parallel systems [Dean and Ghemawat 2008; Zaharia et al.
2012] often achieve parallelism using sharding, i.e., partitioning input based on some key, or using shuffling, i.e., arbitrary partitioning of inputs to parallel instances of an operator. However, these techniques cannot be directly applied in the context of the shell, since (1) Unix commands and pipelines assume strict ordering of their input elements, (2) most commands are not independent on the basis of some key (which would enable sharding), and (3) many commands are not commutative (e.g., uniq, cat -n). Since our goal is to define a model that applies directly to existing shell scripts, we cannot simply introduce new primitives that support sharding or shuffling, as is done in systems that design an abstraction to fit their needs (e.g., MapReduce, Spark). Thus, data parallelism in the shell requires a careful treatment of input and output ordering. To further explain the need for order-awareness in a model for data-parallel Unix pipelines, let us look at two examples.

Consider Spell's cat f1.md f2.md command, which starts reading from f2.md only after it has completed reading f1.md; note that either or both input streams may be pipes waiting for results from other processes. This order can be visualized as a label over each input edge. Correctly parallelizing this command requires ensuring that the parallel cat (and possibly follow-up stages) maintains this order. As a more interesting example, consider Spell's grep, whose DFG is shown on the right. Parallelizing grep without taking order into account is not trivial, because the set difference computed by grep -vx -f is not commutative: we cannot simply split its input streams into two pairs of partial inputs fed into two copies of grep. Taking input ordering into account, however, highlights an important dependency between grep's inputs. The dict stream can be viewed as configuring grep, and thus grep can be modeled as consuming the entire dict stream before consuming partial inputs. Armed with this insight, the compiler parallelizes grep by passing the same dict.txt stream to both grep copies. This requires an intermediary tee for duplicating the dict.txt stream to both copies of grep, each of which consumes the stream in its entirety before consuming the results of the preceding uniq. Order-awareness is also important for the translation of the DFG back to a shell script. In this specific example we need to know how to instantiate the arguments of each grep out of all the possible options, e.g., grep -vx -f p1 p2, cat p1 | grep -vx -f - p2, etc. Aggregators are Unix commands with their own ordering characteristics that need to be accounted for. The order of input consumption in the examples of this section is statically known and can be represented for each node as a set of configuration inputs plus a sequence of the rest of its inputs. To accurately capture the behavior of shell programs, however, ODFM is more expressive, allowing any order of input consumption. The correctness of our parallelization transformations is predicated upon static but configurable orderings: a command reads a set of configuration streams to set up the consumption order of its input streams, which are then consumed in order, one after the other.

ODFM→Shell: The transformed graph is finally compiled back to a script that uses POSIX shell primitives to drive parallelism explicitly. A benefit of the dataflow model is that it can be directly implemented on top of the shell, simply translating each node to a command and each edge to a stream.
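As a minimal sketch of this strategy, consider compiling a two-node fragment such as tr A-Z a-z | sort; the file and FIFO names below are illustrative:

    mkfifo t1 t2                # prologue: one named pipe per DFG edge
    cat input.txt > t1 &        # feed the input file into the first edge
    tr A-Z a-z < t1 > t2 &      # each node becomes a background command
    sort < t2 > output.txt &    # reading and writing its edges' FIFOs
    wait                        # epilogue: wait for all nodes to complete
    rm t1 t2                    # remove the named pipes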
The generated parallel script for Spell consists of two fragments of this shape, one per DFG, each starting with mkfifo and ending with rm. Each fragment uses a series of named pipes (FIFOs) to explicitly manipulate the input and output streams of each data-parallel instance, effectively laying out the structure of the DFG using explicit channel naming. (Unix FIFOs are named in the file system, similar to normal files.) Aggregation functions are used to merge partial outputs from previous commands coming in through multiple FIFOs, for example, sort -m t4 t6 and cat t11 t12 for the first fragment, and paste -d+ t2 t3 | bc and cat t7 t8 for the second. A wait blocks until all commands executing in parallel complete. The parallel script is simplified for clarity of exposition: it does not show the details of input splitting, the handling of SIGPIPE deadlocks, and other technical details that are handled by the current implementation. Readers might be wondering about the correctness of having two sed commands in the parallel script: won't the string "mispelled words" appear twice in the output? Note, however, that the output of the wc stage (FIFO t4) contains a single line. As a result, the second sed will not be given any input line and thus will not produce any output.

In this section we describe the order-aware dataflow model (ODFM) and its semantics. As discussed earlier (§2), the two main shell abstractions are (i) data streams and (ii) commands communicating via streams. We represent streams as named variables and commands as functions that read from and write to streams. We first introduce some basic notation formalizing the data streams on which our dataflow description language works. For a set Σ, we write Σ* to denote the set of all finite words over Σ. For words x, y ∈ Σ*, we write x · y or xy to denote their concatenation. We write ε for the empty word and ⊥ for the end-of-file condition. We say that x is a prefix of y, and we write x ≤ y, if there is a word z such that y = xz. The ≤ relation is reflexive, antisymmetric, and transitive (i.e., it is a partial order), and is often called the prefix order. We use the notation Σ* · ⊥ to denote closed streams, abstractly representing a file/pipe stream that has been closed, i.e., one which no process will open for writing. The notation Σ* is used to denote open streams, abstractly representing open pipes: later, other processes may add new elements at the end of such a value. In the rest of our formalization we focus on terminating streams, and therefore terminating programs, since all of the data processing scripts that we have encountered are terminating. We discuss extensions for infinite streams in §9.

Figure 1 presents the Dataflow Description Language (DDL) for defining dataflow graphs (DFGs). A program p ∈ P in DDL is of the form ⟨I; O; E⟩. I and O represent sets of edges, given as vectors of the form x̄ = ⟨x₁, x₂, …⟩. Variables x₁, x₂, … represent DFG edges, i.e., streams used as communication channels between DFG nodes and as the input and output of the entire DFG. I is of the form input x̄, where x̄ holds the input variables. Each variable x ∈ I represents a file file(x) that is read from the Unix file system. Note that multiple input variables can refer to the same file. O is of the form output ȳ, where ȳ holds the output variables. Each variable y ∈ O represents a file file(y) that is written to the Unix file system. E represents the nodes of the DFG. A node ȳ ← f(x̄) represents a function f from a list of input variables (edges) x̄ to output variables (edges) ȳ.
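For instance, the first two stages of Spell's first dataflow region could be written as the following DDL program; the variable names are illustrative:

    input ⟨x₁, x₂⟩; output ⟨y⟩;
    v ← cat(x₁, x₂)    (file(x₁) = f1.md, file(x₂) = f2.md)
    y ← tr(v)          (the node corresponding to tr A-Z a-z)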
We require that f is monotone with respect to the pointwise lifting of the prefix order to sequences of inputs; that is, x̄ ≤ x̄′ implies f(x̄) ≤ f(x̄′). This captures the idea that a node cannot retract output that it has already produced. We wrap all functions with an execution wrapper ⌈·⌉ that ensures that all outputs of f are closed when its inputs are closed: if f(v₁ · ⊥, …, vₙ · ⊥) = ⟨u₁, …, uₘ⟩, then ⌈f⌉(v₁ · ⊥, …, vₙ · ⊥) = ⟨u₁ · ⊥, …, uₘ · ⊥⟩. This is helpful to ensure termination. From now on, we only refer to the wrapped function semantics. We also assume that commands do not produce output if they have not consumed any input, i.e., f(ε, …, ε) = ⟨ε, …, ε⟩. A variable in DDL is assigned only once and consumed by only one node, and DDL does not allow the dataflow graph to contain any cycles. This also holds for the variables in I and O: variables in I are never assigned a value by a node in E, and, similarly, variables in O are not read by any node in E. All variables which are not included in I and O abstractly represent temporary files/pipes which are created during the execution of a shell script. We assume that within a dataflow program, all variables are reachable from some input variables.

Execution Semantics: Figure 2 presents the small-step execution semantics for DDL. The map Γ associates each variable name with the data contained in the stream it represents. The map Δ associates each variable name with the prefix of that data that has already been processed, representing the read-once semantics of Unix pipes. Let ⟨y₁, …, yₘ⟩ ← f(x₁, …, xₙ) be a node in our DFG program. The function choice_f represents the order in which a command consumes its inputs by returning a set of input indexes on which the function blocks waiting to read. For example, the choice function for the command cat always returns the next non-closed index, as cat reads its inputs in sequence, each one until depletion. For a choice function to be valid, it has to return input indexes that have not been closed yet. Formally, for all v₁, …, vₙ and all i ∈ choice_f(v₁, …, vₙ), vᵢ ∈ Σ*, i.e., vᵢ is not closed. We assume that the set returned by choice_f cannot be empty unless all input indexes are closed, meaning that all nodes consume all of their inputs until depletion, even if they do not need the rest for processing.

The small-step semantics nondeterministically picks a variable xᵢ such that i ∈ choice_f(Δ(x₁), …, Δ(xₙ)), i.e., f is waiting to read some input from xᵢ, and Δ(xᵢ) < Γ(xᵢ), i.e., there is data on the stream represented by variable xᵢ that has yet to be processed. The execution then retrieves the next message k to process, and computes new messages r₁, …, rₘ to pass on to the output streams y₁, …, yₘ. Note that any of these messages (input or output) might be ⊥. We pass Δ(xᵢ) · k, which denotes that the previously processed data is now being combined with the new message k, to the function f. For all functions f and new messages k, given ⟨v′₁, …, v′ₘ⟩ = f(v₁, …, vᵢ, …, vₙ), we assume the following constraint holds: there exist messages r₁, …, rₘ such that f(v₁, …, vᵢ · k, …, vₙ) = ⟨v′₁ · r₁, …, v′ₘ · rₘ⟩. This constraint ensures that first processing the arguments v₁, …, vᵢ, …, vₙ and then the message k appended to the i-th argument stream is equivalent to processing the messages v₁, …, vᵢ · k, …, vₙ at once. Having this property allows our system to process messages as they arrive or to wait for all the messages to arrive, without changing the semantics of the execution. The messages r₁, …, rₘ are passed on to their respective output streams (by updating Γ). Note that the sizes of the output messages could vary, and they could even be empty. Finally, Δ is updated to denote that k has been processed.

Execution: Let ⟨I, O, E⟩ be a dataflow program, where I = input x̄ holds the input variables and O = output ȳ holds the output variables.
Let Δ_init be the initial mapping taking all variable names in the dataflow program ⟨I, O, E⟩ to the empty word ε. Let Γ_init be the initial mapping for the variables in the dataflow program, such that all non-input variables x ∉ x̄ map to the empty word, Γ_init(x) = ε. In contrast, all input variables x ∈ x̄, i.e., files already present in the file system, are mapped to the contents v of the respective input file: Γ_init(x) = v · ⊥. When no more small-step transitions can take place (i.e., all commands have finished processing), the dataflow execution terminates and the contents of the output variables in O can be written to their respective output files. Figure 3 presents the constraint that has to be satisfied by Γ at the end of the execution, i.e., when all variables have been processed. We now prove some auxiliary theorems and lemmas to show that dataflow programs always terminate and that, when they terminate, the constraint in Figure 3 holds.

Theorem 4.1. At any point during the execution of the DFG, for every node ⟨y₁, …, yₘ⟩ ← f(x₁, …, xₙ) ∈ E, the following statement is true: ⟨Γ(y₁), …, Γ(yₘ)⟩ = f(Δ(x₁), …, Δ(xₙ)).

Proof by induction on the number of execution steps.

Base Case: Let Γ_init and Δ_init be the initial mappings. For x ∈ ⟨x₁, …, xₙ⟩, Δ_init(x) = ε. For y ∈ ⟨y₁, …, yₘ⟩, Γ_init(y) = ε (since y₁, …, yₘ are not input variables to the DFG, they are initialized to ε). The property f(ε, …, ε) = ⟨ε, …, ε⟩ is true for all functions f. Therefore, for the initial mappings Γ_init and Δ_init, ⟨Γ_init(y₁), …, Γ_init(yₘ)⟩ = f(Δ_init(x₁), …, Δ_init(xₙ)).

Induction Hypothesis: Let Γ and Δ be a snapshot of the stream mappings during the execution of the DFG such that ⟨Γ(y₁), …, Γ(yₘ)⟩ = f(Δ(x₁), …, Δ(xₙ)).

Induction Case: Let Γ and Δ be a snapshot of the stream mappings such that the induction hypothesis is true, and let Γ′ and Δ′ be the snapshot after a single step of the execution takes place, given the snapshots Γ and Δ. If Δ(xᵢ) = Δ′(xᵢ) for all i ∈ [1, n], then a message updating y₁, …, yₘ was not processed (they can only be written by this node). Therefore Γ(yⱼ) = Γ′(yⱼ) for all j ∈ [1, m], and, assuming the induction hypothesis, ⟨Γ′(y₁), …, Γ′(yₘ)⟩ = f(Δ′(x₁), …, Δ′(xₙ)). If instead there exists an i ∈ [1, n] such that Δ′(xᵢ) = Δ(xᵢ) · k with k ≠ ε, then a message was processed. Note that this can only be true for a single i; for all j ≠ i, Δ(xⱼ) = Δ′(xⱼ). From the induction hypothesis, ⟨Γ(y₁), …, Γ(yₘ)⟩ = f(Δ(x₁), …, Δ(xₙ)), and from the small-step semantics, Γ′(yⱼ) = Γ(yⱼ) · rⱼ for all j ∈ [1, m], where ⟨r₁, …, rₘ⟩ are the new messages produced in response to k. Using the incremental-processing constraint on f, f(Δ′(x₁), …, Δ′(xₙ)) = ⟨Γ(y₁) · r₁, …, Γ(yₘ) · rₘ⟩ = ⟨Γ′(y₁), …, Γ′(yₘ)⟩. The small-step semantics therefore preserves this property, and by induction, for all ⟨y₁, …, yₘ⟩ ← f(x₁, …, xₙ) ∈ E, the equation ⟨Γ(y₁), …, Γ(yₘ)⟩ = f(Δ(x₁), …, Δ(xₙ)) is always true about Γ and Δ during any point within the execution. □

Lemma 4.2. Let ⟨y₁, …, yₘ⟩ ← f(x₁, …, xₙ) ∈ E. If Γ(xᵢ) is eventually closed for all i ∈ [1, n], then Γ(yⱼ) is eventually closed for all j ∈ [1, m].

Proof. If Γ(xᵢ) is closed for all i ∈ [1, n], and choice_f is non-empty unless Δ(xᵢ) is closed for all i, then the execution will eventually take steps to update Δ until Δ(xᵢ) is closed for all i ∈ [1, n]. When all inputs are closed, ⌈·⌉ dictates that all outputs will be closed as well. Using Theorem 4.1, each Γ(yⱼ) will be closed. □

Theorem 4.3. Eventually, for all variables x, ∃v. Γ(x) = v · ⊥, i.e., all variables will eventually be closed.

Proof. Let C be the set of variables which will be closed eventually. Note that I ⊆ C (all input variables to the DFG will eventually be closed). Using Lemma 4.2, for any node ⟨y₁, …, yₘ⟩ ← f(x₁, …, xₙ) ∈ E, if x₁, …, xₙ ∈ C, then y₁, …, yₘ ∈ C. Since the dataflow program contains no cycles and all variables are reachable from the input variables, eventually all variables are in C. □

Theorem 4.4. The dataflow program always terminates. Moreover, if Γ and Δ are the stream mappings when the DFG terminates, then Γ satisfies the constraint of Figure 3.

Proof.
The DFG terminates when all variables are closed. From Theorem 4.3, all variables will eventually be closed. The constraint of Figure 3 follows from Theorem 4.1, from all variables being closed when the DFG terminates, and from the properties of ⌈·⌉. □

This section formalizes the translations between the shell and our order-aware dataflow model. Given a shell script, the compiler starts by recursing on the AST, replacing subtrees in a bottom-up fashion with dataflow programs. Fig. 4 shows a relevant subset of shell syntax, adapted from Smoosh [Greenberg and Blatt 2020]. Intuitively, some shell constructs (such as pipes |) allow for the composition of the dataflow programs of their components, while others (such as ;) prevent it. Figure 5 shows the translation rules for some interesting constructs, and Figure 6 shows several auxiliary relations that are part of this translation. We denote compilation from a shell AST to a shell AST as c ↑ c′, and compilation to a dataflow program as c ↑ ⟨p, b⟩, where p is a dataflow program and b ∈ {bg, fg} denotes whether the program is to be executed in the background or the foreground.

The first two rules, CommandTrans and CommandId, describe the compilation of commands. The bulk of the work is done in cmd2node, which, when possible, defines a correspondence between a command and a dataflow node. The predicate pure indicates whether the command is pure, i.e., whether it only interacts with its environment by reading and writing to files. All commands that we have seen until now (grep, sort, uniq) satisfy this predicate. The relations ins and outs define a correspondence between a command's arguments and the node's inputs and outputs. We assume that a variable is uniquely identified by the file that it refers to; therefore, if two variables have the same name, then they also refer to the same file. Finally, the relation func extracts information about the execution of the command (such as its choice function) to be able to reconstruct it later on. Note that the four relations pure, ins, outs, and func act as axioms, and the soundness of our model and translations depends on their correctness. Prior work [Raghavan et al. 2020; Vasilakis et al. 2021] has shown how to construct such relations for specific commands using annotation frameworks, with PaSh providing annotations for more than 50 commands in POSIX and GNU Coreutils, two large and widely used sets of commands. The rule BackgroundDfg sets the background flag for the underlying dataflow program; if the operand of a & is not compiled to a dataflow program, then it is simply left as-is. The latter holds for all shell constructs: we currently only create dataflow nodes from single commands.

The next set of rules refer to the sequential composition operator ";". This operator acts as a dataflow barrier, since it enforces an execution ordering between its two operands. Because of that, it forces the dataflow programs that are generated from its operands to be optimized (with opt) and then compiled back to shell scripts (with ⇓). However, there is one case (SeqBothBg) where a dataflow region can propagate through a ";": when the first component is to be executed in the background. In this case, ";" does not enforce an execution-order constraint between its two operands, and the generated dataflow programs can be safely composed into a bigger one. The rules for "&&" and "||" are similar (omitted).
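For example, in the following script the first pipeline is backgrounded, so it is not ordered before the second, and the two dataflow programs can be composed into one; the file names are illustrative:

    # '&' both backgrounds the first pipeline and separates it from the second,
    # i.e., this is sequential composition whose first operand runs in the background.
    grep -c foo A.txt > o1 & grep -c bar B.txt > o2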
The relation compose unifies two dataflow programs by combining the inputs of one with the outputs of the other and vice versa. Before doing that, it ensures that the composed dataflow graph will be valid, by checking that there is at most one reader and one writer for each internal and output variable, as well as all the rest of the dataflow program invariants, e.g., the absence of cycles (§4). The remaining rules (not shown) introduce synchronization constraints and are not part of our parallelization effort; for example, we consider all branching operators as strict dataflow barriers.

Figure 7 presents the compilation ⇓ of a dataflow program p = ⟨I; O; E⟩ to a shell program. The compilation can be separated into a prologue, the main body, and an epilogue. The prologue creates a named pipe (i.e., Unix FIFO) for every variable in the program. Named pipes are created in a temporary directory using the mkfifo command, and are similar in behavior to ephemeral pipes, except that they are explicitly associated with a file-system identifier, i.e., they are a special file in the file system. Named pipes are used in place of ephemeral pipes (|) in the original script. The epilogue inserts a wait to ensure that all the nodes in the dataflow have completed execution, and then removes all named pipes from the temporary directory using rm. The design of the prologue-epilogue pair mimics how Unix treats ephemeral pipes, which correspond to temporary identifiers in a hidden file system. The main body expresses the parallel computation and can also be separated into three components. For each of the input variables x ∈ I, we add a command that copies the file f = file(x) to its designated pipe. Similarly, for all output variables y ∈ O we add a command that copies the designated pipe to the output file f = file(y) in the file system. Finally, we translate each node in E to a shell command that reads from the pipes corresponding to its input variables and writes to the pipes corresponding to its output variables. In order to correctly translate a node back to a command, we use the node-command correspondence functions (similar to the ones for ↑) that were used for the translation of the command to a node. Since a translated command might get its input from (or send its output to) a named pipe, we need to also add those as new redirections with in_out. For example, for a node y₃ ← f(x₁, x₂) that first reads x₁ and then reads x₂, where f corresponds to grep -f, the following command would be produced:

    grep -f p1 p2 > p3 &

In this section we define a set of transformations that expose data parallelism on a dataflow graph. We start by defining a set of helper DFG nodes and a set of auxiliary transformations to simplify the graph and enable the parallelization transformations. We then identify a property on dataflow nodes that indicates whether a node can be executed in a data-parallel fashion. We then define the parallelization transformations and conclude with a proof that applying all of the transformations preserves the semantics of the original DFG.

Before we define the parallelization transformations, we introduce several helper functions that can be used as dataflow nodes. The first function is split. split takes a single input variable (file or pipe) and sequentially splits it into multiple output variables. The exact size of the data written to each output variable is left abstract, since it does not affect correctness but only performance. Formally, ⟨y₁, …, yₙ⟩ ← split(x): if the input read so far is v = v₁ · v₂ ⋯ vᵢ with i ≤ n, then split(v) = ⟨v₁ · ⊥, v₂ · ⊥, …, vᵢ₋₁ · ⊥, vᵢ, ε, …, ε⟩, i.e., every chunk except the one currently being produced is closed, and the remaining outputs are still empty; when the input closes, the last chunk is closed as well.
The second function is cat, which coincidentally behaves the same as the Unix command cat: given a list of input variables, it combines their values, in order, and assigns the result to a single output variable. Formally, y ← cat(x₁, …, xₙ) is defined by cat(v₁ · ⊥, …, vᵢ₋₁ · ⊥, vᵢ, ε, …, ε) = v₁ · v₂ ⋯ vᵢ, with the output closed when all inputs are closed. The third function is tee, which behaves the same way as the Unix command tee, i.e., it copies its input variable to several output variables. Formally, ⟨y₁, …, yₙ⟩ ← tee(x) is defined by tee(v) = ⟨v, …, v⟩ and tee(v · ⊥) = ⟨v · ⊥, …, v · ⊥⟩. The final function is relay, which works as an identity function. Formally, y ← relay(x), with ∀v. relay(v) = v and ∀v. relay(v · ⊥) = v · ⊥.

Using these helper nodes, our compiler performs the set of auxiliary transformations depicted in Figure 8. Since relay acts as an identity function, any edge can be transformed to include a relay. Splitting in multiple stages to obtain n edges is the same as splitting in one step into n edges. Similarly, combining n edges in multiple stages is the same as combining n edges in a single stage. If we split an edge into n edges and then combine the n edges back, this behaves as an identity. A cat can be pushed past a tee by creating copies of the tee function. If a cat has a single incoming edge, we can convert it into a relay. If a split has a single outgoing edge, we can convert it into a relay. The first seven transformations can be performed both ways. The last transformation is one-way: a split after a cat can be converted into relays if the input arity of the cat is the same as the output arity of the split. The reverse transformation is not allowed because, together with the (Relay) rule, it would let us cat and then split any two or more streams in the dataflow graph, allowing the output of any function in our graph to be passed as an input to any other function and thus breaking the semantics of our dataflow graph.

The dataflow model exposes task parallelism, as each node can execute independently, only communicating with the other nodes through their communication channels. In addition to that, it is possible to achieve data parallelism by executing some nodes in parallel, partitioning part of their input. We are interested in nodes that produce a single output and consume their inputs in sequence (one after the other, as each is depleted), after having consumed the rest of their inputs as an initialization and configuration phase. Note that there are several examples of shell commands that correspond to such nodes, e.g., grep, sort, grep -f, and sha1sum. Let such a node be y ← f(x₁, …, xₖ₊ₘ), where w.l.o.g. x₁, x₂, …, xₖ represent the configuration inputs and xₖ₊₁, …, xₖ₊ₘ represent the sequentially consumed inputs. The consumption order of such a command is as follows: choice_f first returns the configuration indexes 1, …, k, and then blocks on each of the indexes k+1, k+2, …, k+m in turn, consuming each of those inputs until depletion. If we know that a command satisfies the above property, we can safely transform it into a node x ← cat(xₖ₊₁, …, xₖ₊ₘ) followed by a command y ← f′(x, x₁, …, xₖ), without altering the semantics of the graph.

Data Parallel Nodes: We now shift our focus to a subset of the sequential-consumption nodes, namely those that can be executed in a data-parallel fashion by splitting their inputs. These are nodes f that can be broken down into a parallel map m and an associative aggregate r. Formally, writing x̄c for the configuration inputs x₁, …, xₖ, these nodes have to satisfy the following equation: for all words v, w, f′(v · w, x̄c) = r(m(v, x̄c), m(w, x̄c)). We denote data parallel nodes as dp(f, m, r). An example of a node that satisfies this property is the sort command, where m = sort and r = sort -m. In addition to the above equation, a map function should not output anything when its input closes:
m(v, x̄c) = m(v · ⊥, x̄c). Note that m could have multiple outputs and be different from the original function f. As has been noted in prior research [Farzan and Nicolet 2017], this is important, as some functions require auxiliary information in the map phase in order to be parallelized. An important observation is that a subset of all data parallel nodes are completely stateless, meaning that m = f and r = cat, and are therefore embarrassingly parallel. We can now define a transformation on any data parallel node dp(f, m, r) that replaces it with a map followed by an aggregate. This transformation is formally shown in Figure 9. Essentially, all the sequentially consumed inputs (which are concatenated using cat) are given to different m nodes, the outputs of which are then aggregated using r while preserving the input order. Note that the configuration inputs have to be duplicated using tee to ensure that all parallel copies of m will be able to read them in case they are pipes and not files on disk. Using the auxiliary transformations, by adding a split followed by a cat before a data parallel node, we can always parallelize such nodes using the parallelization transformation.

We say that two node sets with the same input and output variables are equivalent, written ⟨S_in, S_out, N⟩ ≡ ⟨S_in, S_out, N′⟩, if executing them on the same inputs produces the same outputs, i.e., Γ_fin(y) = Γ′_fin(y) for all y ∈ S_out whenever Γ_init = Γ′_init, where Γ_init, Γ′_init are the initial mappings for N and N′ respectively, and Γ_fin, Γ′_fin are the mappings when N and N′ have completed their execution.

Theorem 6.1. Let p = ⟨I, O, E ∪ N⟩ and p′ = ⟨I, O, E ∪ N′⟩ be two dataflow programs. Let S_in be the set of input variables of the node set N (variables read in N but not assigned inside N). Let S_out be the set of output variables of the node set N (variables assigned in N but not read inside N). Let S′_in, S′_out be the input variables and output variables of N′. We assume S_in = S′_in and S_out = S′_out. If ⟨S_in, S_out, N⟩ is equivalent to ⟨S_in, S_out, N′⟩, then the program ⟨I, O, E ∪ N′⟩ is equivalent to ⟨I, O, E ∪ N⟩.

Proof. Given any initial mapping Γ_init, let Γ and Γ′ be the mappings when p and p′ complete their execution. For all x ∈ S_in, Γ(x) = Γ′(x), as there are no cycles in the dataflow graph and the subgraph which computes S_in is the same in both p and p′. Since ⟨S_in, S_out, N⟩ is equivalent to ⟨S_in, S_out, N′⟩, and Γ(x) = Γ′(x) for all x ∈ S_in, we have Γ(y) = Γ′(y) for all y ∈ S_out. The variables in S_out are the only variables assigned in N and N′ that are used in computing the values of the output variables O. Since the values of these variables are the same in both programs, given the same input mapping Γ_init, Γ(y) = Γ′(y) for all output variables y ∈ O. Therefore, the two programs are equivalent. □

Theorem 6.2. The transformations presented in Figure 8 and Figure 9 preserve program equivalence.

Proof. The (Relay) transformation preserves program equivalence because, when the program terminates, the value of its output variable is equal to the value of its input variable. Each of the remaining transformations transforms an input program ⟨I, O, E ∪ N⟩ to an output program ⟨I, O, E ∪ N′⟩. For all of these transformations, S_in = S′_in and S_out = S′_out (where S_in, S′_in, S_out, S′_out are defined as above). For the first seven transformations, the equivalence of the programs ⟨S_in, S_out, N⟩ and ⟨S_in, S_out, N′⟩ follows from the execution semantics of cat, relay, split, and tee, and from the properties of m and r for data parallel commands. The (Concat-Split) transformation relies on the additional property that the program produces the same output independently of how split breaks the input stream: the choice of a particular way of breaking the stream does not change the value of the program's output variables when it terminates. Since ⟨S_in, S_out, N⟩ is equivalent to ⟨S_in, S_out, N′⟩, these transformations preserve equivalence (Theorem 6.1). □

Our evaluation consists of two parts.
The first part is a case study of applying GNU Parallel to two scripts, demonstrating the difficulty of manually reasoning about parallel shell pipelines and the challenges that one has to address in order to achieve a parallel implementation. The second part demonstrates the performance benefits of our transformations on 47 unmodified shell scripts. Before discussing our evaluation, we offer a brief outline of the compiler implementation.

Implementation: We reimplement the compilation and optimization phases of PaSh [Vasilakis et al. 2021] according to our model and associated transformations. The new implementation is about 1500 lines of Python code and uses the order-aware dataflow model as the centerpiece intermediate representation. It is also more modular and facilitates the development of additional transformations, closely mirroring the back-and-forth shell-to-ODFM translations described in Section 5 and the parallelizing transformations described in Section 6. While we expect that most users would use PaSh by writing shell scripts, completely ignoring the ODFM, it is also possible to manually describe programs in the intermediate representation, enabling other frontend and backend frameworks to interface with it. By reimplementing PaSh's optimization phase to mirror our transformations, we also discovered and solved a bug in PaSh. The old implementation did not tee the configuration inputs of a parallelized command, but rather allowed all parallel copies to read from the same input. While this is correct if the configuration input is a file on disk, the semantics indicated that in the general case it leads to incorrect results, for example when this input is a stream, because all parallel commands then consume items from a single stream, each reading only a subset of them.

We describe an attempt to achieve data parallelism in two scripts using GNU Parallel [Tange 2011], a tool for running shell commands in parallel. We chose GNU Parallel because it compares favorably to other alternatives in the literature [Tange 2020], but note that GNU Parallel sits somewhere between an automated compiler, like PaSh and POSH, and a fully manual approach, illustrating only some of the issues that one might face while manually trying to parallelize their shell scripts.

Spell: We first apply parallel to Spell's first pipeline (§1):

    TEMP_C1="/tmp/{/}.out1"
    TEMP1=$(seq -w 0 $(($JOBS - 1)) | sed 's+^+/tmp/in+' | sed 's/$/.out1/' | tr '\n' ' ')
    TEMP1=$(echo $TEMP1)
    mkfifo $TEMP1
    parallel "cat {} | col -bx | tr -cs A-Za-z '\n' | tr A-Z a-z | \
        tr -d '[:punct:]' | sort > $TEMP_C1" ::: $IN &
    sort -m $TEMP1 |
        parallel -k --jobs ${JOBS} --pipe --block "$BLOCK_SIZE" "uniq" | uniq |
        parallel -k --jobs ${JOBS} --pipe --block "$BLOCK_SIZE" "grep -vx -f $dict -"
    rm $TEMP1

It took us a few iterations to get the parallel version right, leading to a few observations. First, despite its automation benefits, parallel still requires manual placement of the intermediate FIFO pipes and functions. Additionally, achieving ideal performance requires some tweaking: setting --block to 10K, 250K, and 250M yields widely different execution times of 27, 4, and 3 minutes respectively. Most importantly, omitting the -k flag in the last two fragments breaks correctness due to reordering related to scheduling non-determinism. These fragments are fortunate cases in which the -k flag has the desired effect, because their output order follows the same order as the arguments of the commands they parallelize.
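Even with -k, correctness is not guaranteed for every command. For instance, the stateful cat -n (mentioned earlier as an example of a non-commutative command) restarts its line numbering in every parallel instance, so splitting its input produces wrong results regardless of output ordering; the block size below is illustrative:

    seq 6 > f
    cat -n f                                  # sequential: lines numbered 1..6
    parallel -k --pipe --block 4 cat -n < f   # each chunk is renumbered from 1,
                                              # wrong output despite -k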
Other commands face harder problems, in that the correct output order is neither the argument order nor an arbitrary interleaving.

Set-difference: We apply parallel to Set-diff, a script that compares two streams using sort and comm:

    mkfifo s1 s2
    TEMP_C1="/tmp/{/}.out1"
    TEMP1=$(seq -w 0 $(($JOBS - 1)) | sed 's+^+/tmp/in+' | sed 's/$/.out1/' | tr '\n' ' ')
    TEMP1=$(echo $TEMP1)
    TEMP_C2="/tmp/{/}.out2"
    TEMP2=$(seq -w 0 $(($JOBS - 1)) | sed 's+^+/tmp/in+' | sed 's/$/.out2/' | tr '\n' ' ')
    TEMP2=$(echo $TEMP2)
    mkfifo ${TEMP1} ${TEMP2}
    parallel "cat {} | cut -d ' ' -f 1 | tr [:lower:] [:upper:] | sort > $TEMP_C1" ::

In addition to the issues highlighted in Spell, this parallel implementation has a subtle bug. GNU Parallel spawns several instances of grep -vx -f s2 that all read FIFO s2. When the first parallel instance exits, the kernel sends a SIGPIPE signal to the second sort -m. This forces sort to exit, in turn leaving the rest of the parallel grep -vx -f instances blocked waiting for new input. The most straightforward way we have found to address this bug is to remove (1) the "&" operator after the second sort -m, and (2) s2 from mkfifo. This modification sacrifices pipeline parallelism, as the first stage of the pipeline completes before executing grep -vx -f. The parallel pipeline modified for correctness completes in 4m54s. Our compiler does not sacrifice pipeline parallelism, by using tee to replicate s2 for all parallel instances of grep -vx -f (§2), and completes in 4m7s.

Methodology: We use three sets of benchmark programs from various sources, including GitHub, StackOverflow, and the Unix literature [Bentley 1985; Bentley et al. 1986; Bhandari 2020; Jurafsky 2017; McIlroy et al. 1978; Taylor 2004].

• Expert Pipelines: The first set contains 9 pipelines: NFA-regex, Sort, Top-N, WF, Spell, Difference, Bi-grams, Set-Difference, and Shortest-Scripts. Pipelines in this set contain 2-7 stages (mean: 5.2), ranging from a scalable CPU-intensive grep stage in NFA-regex to a non-parallelizable diff stage in Difference. These scripts are written by Unix experts: a few pipelines are from Unix legends [Bentley 1985; Bentley et al. 1986; McIlroy et al. 1978], one is from a book on Unix scripting [Taylor 2004], and a few are from top StackOverflow answers [Jurafsky 2017].

• Unix50 Pipelines: The second set contains 34 pipelines solving the Unix50 game [Labs 2019; McIlroy et al. 1978]. We found unofficial solutions to all-but-three problems on GitHub [Bhandari 2020], expressed as pipelines with 2-12 stages (mean: 5.58). They make extensive use of standard commands under a variety of flags, and appear to be written by non-experts; contrary to the previous set, they often use sub-optimal or non-Unix-y constructs. We execute each pipeline as-is, without any modification.

• COVID-19 Mass-Transit Analysis Pipelines: The third set contains 4 pipelines that were used to analyze real telemetry data from bus schedules during the COVID-19 response in one of Europe's largest cities [Tsaliki and Spinellis 2021]. The pipelines compute several statistics on the transit system per day, such as average serving hours per day and average number of vehicles per day. Pipelines range between 9 and 10 stages (mean: 9.2) and use typical Unix staples such as sed, awk, sort, and uniq.

We use our implementation of PaSh to parallelize all of the pipelines in these benchmark sets, working with three configurations:

• Baseline: Our compiler simply executes the script using a standard shell (in our case bash) without performing any optimizations. This configuration is used as our baseline.
Note that it is not completely sequential, since the shell already achieves pipeline and task parallelism based on | and &.

• No Cat-Split: Our compiler performs all transformations except Concat-Split. This configuration achieves parallelism by splitting the input before each command and then merging it back. It is used as a baseline to measure the benefits achieved by the Concat-Split transformation.

• Parallel: Our compiler performs all transformations. The Concat-Split transformation, which removes a cat with n inputs followed by a split with n outputs, ensures that data is not merged unnecessarily between parallel stages.

Experiments were run on a 2.1GHz Intel Xeon E5-2683 with 512GB of memory and 64 physical cores, Debian 4.9.144-3.1, GNU Coreutils 8.30-3, GNU Bash 5.0.3(1), OCaml 4.05.0, and Python 3.7.3. All pipelines are set to (initially) read from and (finally) write to the file system. For "Expert Pipelines", we use 10GB collections of inputs from Project Gutenberg [Hart 1971]; for "Unix50 Pipelines", we gather their inputs from each level in the game [Labs 2019] and multiply them up to 10GB. For "Bus Route Analysis Pipelines" we use the real bus telemetry data for the year 2020 (~3.4GB). The inputs for all pipelines are split into 16 equal-sized chunks, corresponding to the intended parallelism level.

Fig. 10 shows the execution times on all programs with 16× parallelism for all three configurations mentioned at the beginning of the evaluation. It shows that all programs achieve significant improvements with the addition of the Concat-Split transformation. The average speedup without Concat-Split over the bash baseline is 2.26×. The average speedup with the transformation is 6.16×. The figure on the right explains the differences in the effect of the transformation based on the kind of commands involved in the pipelines. It offers a correlation between sequential time and speedup, and shows that different programs that involve commands with similar characteristics (color) see similar speedups (y-axis). Programs containing only parallelizable commands see the highest speedup (10.4-14.5×). Programs with limited speedup either (1) contain sort, which does not scale linearly, (2) are not CPU-intensive, resulting in pronounced IO and constant costs, or (3) are deep pipelines, already exploiting significant pipeline-based parallelism. Programs with non-parallelizable commands see no significant change in execution time (0.9-1.3×). Finally, programs containing head have a very small sequential execution time, typically under 1s, and thus their parallel equivalents see a slowdown due to constant costs, still remaining under 1s.

Dataflow Graph Models: Graph models of computation where nodes represent units of computation and edges represent FIFO communication channels have been studied extensively [Dennis 1974; Kahn 1974; Kahn and MacQueen 1977; Karp and Miller 1966; Lee and Messerschmitt 1987a,b]. ODFM sits somewhere between Kahn Process Networks [Kahn 1974; Kahn and MacQueen 1977] (KPN), the model of computation adopted by Unix pipes, and Synchronous Dataflow [Lee and Messerschmitt 1987a,b] (SDF). A key difference between ODFM and SDF is that ODFM does not assume fixed item rates, a property used by SDF for efficient scheduling determined at compile time. Two differences between ODFM and KPNs are that (i) ODFM does not allow cycles, and (ii) ODFM exposes information about the input consumption order of each node.
This order provides enough information at compile time to perform parallelizing transformations while also enabling translation of the dataflow back to a Unix shell script. Systems for batch [Dean and Ghemawat 2008; Murray et al. 2013; Zaharia et al. 2012], stream [Gordon et al. 2006; Mamouras et al. 2017; Thies et al. 2002], and signal processing [Bourke and Pouzet 2013; Lee and Messerschmitt 1987a] provide dataflow-based abstractions. These abstractions are different from ODFM, which operates on the Unix shell, an existing language with its own peculiarities that have guided the design of the model. One technique for retrofitting order over unordered streaming primitives such as sharding and shuffling is to extend the types of elements using tagging [Arvind and Nikhil 1990; Arvind et al. 1984; Watson and Gurd 1979]. This technique would not work in the Unix shell, because (1) commands are black boxes operating on stream elements in unconstrained ways (but in a known order), and (2) data streams exchanged between commands contain flat strings, without support for additional metadata extensions, and thus with no obvious way to augment elements with tags. ODFM instead captures ordering on the edges of the dataflow graph, and leverages the consumption order of nodes (the choice function) in the graph to orchestrate execution appropriately. Synchronous languages [Berry and Gonthier 1992; Halbwachs et al. 1991; Le Guernic et al. 1986; Maraninchi and Rémond 2001] model stream graphs as circuits where nodes are state machines and edges are wires that carry a single value. Lustre [Halbwachs et al. 1991] is based on a dataflow model that is similar to ours, but its focus is different, as it is not intended for exploiting data parallelism.

Semantics and Transformations: Prior work proposes semantics for streaming extensions to relational query languages based on dataflow [Arasu et al. 2006; Li et al. 2005]. In contrast to our work, it focuses on transformations of time-varying relations. More recently, there has been significant work on the correct parallelization of distributed streaming applications by proposing sound optimizations and compilation techniques [Hirzel et al. 2014; Schneider et al. 2013], type systems [Mamouras et al. 2019], and differential testing [Kallas et al. 2020]. These efforts aim at producing a parallel implementation of a dataflow streaming computation using techniques that do not require knowledge of the order of consumption of each node, a property that is very important in our setting. Recent work proposes a semantic framework for stream processing that uses monoids to capture the type of data streams [Mamouras 2020]. That work mostly focuses on generality of expression, showing that several already-proposed programming models can be expressed on top of it. It also touches upon soundness proofs of optimizations using algebraic reasoning, which is similar to our approach.

Divide and Conquer Decomposition: Prior work has shown the possibility of decomposing programs or program fragments using divide-and-conquer techniques [Farzan and Nicolet 2017, 2019; Rugina and Rinard 1999; Smith and Albarghouthi 2016]. The majority of that work focuses on parallelizing special constructs, e.g., loops, matrices, and arrays, rather than stream-oriented primitives. Techniques for automated synthesis of MapReduce-style distributed programs [Smith and Albarghouthi 2016] can be of significant aid for individual commands.
Parallel Shell Scripting: Tools exposing parallelism on modern Unixes, such as qsub [Gentzsch 2001], SLURM [Yoo et al. 2003], AMFS [Zhang et al. 2013], and GNU Parallel [Tange 2011], are predicated upon explicit and careful orchestration by their users. Similarly, several shells [Duff 1990; McDonald and Dix 1988; Spinellis and Fragkoulis 2017a; Walker et al. 2009] add primitives for non-linear pipe topologies, some of which target parallelism. Here too, however, users are expected to manually rewrite their scripts to exploit these primitives without jeopardizing correctness.
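As a concrete, hedged example of this burden (our own construction using standard GNU Parallel flags; access.log is a placeholder input):

    # --pipe splits stdin into blocks and runs one grep per block, but
    # each job's output is printed as the job finishes, so blocks can
    # appear out of input order.
    cat access.log | parallel --pipe grep ERROR > out.txt

    # Reproducing the sequential output requires remembering -k
    # (--keep-order), which emits job outputs in input order.
    cat access.log | parallel --pipe -k grep ERROR > out.txt

Forgetting a single flag silently changes the output order; this is exactly the class of mistake that mechanical, proven transformations rule out.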
Our work is inspired by PaSh [Vasilakis et al. 2021] and POSH [Raghavan et al. 2020], two recent systems that use command annotations to parallelize and distribute shell programs by operating on dataflow fragments of the Unix shell. Our work is tied to PaSh [Vasilakis et al. 2021], as it (i) uses PaSh's annotation framework for instantiating the correspondence of commands to dataflow nodes (§5.1), and (ii) serves as its formal foundation, since it reimplements all of PaSh's parallelizing transformations and proves them correct. POSH [Raghavan et al. 2020] also translates shell scripts to dataflow graphs and optimizes them for performance, but its goal is to offload commands close to their data in a distributed environment. Thus, POSH performs only limited parallelization transformations, focusing instead on the scheduling problem of determining where to execute each command. It parallelizes only commands that require a concatenation combiner, i.e., a subset of the transformations that we prove correct in this work, and thus replacing its intermediate representation with our ODFM would be possible. POSH also proposes an annotation framework that captures several command characteristics. Some of these characteristics, such as parallelizability, are also captured by PaSh. Others relate to the scheduling problem, for example, whether a command such as grep produces output smaller than its input, making it a good candidate for offloading close to the input data.

POSIX Shell Semantics: Our work depends on Smoosh, an effort to formalize the semantics of the POSIX shell [Greenberg and Blatt 2020]. Smoosh focuses on POSIX semantics, whereas our work introduces a novel dataflow model in order to transform programs and prove the correctness of parallelization transformations on them. One of the Smoosh authors has also argued for making concurrency explicit via shell constructs [Greenberg 2018]. That work is different from ours, since it focuses on the capabilities of the shell as an orchestration language and does not deal with the data parallelism of pipelines.

Parallel Userspace Environments: By focusing on simplifying the development of distributed programs, a plethora of environments inadvertently assist in the construction of parallel software. Such systems [Barak and La'adan 1998; Mullender et al. 1990; Ousterhout et al. 1988], languages [Killian et al. 2007; Sewell et al. 2005; Virding et al. 1996], and system-language hybrids [Epstein et al. 2011; Pike et al. 1990; Vasilakis et al. 2015] hide many of the challenges of dealing with concurrency, as long as developers leverage the provided abstractions, which are strongly coupled to the underlying operating or runtime system. Even when these efforts are shell-oriented, as with Plan 9's rc, they are backward-incompatible with the Unix shell, and often focus primarily on hiding the existence of a network rather than on modelling data parallelism.

Command Annotations: The translation of shell scripts to dataflow programs relies on knowledge of command characteristics, such as the predicate pure, which indicates that a command performs no side effects other than writing to a set of output files (§5.1). In the current implementation, this information is acquired through the annotation language and annotations provided by PaSh [Vasilakis et al. 2021]. An interesting avenue for future research is to explore analyses for inferring or checking the annotations of commands. Such work could help extend the set of supported commands, which currently requires manual effort. Furthermore, it would be interesting to explore extensions to the annotation language that enable additional optimizations; for example, commands that are commutative and associative could be parallelized more efficiently, by relaxing the requirement on input order and better utilizing the underlying resources.

Directly accessing the IR in the implementation: As described earlier (§7), our implementation currently allows manually developing programs in the ODFM intermediate representation. However, this interface is not convenient for end users, since it requires manually instantiating each node of the graph with the necessary command metadata, e.g., inputs and outputs. Designing different frontends that interface with this IR is interesting future work: for example, a compiler from the language proposed by dgsh [Spinellis and Fragkoulis 2017a], a shell that supports extended syntax for creating DAG pipelines. The IR could also act as an interface for different backends, for example one that implements ODFM in a distributed setting.

Parallel Script Debugging: Debugging standard shell pipelines can be hard, and it usually requires several iterations of trial and error until the user gets the script right. Our approach does not make the debugging experience any worse, since the system produces as output a parallel shell script, which can be inspected and modified like any standard shell script (as seen in §3). For example, a user could debug a script by removing a few stages of the parallel pipeline, or by redirecting some intermediate outputs to permanent files for inspection. This is possible because of the expressiveness of ODFM and the existence of a bidirectional transformation between dataflow programs and shell scripts, which allows the compiler to simply use a standard shell such as bash as its backend. An approach that is particularly helpful, and which we have used ourselves, is to ask the compiler to add a relay node between every two nodes of the graph and instantiate it with an identity command that copies its input both to its output and to a log file. This allows for stream introspection without affecting the behavior of the pipeline, facilitating debugging since the user can inspect all intermediate outputs at once.
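In shell terms, such a relay is essentially tee: a compiler-inserted relay between two stages reduces to something like the sketch below, where stage1 and stage2 stand for arbitrary pipeline commands and relay-01.log is a name chosen for illustration:

    # The relay forwards its input unchanged while recording a copy,
    # so introspection does not perturb the pipeline's behavior.
    stage1 < input.txt | tee relay-01.log | stage2 > output.txt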
Stream Finiteness and Extensions: In our current model, parallelism is achieved by partitioning a finite stream, processing the partitions in parallel, and merging the results. Like PaSh and POSH, our model is designed to support terminating computations over large but finite data streams. All of the data-processing scripts that we have encountered conform to this model and are terminating. One way to extend our work to support and parallelize infinite streams, such as the ones produced by yes and tail -f, would involve repeated applications of partitioning, processing, and merging.

We presented an order-aware dataflow model for exploiting the data parallelism latent in Unix shell scripts. The model accurately captures the semantics of complex Unix pipelines: the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We used the model to capture the semantics of transformations that exploit the data parallelism available in Unix shell computations, and proved their correctness. We additionally formalized the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implemented our model and transformations as the compiler and optimization passes of PaSh, a system parallelizing shell pipelines, and used it to evaluate the speedup achieved on 47 data-processing pipelines. While the shell has been mostly ignored by the research community for most of its 50-year lifespan, recent work [Greenberg and Blatt 2020; Greenberg et al. 2021a,b; Raghavan et al. 2020; Spinellis and Fragkoulis 2017b; Vasilakis et al. 2021] indicates renewed community interest in shell-related research. We view our work partly as providing the missing correctness piece of the shell-optimization work done by the systems community [Raghavan et al. 2020; Vasilakis et al. 2021], and partly as a stepping stone for further studies on the dataflow fragment of the shell, e.g., the development of more elaborate transformations and optimizations.

We thank Konstantinos Mamouras for preliminary discussions that helped spark an interest for this work, Dimitris Karnikis for help with the artifact, Diomidis Spinellis for benchmarks and discussions, Michael Greenberg and Jiasi Shen for comments on the presentation of our work, the anonymous ICFP reviewers and our shepherd Rishiyur Nikhil for extensive feedback, and the ICFP artifact reviewers for comments that significantly improved the paper artifact. This research was funded in part by DARPA contracts HR00112020013 and HR001120C0191, and NSF award CCF 1763514. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DARPA or other agencies.

References
The CQL continuous query language: semantic foundations and query execution
Executing a program on the MIT tagged-token dataflow architecture
The tagged token dataflow architecture
The MOSIX multicomputer operating system for high performance cluster computing
Programming Pearls: A Spelling Checker
Programming Pearls: A Literate Program
The Esterel synchronous programming language: Design, semantics, implementation
Solutions to unixgame
Zélus: A synchronous language with ODEs
MapReduce: Simplified Data Processing on Large Clusters
First Version of a Data Flow Procedure Language
Rc: A shell for Plan 9 and Unix systems
Towards Haskell in the Cloud
Synthesis of Divide and Conquer Parallelism for Loops
Modular Divide-and-Conquer Parallelization of Nested Loops
Sun grid engine: Towards creating a compute power grid
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
The POSIX shell is an interactive DSL for concurrency
Executable Formal Semantics for the POSIX Shell
Smoosh: the Symbolic, Mechanized, Observable, Operational Shell
The Future of the Shell: Unix and Beyond
Unix Shell Programming: The Next 50 Years
The synchronous data flow programming language LUSTRE
A catalog of stream processing optimizations
Unix for Poets
The Semantics of a Simple Language for Parallel Programming
Coroutines and Networks of Parallel Processes. Information Processing
DiffStream: differential output testing for stream processing programs
Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing
Mace: Language Support for Building Distributed Systems
The Unix Game: Solve puzzles using Unix pipes
Signal: A data flow-oriented language for signal processing
Static scheduling of synchronous data flow programs for digital signal processing
Synchronous data flow. Proc. IEEE
Semantics and evaluation techniques for window aggregates in data streams
Semantic Foundations for Deterministic Dataflow and Stream Processing
StreamQRE: Modular specification and efficient evaluation of quantitative queries over streaming data
Data-Trace Types for Distributed Stream Processing Systems
Argos: an automaton-based synchronous language
Support for graphs of processes in a command interpreter
Amoeba: A distributed operating system for the 1990s
Naiad: A Timely Dataflow System
The Sprite network operating system
POSH: A Data-Aware Shell
Automatic Parallelization of Divide and Conquer Algorithms
Safe data parallelism for general streaming
Acute: High-level Programming Language Design for Distributed Computation
MapReduce Program Synthesis
Extending Unix Pipelines to DAGs
Extending Unix Pipelines to DAGs
GNU Parallel: The Command-Line Power Tool. ;login: The USENIX Magazine
Differences Between GNU Parallel and Alternatives
Wicked Cool Shell Scripts: 101 Scripts for Linux, Mac OS X, and Unix Systems
StreamIt: A language for streaming applications
The real statistics of buses in Athens
PaSh: Light-Touch Data-Parallel Shell Processing
From Lone Dwarfs to Giant Superclusters: Rethinking Operating System Abstractions for the Cloud
Concurrent Programming in ERLANG (2nd Ed.)
Composing and Executing Parallel Data-Flow Graphs with Shell Pipes
A prototype data flow computer with token labelling
Slurm: Simple Linux utility for resource management
Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing
Parallelizing the Execution of Sequential Scripts

Fig. 10 (x-axis benchmark labels): nfa-regex, sort, top-n, wf, spell, bi-grams, differences, set-difference, shortest-scripts, unix50-0 through unix50-32, …