key: cord-0654614-8taxemr6 authors: Sena, Luiz; Song, Xidan; Alves, Erickson; Bessa, Iury; Manino, Edoardo; Cordeiro, Lucas; Filho, Eddie de Lima title: Verifying Quantized Neural Networks using SMT-Based Model Checking date: 2021-06-10 journal: nan DOI: nan sha: 8cc44cb722704bc185b26583b932205f04bc8ce3 doc_id: 654614 cord_uid: 8taxemr6 Artificial Neural Networks (ANNs) are being deployed for an increasing number of safety-critical applications, including autonomous cars and medical diagnosis. However, concerns about their reliability have been raised due to their black-box nature and apparent fragility to adversarial attacks. These concerns are amplified when ANNs are deployed on restricted systems, which limit the precision of mathematical operations and thus introduce additional quantization errors. Here, we develop and evaluate a novel symbolic verification framework using software model checking (SMC) and satisfiability modulo theories (SMT) to check for vulnerabilities in ANNs. More specifically, we propose several ANN-related optimizations for SMC, including invariant inference via interval analysis, slicing, expression simplifications, and discretization of non-linear activation functions. With this verification framework, we can provide formal guarantees on the safe behavior of ANNs implemented both in floating- and fixed-point arithmetic. In this regard, our verification approach was able to verify and produce adversarial examples for $52$ test cases spanning image classification and general machine learning applications. Furthermore, for small- to medium-sized ANNs, our approach completes most of its verification runs in minutes. Moreover, in contrast to most state-of-the-art methods, our approach is not restricted to specific choices regarding activation functions and non-quantized representations. 
Our experiments show that our approach can analyze larger ANN implementations and substantially reduce the verification time compared to state-of-the-art techniques that use SMT solving. Artificial neural networks (ANNs) are soft computing models usually employed for regression, machine learning, decision-making, and pattern recognition problems [13] , which have been recently used to perform various safety-critical tasks. For instance, ANNs are employed for Covid-19 diagnosis [76] , and for performing steering commands in self-driving cars [98] . Unfortunately, in such contexts, incorrect classifications can cause serious problems. Indeed, adversarial disturbances can make ANNs misclassify objects, thus causing severe damage to users of safety-critical systems. For instance, Eykholt et al. [33] showed that noise and disturbances, such as graffiti on traffic signals, could result in target misclassification during the operation of computer vision systems. Moreover, given that ANNs are notorious for being difficult to interpret and debug, the whole scenario becomes even more problematic [68] , which calls for techniques able to assess their structures and verify their results and behaviors. For this reason, there is a growing interest in verification methods for ensuring safety, accuracy, and robustness of neural networks. The approaches for ANN verification may be divided into three groups: optimization [34, 48, 84, 93] , reachability [49, 54, 62, 92, 96, 100] , and satisfiability [47, 51, 61, 75] . On the one hand, optimization-based algorithms pose the safety verification problem as an optimization one, in which safety properties are usually treated as constraints, as described by Tjeng et al. [91] . 
The main difficulty of optimization methods, such as mixed-integer linear programming [15, 91, 93] , branch and bound [84] , and semi-definite programming [34] , is to deal with constraints that are non-linear and non-convex due to a network's complex structure and its activation functions. Indeed, it is still possible to employ dual optimization for simplifying those constraints and then obtain a convex problem [32] ; however, completeness tends to be lost due to relaxations. On the other hand, reachability-based approaches aim at computing the reachable set of an ANN by propagating input sets through it, layer-by-layer, while checking whether any unsafe state (violation) belongs to that same reachable set. The main advantage of those methods is that they are usually sound, i.e., if the algorithm indicates that a network is unsafe, its safety property is violated. However, the computational cost of computing exact reachable sets becomes unreasonable for more complex ANNs and more extensive input spaces. In order to avoid such a problem, a reachable set is over-approximated by using symbolic [61, 62, 96] and/or set-theoretic methods [92, 100] . Although those tools effectively reduce the computational cost of computing reachable sets, it is still challenging to over-approximate an ANN's non-linear elements, particularly its activation functions. There are some symbolic techniques suitable for dealing with over-approximation of activation functions [49, 54] ; however, most of the approaches available in the literature are only able to approximate piecewise-linear and rectified linear unit (ReLU) activation functions. Finally, satisfiability-based approaches encode both the ANN and the desired safety property into a single logic formula using a decidable fragment of first-order logic, via satisfiability modulo theories (SMT), and then check whether a counterexample exists. In this regard, only binarized neural networks [52, 81] can be encoded into Boolean logic and verified with existing SAT solvers [20, 75] . 
More complex ANNs, whether implemented in floating- or fixed-point arithmetic [63, 66] , the latter aiming at efficiency and simplicity, require the use of first-order logic instead of propositional logic to exploit more abstract and less expensive techniques to solve the problem at hand. For example, SMT solvers often integrate a simplifier, which applies standard algebraic reduction rules and contextual simplification to simplify the logical formula. Regarding these, several SMT-based approaches have been proposed [5, 39, 47, 51, 61, 62, 80] . While SMT background theories allow those approaches to model the semantics of neural operations exactly using word-level theories, the resulting verification problem is challenging to solve [80] . In this respect, quantization, i.e., a representation with a lower number of bits, has been proven to make this problem computationally even harder [47] . As a consequence, most existing approaches specialize in simple piecewise-linear activation functions [5, 39, 61] , focus on the floating-point scenario only [62] , or require domain-specific abstractions [51] . Against this background, we propose a novel approach to verify both fixed- and floating-point ANN implementations. Our main idea is to look at the source code of an ANN rather than the abstract mathematical model behind it. By doing so, we can then leverage many recent advances in software verification that can dramatically increase the computational efficiency of verification processes, as observed in our experimental evaluation. More specifically, in this paper, we make the following original contributions: • We cast the ANN verification problem into a software verification one. On the one hand, we propose a method to represent ANN safety properties as pairs of assume and assert instructions. 
On the other hand, we explain how to represent fixed- and floating-point operations in a quantized ANN, using direct implementations of their behavior, i.e., representations that consider a target precision.
• We introduce several pre-processing steps to increase the efficiency of downstream software verification tools. Namely, we give a principled method to discretize non-linear activation functions and replace them with lookup tables. Furthermore, we show how to bound the feasible range of each variable with interval analysis and how to represent those bounds with additional assume instructions.
• We detail which existing techniques for search-space reduction can be borrowed from the software verification literature, and we empirically evaluate their individual and cumulative effects.
• We evaluate our approach on fixed- and floating-point ANNs and give empirical evidence of its computational efficiency. In particular, we show that we can verify ANNs with hundreds of neurons in less than an hour.
• We compare our approach with state-of-the-art (SOTA) techniques, including quantized and floating-point tools. In this comparison, since our method applies various optimization techniques before invoking the SMT solver, it performs better than other SMT-based verification tools.
Outline. In Section 2, we introduce the ANN verification problem and present existing satisfiability modulo theories. In Section 3, we detail all the steps involved in our code-level verification approach for ANNs. In Section 4, we empirically test our approach on ANN classifiers trained on the classic Iris dataset and an image recognition dataset. In Section 5, we give a broader review of recent trends in verifying ANNs. In Section 6, we conclude and outline possible future work. Before introducing the details of our verification approach, let us review some important concepts related to the verification of artificial neural networks. 
Modern ANNs are universal function approximators built by composing multiple copies of the same basic building block, called a neuron [13] . In other words, they provide a way of constructing system models from a set of sample observations, in such a way that the joint behavior of the existing neurons is correctly adjusted. In their most common form, each neuron is itself the composition of two functions, as illustrated in Fig. 1 . The first one is an affine projection of the local inputs, often referred to as the activation potential u. The second one is a non-linear transformation of the resulting potential, often referred to as the activation function N. Together, they define the following mapping y : R^n → R:

y(x) = N(u(x)), (1)

where

u(x) = sum_{i=1}^{n} w_i x_i + b. (2)

Finally, the bias b provides a way of directly shifting a given activation function. The behavior of the basic neuron in Fig. 1 depends on the values of its weights w_i and also on the chosen activation function N. In this regard, researchers have experimented with a wide range of functions, including non-monotonic [78, 88] , non-continuous [13] , and unbounded ones [46, 74] . In our experiments, which are available in Section 4, we cover the most popular activation functions: namely, ReLU, sigmoid (Sigm), and the re-scaled version of the latter known as hyperbolic tangent (TanH):

N_ReLU(u) = max(0, u), (3)
N_TanH(u) = (e^u − e^{−u}) / (e^u + e^{−u}), (4)
N_Sigm(u) = 1 / (1 + e^{−u}). (5)

At the same time, one may notice that many state-of-the-art verification tools for ANNs are only compatible with ReLU and similar piece-wise linear activation functions [5, 39, 61] . Moreover, those that do support more activation functions [49, 54, 62] often incur a significant performance hit when solving the resulting non-linear verification problem. In contrast, the discretization technique we propose in Section 3.3 allows us to efficiently verify ANNs with any form of activation function. Besides, our verification methodology is general enough to be applied to a large variety of ANN architectures. 
Specifically, we support any feedforward, convolutional [65] , recurrent [42] , and graph neural network [99] that is built from the composition of the basic neuron model in Fig. 1 . Similar to what has been reported in existing ANN verification studies [10, 51, 61] , the primary factor influencing our verification time is the number of non-linearities in a neural network, rather than its architecture (see Section 4). As the deployment of ANNs in software applications becomes widespread, concerns about the power consumption and complexity of large models increase. In this light, one of the main techniques to reduce the energy requirements of ANN inference is quantization [63] , which further restricts the operations required to compute the output of each neuron (see Equations 1 and 2) to integer [66] or even binary representations [52, 81] . State-of-the-art methods to perform such a transformation significantly improve the low-power operation of ANNs while retaining the original predictive accuracy [45] . At the same time, the discretized nature of quantized neural networks (QNNs) generates unique challenges regarding their verification [47] . More specifically, the output and intermediate computations performed by such a network may differ from their floating-point counterparts. Thus, verification tools that operate on non-quantized ANNs may return incorrect results. We demonstrate this with the following motivating example. Assume that we want to verify the neural network in Fig. 2 , which relies on the ReLU activation function and whose output f(x_1, x_2) can be directly computed from its weights. Furthermore, assume that, in our example application, the output of this ANN must never fall below 2.7, i.e., f(x_1, x_2) ≥ 2.7, and that we want to verify whether this holds for the input (x_1, x_2) = (0.749, 0.498). Now, if we run an experiment with real numbers R (from the mathematical domain), the result is f(0.749, 0.498) = 2.745, which satisfies our safety property f(x_1, x_2) ≥ 2.7. 
However, if the same ANN is quantized to a lower precision, this is not the case anymore. Indeed, for a QNN with 4-bit integer and 6-bit fractional precision, its output becomes f̂(0.749, 0.498) = 2.6867, which violates our property. It is worth mentioning that such discrepancies can be even worse when larger ANNs are employed, due to cumulative errors in long computation chains. Thus, in our verification approach, we make sure that the actual implementation model used in an ANN is captured (see Section 3.2). Besides, we could formulate another research question of interest: what is the deepest quantization that can be applied to a given ANN so that it still makes correct decisions? This way, for instance, we would be able to target heavily restricted devices while still keeping the implementation correct based on formal guarantees. Although that is not the focus of the present work, it provides a first step towards that goal. Moreover, it paves the way for a complete verification framework suitable for ANN implementations in embedded devices. Let us now formalize the concept of safety property we briefly mentioned in the previous Section 2.2. In general, a safety property defines the set of states that a system is designed to reach safely. In software verification, such properties are usually defined according to a user's domain knowledge, which allows them to state which program behaviors are safe [3] . In ANN verification, the black-box nature of the associated computation means that safety properties are usually defined on the inputs and outputs alone [50, 67] . In this paper, we often refer to safety properties in the following form: x ∈ H =⇒ f(x) ∈ G, where x is an input vector, H is an input region, f(x) is the corresponding output, and G is an output region. However, one may notice that our verification method supports any safety property that can be expressed in first-order logic (see Section 3.4). 
A powerful and general way to define an input region H is choosing a center point x ∈ D in the input domain D, and letting the set H(x, ε) cover the whole neighborhood of points around it that are within a given distance d(x, x′) ≤ ε [50, 67] . As an example, in the field of image classification, robustness properties are defined in this way [90] . For continuous input domains D ≡ R^n, such a distance is often defined in terms of the family of ℓ_p-norms as follows:

d_p(x, x′) = (sum_{i=1}^{n} |x_i − x′_i|^p)^{1/p},

where p = 1 is the Manhattan distance and p = 2 is the Euclidean distance. Furthermore, this definition can be extended to p = ∞ by introducing the so-called infinity or maximum norm d_∞(x, x′) = max_i |x_i − x′_i|. Note that input regions defined through ℓ_∞ can be described by a set of linear constraints, a fact that makes them attractive to the verification community for efficiency reasons [61, 87, 96] . Also, input vectors can be re-scaled using a diagonal matrix M, allowing us to define hyper-ellipsoids (if p = 2) and hyper-rectangles (if p = ∞) in the input space: d_p(x, x′) = ∥M(x − x′)∥_p. Moreover, further attention is required if the input domain D is discrete in nature, for instance, in natural language processing (NLP) applications. However, a mapping to a continuous space is often available [57] . Once we establish a definition for the input set H in (7), we can complete the definition of our safety property by choosing the corresponding output set G [67] . For regression tasks, we can again define a safe neighborhood around an output point f(x) within a given distance d(f(x), f(x′)) ≤ ε, ∀x′ ∈ H(x, ε). For classification tasks, the output set G often comprises all points that assign the highest score to the desired class, e.g., G ≡ {y | (y = f(x), ∀x ∈ D) ∧ (y_c > y_j, ∀j ≠ c)} for output class c. In Section 3.4, we show how to define this kind of safety property inside our verification tool. Once we have defined a safety property, according to (7), we need to verify that it always holds for our (quantized) neural network. 
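As a small illustration of why ℓ_∞ regions reduce to linear constraints, the following sketch (names are ours) tests membership in H(c, ε): the maximum norm decomposes into one independent bound per coordinate, |x_i − c_i| ≤ ε, which is exactly a pair of linear constraints per input variable.

```c
#include <math.h>
#include <stddef.h>

/* Membership test for the l_inf ball H(c, eps): max_i |x[i] - c[i]| <= eps,
 * checked as n independent per-coordinate (linear) constraints. */
int in_linf_ball(const double *x, const double *c, size_t n, double eps) {
    for (size_t i = 0; i < n; i++)
        if (fabs(x[i] - c[i]) > eps)
            return 0;
    return 1;
}
```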
As we mentioned in Section 1, there exist many approximate techniques to do so. However, in this paper, we focus on bit-precise verification via satisfiability modulo theories (SMT) solvers [9] . Similar to Boolean Satisfiability (SAT) solving [94] , the SMT approach to verification works by converting the verification problem at hand into a logic formula and then checking whether it is satisfiable. However, SMT extends SAT beyond Boolean logic and allows us to model a verification problem in a decidable subset of first-order logic. At the same time, the interpretation of these models is restricted to a combination of background theories, which are written in first-order logic with equality. More formally, given a first-order formula φ, encoding a verification problem, and a background theory T, we say that φ is T-satisfiable if and only if there exists an assignment satisfying the union T ∪ {φ}. The modeling power of SMT comes from the variety of background theories that we can use. Those theories model the semantics of common mathematical objects like real, floating-point, and integer numbers, arrays, lists, bit vectors, and the operations defined on them for computational problems [8] . While the modeling capabilities of SMT are still being extended to new domains (e.g., the work of de Salvo Braz [27] ), mainstream SMT solvers (e.g., Z3 [26] , CVC4 [6] , and Boolector [16] ) already offer native support for all the theories above. SMT approaches have been applied to an extensive range of verification problems [9] . In this section, we review existing approaches for ANNs and QNNs. One may notice that, due to the SMT paradigm's flexibility, such approaches vary in the abstraction level at which they tackle a verification problem. Early research applied existing SMT solvers to the verification of real-valued ANNs and showed some difficulties in scaling beyond toy examples [80] . More recently, Katz et al. 
proposed to extend the background theory of real numbers and include an extra predicate for the ReLU activation function [61] . Since each ReLU doubles the number of verification formulas, they introduced a dedicated lazy solver, called Reluplex, which only visits a relevant subset of formulas. Their algorithm has been subsequently extended to arbitrary piecewise-linear activation functions [62] . An alternative approach by Huang et al. asks a user to define a problem-dependent set of micromanipulations that the SMT solver can chain to search a state space [51] . With this approach, they can scale to medium-sized ANNs for image classification. Furthermore, verification approaches based on real number computation can be easily extended to cover floating-point implementations of ANNs [61, 87] . In contrast, SMT methods to verify QNNs have to contend with a more challenging computational problem, from the theoretical perspective [47] . In this respect, Giacobbe et al. chose to represent QNN operations with the bit-vector background theory and showed that the associated verification results can be very different from their real and floating-point counterparts [39] . Similarly, Baranowski et al. proposed a new fixed-point background theory and tested it on some small QNNs [5] . In general, low-level optimizations in SMT encoding of QNNs are shown to speed up verification processes considerably [39, 47] . In the extreme case of binarized neural networks, where quantization only allows two binary states for each variable, a verification problem can be reduced to SAT solving [75] . In addition, hardware-level optimizations are crucial for efficiency too [20] . In summary, our methodology is a generalization of the previous work by Sena et al. focused on SMT verification of CUDA implementations of ANNs [86] . As we expound in Section 3, we take advantage of existing techniques in software verification to model both ANNs and QNNs as SMT formulas. 
Our novelty lies in the encoding of fixed-point operations and the efficient treatment of non-linear activation functions, which allows us to verify networks beyond the simple ReLU function. While we usually think of neural networks as mathematical models, their implementation is actually written in source code, in a given language. Thus, in this respect, neural networks can be treated like any other piece of software. The advantage of this strategy is twofold. First, we can readily adapt many existing software verification techniques to ANNs and QNNs. Second, we give users access to these highly technical verification tools in a familiar coding framework. This section lists the sequence of steps required to verify ANNs in such a way. To this end, we assume that an ANN is given as input in the form of a piece of single-threaded C code (see Section 3.1). Furthermore, we explain how to represent a quantized ANN by calling our finite-word-length (FWL) implementation models, which are discussed in Section 3.2. Likewise, we show how to discretize each activation function with the algorithm in Section 3.3. Once the code has been prepared in this way, the user can specify the desired safety property with assume and assert statements, as detailed in Section 3.4. Then, we compute a reachable set of values for each variable, using the invariant inference techniques in Section 3.5. Finally, we verify the safety property via SMT model checking, as explained in Section 3.6. All the techniques we use to reduce the search space of the SMT solver are listed in Sections 3.7 and 3.8. Our whole verification methodology is summarized in Fig. 3 . Furthermore, we conclude in Section 3.9 with a complete walk-through example of our workflow. The current mainstream approach to ANN development uses high-level machine learning libraries such as TensorFlow and PyTorch to define the architecture of the neural network and train its weights. 
Once the development phase is over, a final implementation of the ANN is produced, targeting a specific computer architecture, e.g. AMDx64, CUDA-enabled GPUs [77] , embedded systems running on FPGAs [101] , etc. Depending on the application, these implementations are optimized with several objectives in mind, ranging from speed of inference to energy consumption and memory required [82, 95] . In this paper, we use the C language as an abstraction of all these possible system realizations, with the addition of implementation models to represent fixed-point arithmetic (see Section 3.2). Furthermore, we limit our scope to sequential code, and leave the verification of concurrent implementations of ANNs (e.g. CUDA) for future work. At the same time, given the mathematical model of a specific ANN (see Section 2.1), there exist multiple possible sequential implementations of it. This is because neural networks are highly parallel, since the output of all neurons in a single layer can be computed independently. Furthermore, the activation potential of each neuron (see Equation 2 ) is the result of a sequence of multiply-and-accumulate (MAC) operations, whose order can be changed arbitrarily. In our experiments in Section 4.3.5, we show that our verification framework is insensitive to changes in the order of the basic operations performed by the ANN. In other words, all equivalent implementations of the same ANN will yield the same verification performance in terms of time, memory usage and outcome. Thus, for the remainder of this section, we work under the assumption that a specific implementation is given, and detail the sequence of processing steps required for its verification. In this section, we discuss how our implementation models work to support fixed-point verification of neural network implementations. 
Generally, there are two ways of supporting fixed-point neural network implementations [39] : (1) converting inputs into fixed-point and performing all the underlying steps, e.g., training and validation, in fixed-point; or (2) converting trained models and neural network operations, e.g., realization, from floating-point representation into fixed-point, which is then followed by a check of the desired properties. The former is likely to produce better representations, but the latter is likely to be more practical [53] , mainly because datasets are usually provided in floating-point representation. In the present work, we have chosen the latter. Moreover, such a method, also known as network compression or quantization, is also the usual way of deploying neural networks on restricted devices, which reinforces its use. Our goal is to transform an existing model (and its constraints) defined in the C programming language into a fixed-point representation. Here, a fixed-point format is specified as ⟨I, F⟩, where I denotes the number of bits used to encode a value's sign and integral part, and F indicates the number of bits used to encode its fractional part. Furthermore, given a rational number, we can represent it in fixed-point by using I + F bits, which are interpreted as an integer M scaled by 2^{−F}, i.e., as the value M/2^F. Such a representation allows us to take the limitations of the hardware platform where a specific model will be executed into account, in such a way that a more suitable implementation is provided. Moreover, in the present context, two's complement is used for value representation and arithmetic operations, due to some advantages, such as the wrap-around effect [19] . For instance, if we want to encode the number +3.25 into format ⟨5, 3⟩, it will give rise to the following representation in memory: {00011|010}, with the most significant bit (i.e., 0) indicating the sign "+", the integral part encoding the value 3, and the fractional part encoding 2, i.e., 2/2^3 = 0.25. 
In order to model the quantization effect on an ANN's computation steps, we need to convert each arithmetic operation (addition, subtraction, multiplication, or division) from floating-point to its respective fixed-point counterpart. In particular, these operations and conversions must take into account the parameters I and F, along with the sign bit. We achieve this goal with the implementation models proposed by Chaves et al. [19] , which have been extensively validated in the digital-controller domain. Indeed, they replace the mentioned arithmetic operations (i.e., "+", "−", "*", and "/") and then return results according to a specific precision. Furthermore, these implementation models formally define a set of methods and values that precisely represent the behavior of fixed-point operations. In Fig. 4 , we show an example of how to convert a piece of floating-point source code into a fixed-point representation with the proposed implementation models. Here, we have a code snippet that computes the activation potential of a single neuron, one of the basic operations in ANNs. One may notice how the types and operations have been changed in the fixed-point version. In particular, fxp_float_to_fxp transforms a type float into a type fxp_t (fixed point), and both fxp_add and fxp_mult make sure that the addition and multiplication arithmetic operations are performed in fixed-point and take into account the previously defined desired precision. In summary, the fixed-point version of an ANN's code references the appropriate implementation models, thus ensuring that the behavior of each fixed-point arithmetic operation is carried out correctly. Our experiments, in Sections 4.4 and 4.4.3, show the impact of different levels of quantization granularity in ANNs. Finally, another aspect is worth mentioning: for an entirely correct implementation, when a fixed-point format is chosen, one should still represent the dynamic range associated with the target data. 
If that is not done, overflow occurs, which introduces errors that can jeopardize an ANN's decision. In other words, if a given variable holds values that range from −15.5 to 15.5, for instance, a format ⟨2, 2⟩ should not be used because that would lead to frequent overflow events. Specifically, values above 3.75 would not be represented. Consequently, in this specific case, a format ⟨5, 2⟩ (note the dynamic range provided by the integer part), for instance, would be suitable, thus keeping computation correct in all associated operations. As mentioned in Section 2.1, the choice of an activation function can have a considerable impact on verification times. While piece-wise linear functions can be readily represented as a (sequence of) if-then-else instructions, non-linear activation functions require careful adjustments to avoid severe performance degradation. This section presents an approach to convert such non-linear functions into look-up tables, thus significantly speeding up verification processes. Assume that the non-linear activation function N : U ↦→ R is a piece-wise Lipschitz continuous function [85] ; thus, there is a finite set of locally Lipschitz continuous functions N_i : U_i ↦→ R, for i ∈ N_{≤p}, the so-called selection functions, such that the sets U_i ⊂ R are disjoint intervals, N(u) ∈ {N_1(u), . . . , N_p(u)} holds for all u ∈ U, U = ∪_{i ∈ N_{≤p}} U_i, and

|N_i(u) − N_i(u′)| ≤ L_i |u − u′|, ∀u, u′ ∈ U_i, (10)

where L_i denotes the Lipschitz constant of N_i. The proposed discretisation approach is applied to each subset U_i. The general idea consists in discretising each U_i by obtaining the discrete and countable set Ũ_i ⊂ U_i. Then, we build a lookup table for rounding the evaluation of N(u) to Ñ(u) : U ↦→ R, and consequently rounding N_i(u) to Ñ_i(u) ∈ {Ñ_1(u), . . . , Ñ_p(u)}. This lookup table contains n_i uniformly distributed samples within U_i, including the interval limits, to ensure the accuracy ∥Ñ(u) − N(u)∥ ≤ ε. 
Let ℓ_i be defined as the length of the interval U_i, i.e.,

ℓ_i ≜ sup U_i − inf U_i. (11)

This way, the following theorem can be used to choose the number of samples n_i that ensures the desired accuracy ε. Theorem 3.1. Let the non-linear activation function N : U ↦→ R, N(u) ∈ {N_1(u), . . . , N_p(u)}, be piecewise Lipschitz continuous such that each selection function N_i : U_i ↦→ R presents the Lipschitz constant L_i, and consider the discrete approximation Ñ(u) ∈ {Ñ_1(u), . . . , Ñ_p(u)}, where each selection function Ñ_i : U_i ↦→ R, for i ∈ N_{≤p}, is obtained with Ũ_i ⊂ U_i containing n_i samples. The approximation error is bounded as

∥Ñ(u) − N(u)∥ ≤ ε (12)

for a given ε, if

n_i ≥ (L_i ℓ_i)/ε + 1, ∀i ∈ N_{≤p}, (13)

holds. Proof. Given that the length of each interval U_i is ℓ_i (cf. (11)), the length of each sub-interval, obtained by uniformly dividing U_i at the n_i samples, is ℓ_i/(n_i − 1). Considering the Lipschitz continuity in (10), the rounding error for Ñ_i(u) is bounded as

|Ñ_i(u) − N_i(u)| ≤ L_i ℓ_i/(n_i − 1). (14)

If (13) holds for all i ∈ N_{≤p}, the inequality

L_i ℓ_i/(n_i − 1) ≤ ε (15)

and, consequently, (12) also hold. Moreover, from (14) and (15), ∥Ñ(u) − N(u)∥ ≤ ε. □ Based on Theorem 3.1, the number of samples used in the discretization of nonlinear activation functions, such as N_TanH and N_Sigm, described respectively in (4) and (5), can be computed to ensure some desired accuracy. Without loss of generality, the approximation Ñ(u) can be defined as Ñ(u) = N(A(u)), where A : U ↦→ Ũ is an arbitrary approximation operator, e.g., rounding or quantization. For instance, consider that we want to obtain the function Ñ_Sigm, which approximates N_Sigm based on a discrete domain Ũ, with target accuracy ε = 0.01. It is clear that N_Sigm is globally Lipschitz continuous with constant L_Sigm = 0.25, since sup_{u∈U} |N′_Sigm(u)| = 0.25 and U = R. Moreover, let us choose the following three intervals to define the approximation Ñ_Sigm(u): U_1 = (−∞, −20], U_2 = [−20, 20], and U_3 = [20, +∞), since the derivative of N_Sigm(u) is negligible for u ∈ U_1 ∪ U_3, i.e., L_1 ≈ 0 and L_3 ≈ 0, while the constant L_2 is equivalent to the global Lipschitz constant, i.e., L_2 = L_Sigm = 0.25. 
Now, we can use (12) to compute the number of samples in each interval necessary to ensure the desired accuracy ε = 0.01. Accordingly, the numbers of samples are n_1 = n_3 = 1 and n_2 = 1001 since η_2 = 40 (cf. (11)). Notice that the approximators A_i can be arbitrarily chosen. For this example, it is suggested to choose A_1(x) = −20 and A_3(x) = 20, because it is not necessary to have more samples than the limits of the intervals U_1 and U_3. Finally, A_2 can be chosen as the half-towards-zero rounding with 3 decimal digits for floating-point and real ANNs, and as the underlying quantization function for fixed-point ANNs. Fig. 5 illustrates the effect of the discretization when evaluating the sigmoid function. Note that the approximation fits well for ε = 0.01, and it becomes poor as ε increases. It is worth mentioning that a look-up table is fundamentally a trade-off between speed and memory. If the latter is not a restriction, verification processes may benefit from such a strategy. Another interesting point is that such a discretization strategy should be compatible with the final desired fixed-point format, so that a safety-property verification is not compromised. This way, ε should be arbitrarily small and, in particular, much smaller than the quantization step incurred by a fixed-point format. As we explain in Section 2.3, verifying an ANN means proving that a given safety property holds. Such a safety property is a falsifiable mathematical relation defined on the values of an ANN's variables. Since we are considering software implementations of ANNs here, in this section, we show how to annotate ANN code and also how to specify a desired safety property. As a preliminary step, we annotate code by replacing the concrete input to the neural network with a general non-deterministic input.
We do so by assigning a non-deterministic value to each input variable, as in the following example (another example is shown in Figure 6): x_2 = nondet_float(); where we use the notation nondet_float() prescribed by our underlying verification tool ESBMC [36, 37]. With this, our verification tool knows to expect any possible input, and it is then the role of the safety property (see Equation 21 below) to restrict the input space to the sub-domain of interest. Let us consider safety properties in the general form x ∈ H =⇒ y ∈ G, where knowing that an input vector x belongs to H guarantees that the output vector y = f(x) belongs to G. Consequently, we encode the premise of this implication with a pre-condition instruction assume, specifying the set of values H that each x_i ∈ x can take. For example, a rectangular domain for the input variable x_2 can be written as assume(x_2 >= -0.5 && x_2 < 0.5). (22) This notation instructs the subsequent SMT model checking to search only the inputs that satisfy the conditions specified in the assume instruction, as it ignores an execution when the condition is false (e.g., see __ESBMC_assume [2]), thus making sure that the premise of the safety property x ∈ H is satisfied. Note also that the instruction assume is general and supports any Boolean condition as its argument. 1 This way, any form of input region H can be specified, as long as it is valid C code syntax. At the same time, hyper-rectangular input domains tend to lead to faster verification times, as mentioned in Section 2.3. In contrast, we encode the conclusion of the implication with the post-condition instruction assert, specifying the set of values G in which each variable y_i ∈ y can safely range. For instance, if we have a binary classification network with two outputs y_1 and y_2 indicating the score of each class, we can encode the conclusion of a robustness safety property for the second class as assert(y_2 > y_1).
Consequently, it requires that, whenever that premise is satisfied, our binary network always predicts the second class. As for the input region H, the assert instruction can be used to specify a variety of output regions G, now asserting what is expected at the output. Once the safety property has been specified, as we explain in Section 3.4, we can inject further assume instructions into the code and reduce the model checker's search space. Indeed, given the sequential nature of ANN computation, the set H of values allowed by the premise of a safety property also constrains the range of the following intermediate computation steps. Thus, if we can explicitly derive and unfold these additional constraints onto intermediate variables, in such a way that we propagate constraints and benefit from them in subsequent operations, we can more succinctly tell a model checker where to look for counterexamples. In general, deriving additional (over-approximated) constraints on intermediate computation steps falls under the umbrella of invariant inference [83]. It is based on the discovery of an assertion that holds during the execution of a given piece of code, which can then be used in verification procedures. For neural network code, which does not contain loops or dynamic memory allocation, we find that an interval invariant analysis suffices [70]. Such a method of invariant analysis computes lower and upper bounds on the values of each program variable (e.g., l ≤ v ≤ u, where l and u are constants and v is a variable), by propagating the initial set H through an ANN with interval arithmetic rules. One may notice that more complex constraint propagation methods (e.g., zonotopes and polyhedra) exist in the literature [92], but whether the reduction in search space justifies the additional computational cost is an open problem.
Moreover, given that neural network quantization, as tackled here, is already used for integrating this kind of system into restricted devices, low complexity is desired, at least initially. On the more practical side, there are many tools to perform interval analysis of C code. In our experiments, in Section 4, we have used the evolved value analysis (EVA) plugin of the open-source tool FRAMA-C [14]. We then inject intervals into ANN code as additional pre-condition instructions assume on intermediate variables, thus covering the entire processing chain. Finally, we have compared this method with the native interval analysis support provided by the state-of-the-art verification tool ESBMC [72], in Section 4.3.2, and found that combining them (both enabled) yields the best results. Given the annotated C code from Sections 3.4 to 3.5, we are now tasked with answering the following verification question: do all inputs that satisfy the pre-conditions assume also satisfy the associated assert post-conditions, in a specific ANN implementation? In other terms, are we able to find at least one specific input that violates a safety property, given an ANN implemented with a specific precision? In this section, we explain how to answer this question with state-of-the-art symbolic model checking techniques. In general, model checking is concerned with verifying whether a given property holds for a finite state transition system, which is typically represented by a triple (S, I, T), together with a safety indicator P [21]. More formally, these mathematical objects are defined as follows:
• S is the set of states a system can be in, where each state consists of the value of the program counter (PC), local and global variables;
• I : S → {0, 1} is an indicator function for a set of initial states;
• T : S → S, with s, s′ ∈ S, is a transition function describing a system's evolution, i.e., pairs of states specifying how a system can move from state to state;
• P : S → {0, 1} is an indicator function for safe states.
In our case, the annotated ANN code defines these objects implicitly. S represents all possible value assignments to a set of program variables, including the PC. I indicates all assignments that satisfy the existing assume pre-conditions. T holds the semantics of each instruction in the code, defining how to go from one state to another, which allows checking for reachability (cf. Definition 3.2 below). Finally, P represents a safety property encoded with the existing assert post-conditions. In practice, several state-of-the-art model checkers accept C code as input [12, 36, 37, 64]. Frequently, input code is readily converted into static single assignment (SSA) form before further processing [24], which has the advantage of making the underlying finite-state transition systems more explicit. We show an example of such a conversion procedure in Fig. 6, parts (a) and (b). Note that, in all experiments in Section 4, we use ESBMC for this model checking step [36, 37]. Like any other state-of-the-art model checker, ESBMC has been heavily optimized to reduce verification times. However, not all of these optimization techniques apply to feed-forward neural network code, which does not contain loops and recursions. In the following Sections 3.7 and 3.8, we clarify which techniques do apply to ANN code. The SMT-LIB logic format introduced an assertion-stack concept and the ability to push and pop assertions on it [7]. In particular, some SMT-LIB compliant SMT solvers have an internal stack of assertions, to which we can add new assertions or from which we can remove old ones. The main idea here is to enable assertion retraction and lemma learning incrementally. The former allows one to add assertions to a formula, evaluate the individual result, and then return the same formula to its original form. The latter happens when the SMT solver stores facts (in the form of lemmas over a formula's variables) it has already determined about a formula, which may prove helpful in future checks.
Here, we enable the underlying SMT solver to use lemmas determined during previous checks in future ones, thereby optimizing search procedures and potentially eliminating a large amount of formula state-space to be searched. Note that previous studies report encouraging results using incremental (bounded) model checking for software, increasing the search depth without incurring the overhead of restarting a verification process from scratch [44]. This way, we apply incremental SMT solving to verify neural net implementations, where a formula is built up in stages, and lemmas are learned, along the way, about that same formula. In particular, this incremental verification is beneficial for exploring neural net implementations with ESBMC since they contain various ite operators (e.g., to represent ReLU activation functions). The existing operation of the SMT solver follows directly from ESBMC. Indeed, once we build the directed acyclic graph (DAG) and produce an SSA program by symbolic execution, from a neural net's implementation, that program is converted to a fragment of first-order logic and translated into a form acceptable to the SMT solver. Then, after checking the satisfiability of a given formula, the latter is discarded. Here, many ite operations will be converted, solved, and discarded during a neural net's verification procedure. Since each variable in an ite operation is assigned only once along each path in SSA form, this requires a case split to evaluate the activation function, e.g., x = c ? v_1 : v_2. As a result, we call the SMT solver during symbolic execution to check the satisfiability of the guard c and then determine the value of variable x. Using ite retraction to build and deconstruct a formula has the potential to reduce SMT-conversion overhead, and lemma learning could lead to swifter verification times.
The SMT solvers supported by ESBMC (i.e., Z3 [26], Yices [31]) claim lemma learning as a feature, thereby allowing us to evaluate its impact on verifying neural-net implementations. To use incremental SMT during neural net verification, we must identify ways to reuse an SMT formula by pushing and popping ite operations into the solver. In particular, we retain the formula produced for an ite operator, identify the common prefix between it and the next ite operator produced, and retract all the ite operations that can be evaluated. Then, we place the ite operators that could not be evaluated on top of the remaining formula. Fig. 6 illustrates this approach. In particular, in Fig. 6(a), we have two inputs in lines 4 and 5; three assignments in lines 6, 8, and 10; three ite operators, which represent ReLU activation functions, in lines 7, 9, and 11; and one assertion representing a safety property, in line 12. Fig. 6(b) illustrates the program of Fig. 6(a) converted into SSA form (i.e., each variable is assigned exactly once), which is the format we use for incremental learning. During the symbolic execution of this neural net implementation, based on Fig. 6(a), we check the satisfiability of guard "a < 0", in line 7, and conclude that it could be evaluated as either "true" or "false" since "a" can assume values between −3 (lowest) and 2 (highest). As a result, we cannot simplify this expression before checking the safety property in line 12 of Fig. 6(a). However, we can learn from this assignment and place its ite operation on top of the remaining formula, which can then be used to check the mentioned safety property. After that, we check the satisfiability of guard "b < 0", in line 9, and conclude that it always evaluates to "false" since "b" can assume only non-negative values between 0 (lowest) and 5 (highest). We are thus able to remove this expression and the respective assertion.
Similarly, we check the satisfiability of guard "f < 0", in line 11, and also conclude that it always evaluates to "false" since "f" can assume only non-negative values between 0 (lowest) and 4 (highest). We show the simplified neural net implementation obtained with our incremental verification via lemma learning in Figure 6(c). One may notice that we have safely removed the two ReLU activation functions represented by the variables b and f, initially present in Fig. 6(b), which thus reduces the size of the formula to be checked by the underlying SMT solver. Note further that we have learned that variable "a" can assume values between −3 (lowest) and 2 (highest), which can be used to check the assert statement specified in line 9 of Fig. 6(b). Consequently, that same assert can no longer be identified in Fig. 6(c) because the knowledge of its range allowed such a simplification. The assertions b ≤ 5 and f ≤ 4 were also removed since we previously learned the intervals for the variables b and f. Lastly, we can observe the ability to perform a query at any neuron using incremental verification, which can help prune a neural net implementation before deploying it to an embedded device with time, memory, and energy constraints. Our employed verification engine implements general code optimizations when converting a neural net implementation to SMT. These include constant folding, slicing, and expression balancing [22], which we briefly introduce here. Constant folding evaluates constants, including nondeterministic symbols, and propagates them throughout the resulting formula during encoding. In particular, we exploit the constant propagation technique to reduce the number of expressions associated with specific neuron computation procedures and activation functions.
Thus, we simplify the SSA representation, using local and recursive transformations, to remove functionally redundant expressions (for neuron computation procedures and activation functions) and redundant literals (for safety properties). We apply such simplifications to reduce the size of the resulting formula and consequently achieve simplification within each time step and across time steps, during the encoding procedure of a neural net's implementation. In our experimental evaluation, in Section 4.3.2, we have noticed substantial improvements using these simplifications in formulas, but we have not identified improvements using the constant propagation approach itself. This happens because neural net inputs are typically symbolic rather than constant, as can be noticed in the illustrative example in Fig. 6, where incremental learning removed the activation functions for neuron b and output f. Slicing removes expressions that do not contribute to the checking procedure of a given safety property. It is an essential step that can improve a program's verification procedure considerably in some cases [71]. Our verification engine implements two slicing strategies in combination. First, it removes all instructions after the last assert in the set of SSAs. Second, it collects all symbols (and their dependent symbols) in assertions and removes instructions that do not contribute to them. When used in combination, both slicing strategies ensure that unnecessary instructions are ignored during SMT encoding. As an example, consider the code in Fig. 6(a). If we are interested in checking that neural net's output f only, we could rewrite the final assert statement, in line 12, as f <= 4. Consequently, such a modification indicates that everything not involving f has no impact on the conclusion of the intended safety property. Based on such a scenario, the resulting SSA for the code in Fig.
6(a) would be sliced so that no information (states) regarding the neurons unrelated to f remains. In our experimental evaluation, in Section 4.3.2, we have observed that slicing can significantly reduce the resulting SMT solving time. Expression balancing reduces the size of SMT formulae by reordering long chains of operations with the associative rule. This technique has been recently applied to neural networks by Giacobbe et al. [39], but it has been used in compilers for decades. In brief, the computation of neuron potentials in ANNs requires a linear combination of the neuron inputs (see Equation 2). Depending on the specific implementation, the resulting sequence of multiply-and-accumulate (MAC) operations in the code is translated to SMT formulae of different sizes. In the worst case, which is portrayed in Fig. 7a, the formula's nesting depth is linear in the number of MAC operations. Expression balancing ensures that the SMT formulae are always reordered as in the best-case scenario shown in Fig. 7b. That is, the sequence of MAC operations is split over multiple accumulators in a divide-and-conquer fashion, yielding a set of semantically equivalent, but smaller, SMT formulae. In Section 4.3.2, we show that this associative balancing step is crucial in making ANN verification viable. Note that this result is consistent with those presented in [39]. Furthermore, in Section 4.3.5, we show that, thanks to this balancing step, the performance of our verification methodology is stable across different implementations of the same ANN. We conclude this section with an illustrative example of our verification methodology. We do so in order to clarify the user's side of the workflow illustrated in Fig. 3. Later, in Section 4, we report more details on the range of ANNs and safety properties that can be verified with our methodology, as well as the efficiency of doing so. The present example, illustrated in Fig. 8, shows how to verify a character recognition ANN.
First, given a network's architecture and weights in a high-level representation, as in Fig. 8a, such elements should be converted into single-threaded C code. This task can be achieved through the popular machine learning libraries PyTorch [79] and TensorFlow [1] or, as in many of our experiments in Section 4, by converting from the mid-level representation NNet 2. In this example, we use the neural network from our Vocalic benchmark (see Section 4.2.2) quantized to a fixed-point representation with 8 integer (including sign) and 8 fractional bits. Second, the ANN source code undergoes a further sequence of transformations. Initially, we replace all floating-point arithmetic operations with the corresponding fixed-point implementation models (see Section 3.2), given that our ANN is quantized. Then, we also replace any sigmoid, hyperbolic tangent, or piecewise-linear activation function with its corresponding discretized look-up table (see Section 3.3). Third, a safety property is encoded by adding the corresponding pair of assume and assert instructions. In the present example, we check for robustness around a specific input image, which we show in Fig. 8b. More formally, we define the input region of our safety property (premise) as a set H = {x : |x − c|_∞ ≤ ε}, where the centre point c corresponds to the 5 × 5 pixel values in the image of the ideal character "A", i.e., without deviation, in Fig. 8b, and ε = 80. For reference, we report the lower and upper bounds of H in the gray image pixel domain in Figs. 8c and 8d, respectively. Likewise, we set the output region of the safety property (conclusion) as the set of all outputs that assign a higher score to class "A" than to any other output class, i.e., y_A > y_j, ∀ j ≠ A. Note that the final softmax layer, typically included in classification ANNs, can be omitted for our purposes since it is a monotonic function of the score of each class [13].
After this, a static analysis tool such as FRAMA-C [14] propagates the input region H through the associated ANN code and annotates it with additional assume instructions, representing the reachable value interval of each intermediate variable (see Section 3.5). Fourth, the annotated C code goes through a model checker that tries to falsify the safety property. In our experiments (see Section 4), we have used ESBMC to do so, as it is a good representative of state-of-the-art SMT model checkers [36, 37]. If a given safety property cannot be verified, ESBMC returns a counterexample that falsifies it, which represents a potential adversarial attack on a neural network. In the present example, ESBMC does indeed report such a counterexample, which we show in Fig. 8e. More adversarial examples can be seen in Figs. 18a, 18b, and 18c, for a wide range of safety properties and quantization granularities of our character recognition ANN. In this section, we test the performance of the verification approach we introduced in Section 3. In this regard, we are mainly interested in the following research questions: RQ1 (Ablation study) - Is it possible to establish the role of each of the enhancement techniques introduced in Section 3 and also define an optimal setup, both regarding total verification time and performance? RQ2 (Quantization effects) - How does a quantization choice influence our verification process and the safety of a neural network? RQ3 (Comparison with SOTA techniques) - What is the performance of our verification approach when compared to the existing literature? Regarding RQ1, since those techniques were first introduced for software verification in general, we are particularly interested in finding their optimal configuration for verifying ANNs, including each technique's contribution and general setup.
In addition, RQ2 is related to the quantization of ANNs, which is at the core of the present work and has the potential to provide a methodology for integration into target platforms. Moreover, if we were to verify the same property for different quantization levels, would we observe any difference in verification time or outcome? Finally, regarding RQ3, it is always of paramount importance to position a given approach within the existing scientific knowledge. We present our answers to those questions in the following way. In Section 4.1, we discuss a configuration step regarding quantization and also general data processing, to provide adaptation and avoid overflow in ANN operations. In Section 4.2, we describe the datasets and ANNs that constitute our verification benchmarks, including the necessary minimum number of bits for correct data-range representation. In Section 4.3, we isolate the contribution of each component of our verification approach and propose the configuration that yields the best results performance-wise, which answers RQ1. In Section 4.4, we compare the performance and output of our verification approach across different quantization levels of the same problem, which addresses RQ2, while analyzing important aspects and general behavior and also providing guidance on integration into restricted platforms. In Section 4.5, we compare our verification framework with the most popular SOTA approaches, which fulfills RQ3. Finally, in Section 4.6, we list the remaining limitations towards large-scale verification of fixed-point ANNs. All benchmarks, tools, and results associated with the current evaluation are available for download at https://tinyurl.com/6y7e49vk. As mentioned at the end of Section 3.2, when correctness comes into play, not every quantization format can be used. Indeed, if a format that is not suitable for the target ANN is chosen, overflow will likely occur, compromising operation results and the general ANN output.
Nonetheless, a designer may also adopt aggressive quantization and regard errors due to wrong operations as an acceptable side effect (even under frequent overflow). Still, our goal is to provide compression that results in quantization error only, thus preserving an ANN's associated dynamic range and the correct computation of operations in neurons. Another aspect is that input data may present a broad diversity of dynamic ranges. As a consequence, they are usually processed in scaled format. In our framework, input data is first normalized to the range [0, 1] and then fed to a given ANN (also for training). This way, the initial (input) dynamic range is always known. Consequently, it is essential to analyze the neurons in a given ANN and identify the minimum and maximum values resulting from their processing, given input data in the range [0, 1], which will define the minimum number of bits for the integer part of a given representation. This does not target maximum compression: it only intends to correctly represent the existing dynamic range and avoid overflow. Besides, we should also check the number of bits for the fractional part to provide the desired accuracy. Note that the minimum number of bits for the integer part is discovered by using Eq. (8), with inputs bounded by 1, and taking into account all weights of each neuron to find the maximum magnitude. Alternatively, FRAMA-C [14] can also be used, as it reveals the intervals associated with variables in ANN code. In our evaluation, we consider ANNs trained on two datasets: the UCI Iris dataset [30] and a vocalic character recognition dataset [86]. This section gives the details regarding the employed datasets, the neural networks we trained on top of them, the safety properties that we used to test our verification approach, and, finally, our general experimental setup. The Iris dataset [30] consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).
This dataset contains both the length and width of the sepals and petals in centimeters (our inputs) and the Iris species label (our output). Here, we use TensorFlow version 1.4 [1] and Keras [43] to train a feedforward neural network with layers of 4 × 7 × 3 neurons, hyperbolic tangent activation functions, and a softmax output layer. We train this neural network to predict the correct Iris species with the backpropagation algorithm and cross-validation [13]. When quantizing the ANN to fixed-point arithmetic, we followed what was presented in Section 4.1. We found that the maximum neuron output was bounded, in modulus, by 23.3. Consequently, we allow for 6 integer bits, including sign, as they are required to avoid overflow. In terms of safety properties, we specify hyper-rectangular input regions for each species: setosa, versicolor, and virginica. We identify the center of these regions from the dataset with the granular fuzzy clustering algorithm in [23]. Then, for each of the four input variables, we computed its maximum range. With it, we generated nine regions for each class, sharing the same center but with different sizes r ∈ {1, 2, 5, 8, 10, 20, 30, 40, 50} of the hyper-rectangle surrounding it, where r is a percentage representing the fraction of the maximum input range. The vocalic dataset [86] consists of 200 gray-scale images with dimensions 5 × 5 pixels. Half of the dataset consists of the base images illustrated in Fig. 9 and noisy versions of them. In contrast, the other half presents non-vocalic images. With it, we have trained a feedforward neural network with architecture 25 × 10 × 4 × 5 and sigmoid activation functions. As Fig. 9 shows, there are five output classes that this network learned to discriminate via the backpropagation algorithm and cross-validation. Once again, we have followed what was presented in Section 4.1 and found 53.9 as the maximum neuron output.
Consequently, we have quantized this ANN to fixed-point arithmetic with a minimum of 7 integer bits, including sign, as they are required to avoid overflow. As far as the safety properties are concerned, we specify five hypercubic input regions corresponding to the vocalic labels. The centers are defined by the base images in Fig. 9. Similarly to the Iris benchmark, we generate five instances of these regions with different sizes s ∈ {10, 20, 40, 80, 120}, where s represents the hypercube's side length. The Acas Xu benchmark [58] is the result of avionics research in airborne collision avoidance systems (ACAS) for unmanned aircraft (Xu). In particular, when avoiding a nearby aircraft, some specific piloting decisions must be taken. These are recorded in a large state-action table that is impractical to store on-board due to its memory requirements. The Acas Xu benchmark splits and compresses such a table into a set of 45 neural networks. The split is done by discretizing the following two input dimensions: time until loss of vertical separation (9 intervals) and previous advisory action (5 actions). The remaining 5 inputs are fed into a fully-connected feedforward neural network with ReLU activation functions and architecture 5 × 300 × 300 × 300 × 300 × 300 × 300 × 5, which outputs a prediction for each of the 5 possible actions. We quantize all these 45 ANNs with 27 integer bits, which is the least number of bits required to avoid overflow in the worst-case scenario, i.e., with a neuron output of 72142560.0, as pointed out by FRAMA-C [14]. More details on the associated safety properties can be found in [61] and in Section 4.5. We have conducted our experimental evaluation on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 128 GB of RAM and Linux OS. All presented execution times are CPU times, i.e., only the elapsed periods spent in allocated CPUs, measured with the times system call [69].
All experimental results reported here were obtained by executing ESBMC v6.6.0 3 with the following command-line parameters, unless noted otherwise: esbmc -I -force-malloc-success -no-div-by-zero-check -no-pointer-check -yices -no-bounds-check -interval-analysis -fixedbv. In general, we let ESBMC run without time or memory limits. The timeouts reported in the following experiments are all due to exceedingly high memory consumption. All of our benchmarks have been annotated with the reachable intervals provided by FRAMA-C, unless noted otherwise. In particular, we executed FRAMA-C using the following command: frama-c -eva -eva-plevel 255 -eva-precision 11. This section aims at evaluating the impact of different aspects of our approach on the total verification time. Here, our aim is both to discover the best configuration for our verification tool and to shed some light on the importance of each technique in reducing the search space of the verification problem. Specifically, we address four choices in our verification approach: the SMT solver, optional parameters offered by the ESBMC verification engine, the interval analysis technique, and the expression balancing strategy. For the experiments of the present section, we have chosen ESBMC as our verification engine since it has been extensively evaluated at various SV-COMP [11] competitions, where it has consistently achieved state-of-the-art results [36]. In more detail, the ESBMC model checker takes care of converting input C code into SMT formulae and then calls an external SMT solver. Currently, ESBMC supports four solvers: Bitwuzla, Boolector, Yices, and Z3. In general, they yield different verification results, both in terms of the generated counterexample (if any) and verification time. Here, we are interested in comparing the performance of such solvers in verifying ANN implementations. To this end, we run them on all our fixed-point benchmarks, with word lengths of 8, 16, and 32 bits.
With this choice, we cover the most popular quantization lengths and observe the behaviour of our verification methodology on a varied test suite. We use these experimental settings throughout our ablation study (see also Sections 4.3.3, 4.3.2 and 4.3.5). The results of our comparison are summarized in Fig. 10. There, we can see that the solvers Bitwuzla and Boolector have nearly identical performance in terms of verification time (Fig. 10a). In contrast, Yices exhibits a considerable advantage across the whole verification suite, being, in some specific cases, even two orders of magnitude faster (Fig. 10b). Finally, the solver Z3 struggled to complete the majority of verification runs and is, in general, orders of magnitude slower than the other three solvers. For this reason, we do not portray its results in Fig. 10. Given the results in Fig. 10, we choose Yices as our underlying SMT solver for the rest of this experimental section. While it is impossible to know exactly why Yices is the best-performing solver on our test suite, we speculate it is a consequence of the fact that ESBMC encodes verification problems into SMT formulae with the formalism of the QF_AUFBV logic. Here, QF stands for quantifier-free formulas, A stands for the theory of arrays, UF stands for uninterpreted functions, and BV stands for the theory of fixed-sized bit-vectors. For this type of formula, Yices represents the state-of-the-art SMT solver. In Sections 3.7 and 3.8, we have presented a number of state-of-the-art software verification techniques that apply to ANN implementations. From our prior experience of participating in software verification and testing competitions (e.g., SV-COMP and Test-Comp), such techniques play an essential role in optimizing the performance of ESBMC on a given set of benchmarks [36, 38, 73]. In the present section, we quantify their individual impact on the verification times of our test suite and comment on their relative performance.
Here, we rely on the fact that ESBMC's verification engine allows us to toggle each separate technique via command-line parameters. More specifically, the list of verification techniques and corresponding ESBMC parameters is as follows:

• Constant propagation. It can be disabled with the option no-propagation. Otherwise, it will generate a minimal set of SSAs in the symbolic engine.

• Slicing. It can be disabled with the option no-slice. Otherwise, it will eliminate redundant or irrelevant portions of a program [25]. In ESBMC, this is applied to the SSA program before it is encoded to SMT, reducing the number of variable assignments by identifying variables not used to evaluate any property assertion.

• Incremental verification. Activated with the (experimental) options smt-during-symex and smt-symex-guard. The former enables incremental SMT solving using the SMT solvers Yices or Z3; the latter allows calls to the solver during symbolic execution to check the satisfiability of the guards.

• Expression simplification. It can be disabled with the option no-simplify, which effectively neuters constant propagation: no fact is statically determined to be true or false, and exploration always proceeds to the full unwind bound.

In the plots, we discriminate between successful verification outcomes (S) and falsifiable safety properties that admit a counterexample (F). Here, we quantify the impact of each technique on the same test suite of Section 4.3.1. We do so by setting a reference configuration and toggling one verification technique at a time. For reasons that become clear from the results shown in Fig. 11, our reference configuration of ESBMC has constant propagation, slicing, and expression simplification enabled. In contrast, we choose to keep incremental verification disabled. As the results in Fig. 11a show, constant propagation makes no difference on our test suite.
This is because we are verifying a specific kind of safety property, namely robustness to adversarial examples, which allows all input variables to be modified. As such, there is no constant input that can be propagated through the ANN code, thus yielding no reduction in the size of the SMT formulae. At the same time, we believe that constant propagation is a useful technique for safety properties that restrict the attack surface to just a subset of the input variables, such as the ones identified by Karmon, Zoran, and Goldberg [60]. In contrast, Fig. 11b shows that slicing yields a small improvement in performance, which becomes more significant as verification times get shorter. We speculate that this is because neural networks are usually redundant (e.g., see dropout [13]), and thus the majority of neurons contribute to the ANN output. As a consequence, only a small number of expressions can be removed with slicing. Interestingly, incremental verification (cf. Fig. 11c) does not improve verification time as expected. We believe this happens because the cost of deriving and storing new facts during the verification process, which requires various calls to the solver, outweighs the reduction in search space they induce. Still, we hypothesize that incremental verification may offer some advantages when verifying not just one but a whole set of safety properties, since it allows incrementally remembering important facts across properties, whose net contribution may pay off. For example, we could perform a query at any neuron using incremental lemma learning, which could help prune a neural net implementation before deploying it to an embedded device with time, memory, and energy constraints. However, we leave the exploration of such a hypothesis for future work. Finally, expression simplification is crucial in making the verification of our test suite practical.
Indeed, without expression simplification, none of the safety properties could be checked before hitting our machine's memory limit of 128GB, despite letting the verification process run without any time limit. In Section 3.5, we introduced interval analysis as an essential pre-processing stage before running the verification engine on ANN code. Here, we show the effect of disabling such an important step on total verification times. Furthermore, we compare two approaches to interval analysis and discuss their results. The first requires FRAMA-C [14] to annotate the ANN code with additional assume instructions. In contrast, the second requires running ESBMC with the extra -interval-analysis option enabled. Note that both of them compute hyper-rectangular constraints over program variables. For consistency with the previous experiments, we evaluate the impact of these two interval analysis options on the same test suite as in Sections 4.3.1 and 4.3.2. We present the results in Fig. 12, where the native -interval-analysis option and the intervals computed externally by FRAMA-C are compared with our reference configuration of ESBMC without any form of interval analysis. Note how the former has almost no impact on the verification time, while the latter can improve it by up to two orders of magnitude. Still, regarding the use of FRAMA-C, it is interesting to notice that we only observe an improvement on successful safety properties (S), i.e., those that do not admit a counterexample. Thus, the verification time of falsifiable properties (F) does not appear to be improved by interval analysis on our test suite. On the one hand, when no counterexample is found, FRAMA-C's more sophisticated interval analysis indeed pays off, given the apparent reduction in the state space that must be explored. On the other hand, when a property is falsifiable, a counterexample seems to be easily found in the proposed framework and adopted test suite.
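To make the annotation-based option concrete, the following standalone sketch emulates the effect of an interval annotation. This is our illustration only: the names assume_interval, relu, and neuron are hypothetical, and the real workflow would emit ESBMC's __ESBMC_assume intrinsic rather than the assert-based stub used here to keep the snippet self-contained.

```c
#include <assert.h>

/* Stand-in for __ESBMC_assume(lo <= v && v <= hi): in the real
   workflow, bounds computed by interval analysis prune the symbolic
   search space; here we merely check them at runtime. */
static void assume_interval(double v, double lo, double hi) {
    assert(lo <= v && v <= hi);
}

static double relu(double x) { return x > 0.0 ? x : 0.0; }

/* A two-input neuron annotated with a hyper-rectangular bound:
   with x0, x1 in [0, 1], w0 = 1, w1 = 2, b = -1, the activation
   potential lies in [-1, 2]. */
double neuron(double w0, double w1, double x0, double x1, double b) {
    double pot = b + w0 * x0 + w1 * x1;
    assume_interval(pot, -1.0, 2.0);
    return relu(pot);
}
```

Constraining each neuron's potential in this way is what shrinks the state space the SMT solver must explore for safe properties.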
As future work, we plan a deeper analysis of this matter, possibly proposing improvements to interval analysis focused on ANN code and properties. Note that the intervals produced by ESBMC work only for integer variables [35], while Frama-C can compute intervals for both integer and floating-point ones [17]. Since our benchmarks rely heavily on floating-point computations, we expected Frama-C to improve our verification results considerably compared to the interval analysis implemented in ESBMC, particularly for safe neural nets, due to the state-space size. Such performance improvement is in line with our previous experiments over a large set of open-source software benchmarks when enabling invariant generation [36]. In particular, in the mentioned study, invariant generation based on intervals allowed us to verify 7% more programs using a k-induction proof rule. Therefore, we chose to use both the -interval-analysis option in ESBMC and FRAMA-C's intervals for the upcoming experiments. Next, we evaluate the impact of discretizing non-linear activation functions via look-up tables. To this end, we compare three different resolutions of our look-up tables, which we call Res1, Res2, and Res3. These discretize the input interval [−6, +6] with one, two, or three decimal fractional places, respectively. Outputs for inputs that fall outside that range are automatically saturated to 0 or 1 for the sigmoid function and to −1 or +1 for the hyperbolic tangent. We report the corresponding results on the Iris and Vocalic benchmarks with 8, 16, and 32 bits, all condensed in Fig. 13. Although coarser resolutions usually result in faster verification times, as expected, given the inherent speed-up in operations, one may also notice some outliers: all regarding the Iris benchmark, when comparing Res1 with Res2, and a mixture of Iris and Vocalic benchmarks, when comparing Res2 with Res3. This is because different look-up table resolutions affect the computation of each neuron's output and, in some cases, even the ANN's output itself (see the example in Section 2.2).
Consequently, a given violation that happened early during state-space exploration may then occur later or may not even be identified anymore, thus introducing a lot of variability in the verification time. A more outcome-oriented comparison is presented in Table 1. As one can notice, the verification outcome is indeed affected by the resolution choice. In fact, comparing Res1 and Res2 on the Vocalic benchmark yields one instance where the two verification runs disagree: Res1 reports a falsifiable property with a counterexample (F), whereas such a counterexample disappears with the finer resolution Res2, and the property is declared safe (S). Unfortunately, if we increase the resolution further to Res3, the additional computational requirements overwhelm our verification setup, and we begin to observe a number of time-outs. This is more noticeable for the Vocalic benchmarks because they employ a larger ANN. Fig. 13. Comparison of verification times with different discretization resolutions for activation functions on the fixed-point Iris and Vocalic benchmarks. On the left, (a) comparison between one and two decimal places; on the right, (b) comparison between two and three decimal places. In both plots, we only report benchmarks that did not incur a timeout. In conclusion, choosing the right discretization resolution is a trade-off between verification time and possible errors in verification outcomes. In the ablation study in Section 4.3 and the later quantization experiments in Section 4.4, we choose the intermediate resolution Res2, based on two main reasons. First, it is the finest resolution that does not incur large numbers of timeouts when verifying our benchmarks. Second, all the counterexamples generated with it are valid, as we confirmed by running them through a non-discretized MATLAB implementation of the corresponding neural networks. In Section 3.1, we mentioned that a single ANN can be implemented in multiple ways.
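For instance, the activation potential of a single neuron (see (2)) admits at least two operation orderings, which the following sketch illustrates. This is our simplified rendering of the pattern later shown in Fig. 14; the function names are ours.

```c
#include <assert.h>

/* (a) Fully sequential activation potential: MAC operations follow
   the order of the input vector x. */
double potential_seq(const double *w, const double *x, int n, double bias) {
    double acc = bias;
    for (int i = 0; i < n; i++)
        acc += w[i] * x[i];
    return acc;
}

/* (b) Balanced version: the same sum, reordered in a
   divide-and-conquer fashion as a compiler might rebalance it. */
static double dot_balanced(const double *w, const double *x, int n) {
    if (n == 1)
        return w[0] * x[0];
    int half = n / 2;
    return dot_balanced(w, x, half) + dot_balanced(w + half, x + half, n - half);
}

double potential_bal(const double *w, const double *x, int n, double bias) {
    return bias + dot_balanced(w, x, n);
}
```

In exact arithmetic both compute the same value; under fixed- or floating-point arithmetic, the rounding order differs, which is precisely why stability of the verification results across such reorderings is worth checking.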
In fact, due to the intrinsic parallelism of neural architectures, the order of many mathematical operations can be shuffled arbitrarily. Here, we show that our verification methodology produces the same result (time and outcome) for very different orderings of these mathematical operations, and thus its performance is stable across them. Specifically, we focus on the order of operations required to compute the activation potential of each neuron, one of the basic building blocks of ANNs (see (2)). In this regard, we compare two opposite implementations of it, which we exemplify in Fig. 14. Table 1. Comparison of verification outcomes with different discretization resolutions of activation functions on the fixed-point Iris and Vocalic benchmarks. On the left, (a) comparison between one and two decimal places; on the right, (b) comparison between two and three decimal places. Both tables are structured as confusion matrices: entries on the main diagonal represent benchmarks with the same outcome under both resolutions. There, we discriminate between successful verification outcomes (S), falsifiable properties that admit a counterexample (F), and properties that incurred a timeout (TO). On the one hand, we have run a fully sequential version of the ANN code, where each multiply-and-accumulate (MAC) operation in (2) is executed in the same order as the input vector x. We implement this version of the code with simple loops, as in the example of Fig. 14a. On the other hand, we have also run a balanced version of the ANN code, where the MAC operations are reordered in a divide-and-conquer sequence to minimize the number of additions, as in the example of Fig. 14b. Such associative rebalancing procedures are common optimizations performed by compilers, as they reduce the total number of machine instructions and improve execution time on out-of-order processors [28, 59, 102]. The results in Fig.
15 show very little difference in verification time and identical verification outcomes (except for one single timeout with the balanced code). The reason for this behavior is that ESBMC performs several aggressive expression-simplification steps, including associative techniques, as explained in Section 3.8. As such, the final set of SMT formulae fed into the solver is quite insensitive to the order of operations in the ANN code. Thus, we can conclude that the performance of our verification methodology is consistent across different implementations of the same ANN. In all plots and tables, we discriminate between successful verification outcomes (S) and falsifiable safety properties that admit a counterexample (F). The results presented here successfully answer RQ1 - Ablation study: we have identified an optimal configuration for the ESBMC verification engine within our framework, which consists of using the SMT solver Yices and the interval-analysis option in conjunction with FRAMA-C intervals. Moreover, we quantified the individual importance and associated influence of a number of related techniques: constant propagation, expression simplification, slicing, incremental verification, discretization of non-linear activation functions, and code generation. In Section 4.3, we established the best configuration of our verification method by comparing its runtime under different scenarios. Similarly, in the present section, we compare its verification time and output along another dimension: the quantization level of ANNs. Our main result is that the granularity of ANN quantization may influence verification performance, but the effect may be considered minor, depending on the specific aspect being evaluated. Here, we show that this is true both for verification time and verification outcome.
Consequently, ANN quantization can be regarded as a viable and effective tool for adaptation towards a given target platform, as long as some evaluation is performed. Let us first comment on how the quantization of an ANN affects the verification time of its safety properties. Recall that verifying quantized neural networks is PSPACE-hard, as proven by [47]. However, this is a theoretical worst case, and existing empirical results in [39] show a positive correlation between the number of bits used in a quantized representation and the total verification time. Here, we show that this correlation holds only for small numbers of bits and specific safety properties, and that there is no general trend for word lengths equal to or longer than 16 bits. To this end, we run our Iris and Vocalic benchmarks with a broad range of quantization levels, covering the span between the common word lengths of 8, 16, and 32 bits, and extending to smaller word lengths with zero fractional bits. We present such results in Fig. 16a. Note that there is a general upwards trend in verification time for short word lengths (from 6-7 to 15 bits), but this phenomenon almost disappears for longer word lengths (16 bits and above). Moreover, results are spread across six orders of magnitude, and thus it is difficult to prove the existence of a true correlation in the associated data. In fact, applying common summary statistics (e.g., the median verification time, as in [39]) shows only a partial correlation between time and quantization for the Iris benchmarks, and none for the Vocalic ones. A better understanding can be extracted by selecting individual safety properties and comparing their verification time across different quantization levels. We do so in Fig. 16b, where we choose six properties from Fig. 16a that showcase the full range of behaviors. More specifically, we broadly observe three different behaviors.
First, properties like Vocalic "A" 10 and Vocalic "O" 120 exhibit almost identical verification times across all quantization levels. Second, properties like Iris Setosa 10 and Iris Versicolor 40 are somewhat erratic across quantization levels. However, their verification time falls into a limited range, where no systematic trend emerges. Third, properties like Vocalic "U" 40 and Iris Virginica 5 have verification times that are mildly correlated with the quantization level. Overall, we believe that the quantization level has only a minor impact on the hardness of the verification problem from a practical perspective. Other factors, like the number of active neurons or the size of the input regions of a given safety property, are probably better predictors of verification time. However, since these are beyond the scope of the present paper, we leave a thorough exploration of them to future work, where we might establish predictors and bounds. Another aspect regards verification outcomes, where narrower bit widths deserve some discussion. Here, we take the results of the same experiments shown in Section 4.4.1 and plot, in Fig. 17, a summary of how many safety properties are declared safe (S), generate a counterexample (F), or result in a timeout (TO). As the figure shows, the percentage of successful safety properties is stable across quantization levels. The only noticeable differences happen in the Iris and Vocalic benchmarks for small word lengths. In the former, we observe a sudden drop in the number of safe properties between 6 and 7 bits, which goes through behavior that resembles transient responses in control systems [18], until a more suitable representation is achieved (12 bits). In addition, with 6 bits, all safety properties are declared safe, which is indeed due to differences caused by computation with quantized values (see Section 2.2).
Moreover, one may notice a clear trend as formats become wider, indicating an increasing number of correct operations. Regarding the Vocalic benchmarks, we observe a higher incidence of undecided safety properties that lead to a timeout. Note, however, that the Vocalic ANN is larger than the Iris one. Thus, more timeout events are expected due to the additional computational complexity, which is also worsened by the chosen representation. Again, stability regarding the verification outcome is only achieved when a more suitable representation is used (14 bits). In this context, some conclusions can be drawn. Indeed, there is a clear relationship between data representation and safety-property verification when using restricted formats. In addition, it becomes negligible when more bits are used. Moreover, arbitrarily small representations should not be carelessly used, as erratic behavior may be experienced. A more focused picture of the relationship between quantization and verification outcome can be extracted by looking at individual safety properties. To this end, we report, in Tables 2 and 3, all safety properties that have different outcomes across quantization levels. There, we can see two completely opposite behaviors. On the one hand, properties like Vocalic "A" 20, Vocalic "I" 10, Iris Versicolor 50, and Iris Virginica 50 are only safe for very short word lengths. On the other hand, properties like Vocalic "U" 20 and Iris Versicolor 40 tend to be safe as the word length increases. Table 2. Iris safety properties with different verification outcomes across quantization levels (one column per number of bits, from 6 to 32). Table 3. Vocalic safety properties with different verification outcomes across quantization levels. A possible explanation of this behavior is that the properties listed in Tables 2 and 3 are on the verge of breaking the ANN's robustness.
In fact, for each of these properties, reducing the input region size makes them safer, and increasing it makes them less safe. Thus, by staying on the threshold between these two regimes, any minor change in the ANN implementation (e.g., the quantization level) can easily flip the verification outcome and yield the erratic results we observe. Indeed, there is no prediction methodology or technique capable of indicating that such behavior will occur; however, the present work successfully reveals it and hints at how to devise a suitable scheme. For instance, a closed-loop approach could test a chosen set of properties against some possible quantization levels, so that the realization with the smallest number of bits that still provides an output considered stable is chosen. This way, we could track a response dependent on the target quantization level and, once a steady-state region is achieved, pick the narrowest format that allows this behavior. Currently, we can only validate (or refute) a given quantized ANN; we leave the synthesis of neural net implementations as future work. Finally, let us comment on the effect of quantization on the counterexamples returned by our verification approach. As Tables 2 and 3 show, the verification outcome of the same safety property changes depending on the chosen ANN representation. This is because different quantization granularities may either hide or reveal specific vulnerabilities in ANN computation. At the same time, even if the verification outcome is the same and the safety property is kept falsifiable (F), the counterexamples returned by the verification engine may be different. Here, we present a qualitative comparison between counterexamples, which, in the context of machine learning research, are also known as adversarial examples.
Specifically, we focus on our three levels of fixed-point quantization, namely 8, 16, and 32 bits, for which we present a selection of adversarial examples from the Vocalic benchmarks in Figs. 18a, 18b, and 18c, respectively. Each figure contains pairs of images, where the center of the input region is on the left and its corresponding adversarial example is on the right. For each safety property, we report the centroid and associated counterexample, the input region dimension, and the incorrect output label that was generated. As Figures 18a, 18b, and 18c show, the granularity of ANN quantization has a significant effect on the quality of the adversarial examples. On the one hand, coarser fixed-point representations, such as ⟨4, 4⟩, restrict the search space to fewer gray-scale levels, which is clearly seen in Fig. 18a, given the easily noticeable differences. On the other hand, finer quantizations, such as ⟨16, 16⟩, let the verification engine produce counterexamples with minimal noise spread across the whole image (see Fig. 18c). The latter is typical of floating-point ANNs. Besides, it is especially dangerous for this specific case of image classification, as such adversarial examples may go undetected even by a human observer [90]. It is important to notice that quantization levels thus also influence the counterexamples themselves, which can be regarded as adapted to the contexts created by the chosen representation. These results successfully answer RQ2 - Quantization effects: we established that the verification time has some correlation with the number of bits of the ANN quantization. Moreover, we showed that the safety of an ANN is mostly stable across different quantization levels, which supports the use of aggressive quantization in machine learning practice, as long as some verification is performed. This section compares our verification methodology with existing works in the literature.
We note that the field is progressing very rapidly at the time of writing, and thus the present comparison is limited to the tools currently available. Namely, the few existing approaches for verifying quantized ANNs (Giacobbe et al. [39], Baranowski et al. [5], Kai Jia et al. [56], Guy Amir et al. [4]) do not provide reliable source code to replicate their experiments. As such, we can only compare our methodology with earlier tools that verify the safety of ANNs as abstract mathematical models, i.e., in infinite precision. Among those, we choose the two most popular ones, as follows:

• Marabou [62]. Based on the earlier tool Reluplex [61], Marabou uses a simplex-like algorithm to split the verification problem into smaller subproblems and invokes an SMT solver on each of them.

• Neurify [96]. It uses symbolic intervals to over-approximate the ReLU non-linearity of each neuron and turns the verification process into finding the solution of a linear problem. These over-approximations are iteratively tightened by splitting each ReLU activation function into two independent linear problems.

The goal of this comparison is to show that our quantized methodology is at least as efficient as these two state-of-the-art tools. At the same time, notice that our methodology provides more information on the safety of the actual ANN implementations than the abstract safety guarantees provided by tools like Marabou and Neurify. To this end, we choose the AcasXu benchmark as our comparison suite (see Section 4.2.3). This benchmark has the advantage of being already implemented in both Marabou and Neurify, thus allowing us to run the authors' code for a fair comparison of their performance. Furthermore, the neural networks in the AcasXu benchmark contain ReLU activation functions exclusively, which makes them compatible with Neurify.
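As a rough illustration of the interval idea behind tools like Neurify, consider the following heavily simplified sketch. Real symbolic intervals track linear expressions over the inputs, whereas this sketch only propagates plain constant bounds, and all names are ours.

```c
#include <assert.h>

/* Propagate an input interval through an affine map w*x + b and a
   ReLU, yielding a sound over-approximation of a neuron's output. */
typedef struct { double lo, hi; } interval_t;

interval_t affine(interval_t x, double w, double b) {
    interval_t r;
    if (w >= 0.0) { r.lo = w * x.lo + b; r.hi = w * x.hi + b; }
    else          { r.lo = w * x.hi + b; r.hi = w * x.lo + b; }
    return r;
}

interval_t relu_interval(interval_t x) {
    interval_t r;
    r.lo = x.lo > 0.0 ? x.lo : 0.0;
    r.hi = x.hi > 0.0 ? x.hi : 0.0;
    return r;
}
```

When the resulting output bounds already satisfy the safety property, no further analysis of the neuron is needed; otherwise, refinement techniques such as Neurify's directed-constraint splitting tighten the bounds.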
In a similar vein, we focus our comparison on safety property 1 of the AcasXu benchmark, since it is the one that incurs the fewest time-outs with the aforementioned verification tools [62, 97]. Note that 45 different neural networks need to be verified for each safety property of AcasXu, thus giving us a large enough sample size for a significant comparison. Regarding our verification methodology, we choose a 32-bit representation with 28 integer bits (including sign), which are needed to avoid overflows. The summary of our results regarding verification time is shown in Figures 19 and 20. On the one hand, in Fig. 19, we compare our methodology with the SMT-based tool Marabou. Note how our verification methodology is considerably faster than Marabou. We believe this is because our underlying model checker, ESBMC, is more efficient at producing optimized SMT formulae (see Section 3.8) than the custom simplex-like method employed by Marabou [62]. This also explains why the verification times of our methodology are almost constant across the whole comparison suite. On the other hand, in Fig. 20, we compare our methodology with the symbolic interval tool Neurify. Note that this tool has been released by the authors as multi-threaded software [96]. This is the version we compare to in Fig. 20a. However, for the sake of a fair comparison with our methodology, we also present a modified version of Neurify that uses only a single thread in Fig. 20b. Note how the multi-threaded version of Neurify is faster than our methodology in the majority of cases. This is because, on our machine, the multi-threaded Neurify uses up to 22 processors in parallel, giving it an obvious advantage over our single-threaded methodology. At the same time, such an advantage disappears for the single-threaded version: more specifically, against the latter, our methodology is faster in verifying 24 out of the 45 neural networks.
For completeness, we also report the verification outcomes on all 45 benchmarks in Table 4. Note how both Neurify and our methodology are able to successfully verify all 45 neural networks, whereas Marabou incurs a time-out for 13 of them. These results confirm that our methodology offers performance comparable to that of Neurify and faster than the state-of-the-art tool Marabou. Note also that our methodology offers guarantees on the actual implementation of the ANNs, e.g., the ones that would be deployed on an autonomous aircraft in the AcasXu case, thus making it more attractive for practical scenarios where the safety of a deployed system is critical.

Table 4 (benchmarks 1_1 to 3_5).
Benchmark:   1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 2_1 2_2 2_3 2_4 2_5 2_6 2_7 2_8 2_9 3_1 3_2 3_3 3_4 3_5
Marabou:     S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   TO  TO  TO  S   S   S   S   S
Neurify:     S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S
Ours ⟨28,4⟩: S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S   S

These results successfully answer RQ3 - Comparison with SOTA: we evaluated and compared our tool with other state-of-the-art tools based on SMT solving and symbolic intervals, namely Marabou and Neurify, respectively. In terms of correctness, our approach can successfully verify all the benchmarks without timeout or crash. Furthermore, considering the AcasXu property 1 benchmarks, our approach is significantly faster and solves more verification tasks than Marabou, a competitive opponent in SMT-based verification. We believe the work we present in this paper is an essential milestone for verifying fixed- and floating-point ANNs with arbitrary activation functions. This way, it can be considered a unified quantization framework, with the potential for broad model exploration and verification regarding data representation. However, we still want to highlight a few limitations of our verification approach that need to be addressed in future work.
First, we handle non-linear activation functions by replacing them with lookup tables (see Section 3.3). This step is necessary for efficiency reasons but has a drawback: even with a proper resolution, a lookup table will always be an approximation of the original function. Our experiments used lookup tables with a resolution of two decimal places, and all the generated adversarial cases were correctly validated with MATLAB. However, we cannot exclude that our verification approach may produce incorrect adversarial examples or spurious successful verification outcomes in other ANN verification scenarios, especially when a given lookup table's resolution does not match the adopted quantization granularity. This constitutes a potential threat to the validity of our method. Second, the biggest challenge in ANN verification is scaling to large neural networks. In this regard, our Iris and Vocalic benchmarks are small to medium-sized in terms of the number of neurons. Furthermore, the datasets themselves are small, which probably generated ANNs with low robustness to adversarial attacks. Both these factors contribute to keeping the dimensionality of the resulting SMT formulae low and thus help our method achieve competitive verification times. However, a thorough investigation of which factors hamper verification performance, and of how to overcome them, is still required. Finally, authors of quantized verification frameworks do not usually publish the code of their methods, which compromises any direct comparison attempt. Even so, the research presented here is itself state of the art and can pave the way for further research towards ANN deployment in restricted systems based on formal guarantees. This work's main contribution is providing a sound verification approach for checking the safety of MLPs with arbitrary activation functions, taking into account FWL effects in computations (weights, bias, and operations) due to fixed-point implementation, in addition to activation function discretization.
SMT-based approaches [51, 61, 62, 80, 86] have been used for safety verification of ANNs. The main advantage of those techniques lies in the soundness of SMT solvers; however, there is an important drawback: their scalability is limited since they are sensitive to ANN complexity. For this reason, most of them are unable to deal with large ANNs. Wang et al. [96] propose an efficient approach for checking different safety properties of large neural networks, aiming at finding adversarial cases. Their approach is based on two main ideas. First, symbolic linear relaxation combines symbolic interval analysis and linear relaxation to create an efficient propagation method with tighter estimations. Second, directed constraint refinement identifies nodes whose output is overestimated and iteratively refines their output ranges. These two techniques are implemented in a tool called Neurify, which was validated against multiple ANN architectures. Furthermore, to scale up their verification framework, the authors implemented their code using multi-threaded programming techniques. However, like the previous tools [51, 61, 80], Neurify only supports ReLU activation functions. Katz et al. [62] present Marabou, which extends the Reluplex approach and uses lazy search to deal with the non-linearities of activation functions, allowing verification of ANNs with any piecewise-linear activation function. Recently, set-theoretic methods for reachability-based verification have been proposed for verifying ANN-controlled closed-loop systems. In particular, Tran et al. propose the NNV tool [92], which over-approximates the exact reachable set of each layer after applying an activation function; it supports hyperbolic tangent and sigmoid activation functions. Other approaches [49, 54] also employ set-theoretic methods and polynomial approximations of the hyperbolic tangent and sigmoid, using Taylor [54] or Bernstein [49] polynomials.
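As a point of reference for the interval-based methods discussed above, the following sketch shows plain (non-symbolic) interval bound propagation through an affine layer followed by ReLU. Neurify's symbolic linear relaxation yields tighter bounds than this naive baseline; the toy network and helper names here are illustrative only.

```python
def affine_bounds(W, b, lo, hi):
    """Propagate the input box [lo, hi] through y = W x + b with interval arithmetic."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, xl, xh in zip(row, lo, hi):
            # A positive weight maps the lower input bound to the lower output bound;
            # a negative weight swaps them.
            l += w * xl if w >= 0 else w * xh
            h += w * xh if w >= 0 else w * xl
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it maps bounds to bounds directly."""
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

# Toy 2-2-1 ReLU network.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, 1.0]], [0.5]

lo, hi = [-1.0, -1.0], [1.0, 1.0]            # input box
lo, hi = relu_bounds(*affine_bounds(W1, b1, lo, hi))
lo, hi = affine_bounds(W2, b2, lo, hi)
# Resulting output bounds: lo = [0.5], hi = [4.0]
```

Such bounds are sound but loose, since each neuron's dependence on shared inputs is forgotten; this looseness is precisely what symbolic relaxation and directed constraint refinement attack, and it is also why interval analysis is useful as an invariant-inference preprocessing step for SMT encodings.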
Our approach also allows verifying ANNs with non-linear activation functions. The respective approximation is based on lookup tables created with a suitable number of intervals (i.e., an expected error) to avoid the use of non-linear operators in SMT solvers. This approach supports any piecewise-continuous activation function. Robustness and explainability are the core properties of the present study, and applying them to ANNs has shown impressive experimental results. Explainability proved to be a vital property for evaluating safety in ANNs: the core idea is to obtain an explanation for an adversarial case by observing the activation pattern of a subset of neurons described by a given invariant. Gopinath et al. presented formal [40] and data-driven [41] techniques to extract properties from ANNs, which may be used as formal specifications for them; this is a crucial result for ensuring explainable adversarial examples. Robustness is the ability to ensure safe outputs in the presence of disturbances and uncertainties, such as input noise and implementation issues [39]. In this sense, Dey et al. [29] provide a parametric regularization methodology to improve the robustness of ANNs with respect to additive noise. However, sensitivity to FWL effects is not considered in that approach. ANNs are usually designed to work in real arithmetic; however, it has already been shown that safety violations may occur due to floating- [55] and fixed-point [39] implementations. In particular, Baranowski et al. presented a practical SMT-based approach for verifying neural network properties considering fixed-point arithmetic. Their approach employs a realistic model of FWL effects that includes different rounding and overflow models. However, as shown by Henzinger et al. [47], the scalability of this kind of approach is compromised due to the hardness of verifying fixed-point implementations of ANNs.
Therefore, a new method for verifying fixed-point implementations based on abstract interpretation is proposed in [47] to reduce complexity and increase scalability. However, that method can only verify ANNs with piecewise-linear activation functions since it does not consider the propagation of FWL effects through generic non-linear functions. Our approach also considers the FWL effects of fixed-point implementations of ANNs, based on an efficient FWL implementation model that reduces complexity when verifying those ANNs. Our experiments and previous work on verification of fixed-point digital controllers [19] indicate that scalability is not compromised by the use of this FWL implementation model. Our approach, implemented on top of ESBMC, has some similarities with other techniques described here, e.g., the covering methods proposed by Sun et al. [89], model checking to obtain adversarial cases proposed by Huang et al. [51], and incremental verification of ANNs implemented in CUDA by Sena et al. [86]. However, our main contribution concerns our requirements and how we handle, with invariant inference, actual implementations of ANNs with non-linear activation functions, also considering FWL effects. Moreover, the latter results in promptly deployable ANNs, which could be integrated into a unified design framework. Only an ANN's weights, bias descriptors, and the desired input regarding a dataset are required to run our proposed safety verification. For tools such as DeepConcolic [89] and DLV [51], obtaining adversarial cases or safety guarantees for customized ANNs depends on the intrinsic characteristics of the models; for instance, their implementations do not support complex non-linear activation functions. Moreover, Sena et al. [86] do not exploit invariant inference to prune the state-space exploration, which is done in our proposed approach.
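To illustrate the FWL effects discussed above, the sketch below shows one plausible fixed-point model with round-to-nearest and two's-complement wrap-around overflow. This is a sketch under stated assumptions, not our tool's actual FWL implementation model: real implementations may use different rounding and overflow (e.g., saturation) policies, and the `quantize` helper and the <4,4> format are illustrative.

```python
def quantize(x, int_bits, frac_bits):
    """Round-to-nearest fixed-point quantization in an <int_bits, frac_bits>
    format, with two's-complement wrap-around on overflow (one possible
    FWL model; rounding and overflow policies vary across implementations)."""
    width = int_bits + frac_bits              # total two's-complement word width
    scale = 1 << frac_bits
    raw = int(round(x * scale))               # round to the nearest representable step
    half = 1 << (width - 1)
    raw = (raw + half) % (1 << width) - half  # wrap into [-2^(width-1), 2^(width-1) - 1]
    return raw / scale

# An in-range value incurs at most half a step, i.e., 2^-(frac_bits + 1), of error...
w = 0.7071
assert abs(quantize(w, 4, 4) - w) <= 2 ** -5

# ...while an out-of-range value wraps around, a genuine FWL failure mode
# that verification must account for.
assert quantize(9.0, 4, 4) == -7.0
```

Applying such a model to every weight, bias, and intermediate operation is what distinguishes verifying the deployed fixed-point implementation from verifying the idealized real-valued network.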
Verification of ANNs has recently attracted considerable attention, with notable approaches using optimization, reachability, and satisfiability methods. While the former two promise to scale to large neural networks, they achieve such a goal by relaxing and approximating the verification problem. In contrast, satisfiability methods are exact by construction but are confronted with the full complexity of the original verification problem. In this paper, we propose a satisfiability modulo theories (SMT) approach to ANN verification. More specifically, we view the ANN not as an abstract mathematical model but as a concrete piece of software (i.e., source code) that performs a sequence of fixed- or floating-point arithmetic operations. With this view, we can borrow several techniques from software verification and seamlessly apply them to ANN verification. In this regard, we center our verification framework around software model checking (SMC) and empirically show the importance of interval analysis, constant folding, tree balancing, and slicing in reducing the total verification time. Furthermore, we propose a tailored discretization technique for non-linear activation functions that allows us to verify ANNs beyond the piecewise-linear assumptions that many state-of-the-art methods are restricted to. Besides, in our experimental evaluation, we uncovered an important relationship between the granularity of ANN quantization, on the one hand, and verification time and property correctness, on the other. The more granular the quantization, the larger the search space and thus the longer the verification time. This is contrary to the main existing theoretical result in the literature, which states that verifying quantized ANNs is computationally harder than verifying real-valued ones. However, further research is needed to shed more light on this phenomenon.
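The relationship between quantization granularity and search-space size can be illustrated with a small count. This is only a sketch under the assumption that inputs are quantized to steps of 2^-f for f fractional bits; `quantized_ball` is a hypothetical helper, not part of our framework.

```python
import math

def quantized_ball(x, eps, frac_bits):
    """All values on a 2^-frac_bits grid that lie within eps of x."""
    step = 2.0 ** -frac_bits
    lo = math.ceil((x - eps) / step)
    hi = math.floor((x + eps) / step)
    return [k * step for k in range(lo, hi + 1)]

coarse = quantized_ball(0.5, 0.1, 4)   # 4 fractional bits: step 1/16
fine = quantized_ball(0.5, 0.1, 8)     # 8 fractional bits: step 1/256
# Each extra fractional bit roughly doubles the candidates per input
# dimension, so the joint input space a checker must cover grows
# exponentially with both precision and input dimension.
assert len(fine) > len(coarse)
```

This counting argument matches the empirical trend reported above (finer quantization, longer verification), even though it does not by itself settle the complexity-theoretic comparison with real-valued verification.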
Regarding correctness, we observed that narrower bit widths can be used, but each candidate format must be verified before deployment to find the minimum one that still provides broadly correct results. Once that minimum representation is obtained, wider formats will usually also provide correct results, analogous to the stationary response of a curve relating bit width and verification result. We have also evaluated and compared our tool with Marabou and Neurify. Considering the AcasXu property 1 benchmarks [58], we observed that our approach is significantly faster and solves more verification tasks than Marabou [62], a competitive SMT-based verifier. In many cases, Neurify [96] is faster than our tool since it deploys a multi-threaded algorithm to solve the verification tasks. However, note that neither Marabou nor Neurify can verify quantized neural networks, as our approach can. Finally, we believe that the problem of verifying ANNs is still open. More specifically, it is unclear which set of techniques yields the best performance when scaling to large networks. In this regard, our future work includes comparing our verification approach with other existing techniques and optimizing our verification performance even further. In addition, the results of our work can be regarded as first steps towards an approach capable of revealing the most aggressive ANN representation that still operates correctly, aiming at maximum compression for a particular model.

References

[1] Tensorflow: A system for large-scale machine learning
[2] OptCE: A Counterexample-Guided Inductive Optimization Solver
[3] Recognizing safety and liveness
[4] An SMT-based approach for verifying binarized neural networks
[5] An SMT Theory of Fixed-Point Arithmetic
[6] CVC4
[7] The Satisfiability Modulo Theories Library (SMT-LIB)
[8] The SMT-LIB standard: Version 2.0
[9] Satisfiability modulo theories
[10] Measuring Neural Net Robustness with Constraints
[11] Software verification: 10th comparative evaluation (SV-COMP 2021)
[12] CPAchecker: A Tool for Configurable Software Verification
[13] Pattern Recognition and Machine Learning
[14] A Lesson on Verification of IoT Software with Frama-C
[15] Efficient Verification of ReLU-Based Neural Networks via Dependency Analysis
[16] Boolector: An efficient SMT solver for bit-vectors and arrays
[17] EVA, an Evolved Value Analysis for Frama-C: structuring an abstract interpreter through value and state abstractions
[18] DSVerifier-Aided Verification Applied to Attitude Control Software in Unmanned Aerial Vehicles
[19] Verifying fragility in digital systems with uncertainties using DSVerifier v2.0
[20] Verification of Binarized Neural Networks via Inter-neuron Factoring
[21] Introduction to Model Checking
[22] SMT-based bounded model checking of multi-threaded software in embedded systems
[23] Uncertain Data Modeling Based on Evolving Ellipsoidal Fuzzy Information Granules
[24] Efficiently computing static single assignment form and the control dependence graph
[25] Program slicing: Methods and applications
[26] Z3: An efficient SMT solver
[27] Probabilistic Inference Modulo Theories
[28] Parallel tree techniques and code optimization
[29] Regularizing Multilayer Perceptron for Robustness
[30] UCI Machine Learning Repository
[31] Yices 2.2
[32] A dual approach to scalable verification of deep networks
[33] Robust Physical-World Attacks on Deep Learning Visual Classification
[34] Safety Verification and Robustness Analysis of Neural Networks via Quadratic Constraints and Semidefinite Programming
[35] ESBMC v6.0: Verifying C Programs using k-Induction and Invariant Inference
[36] ESBMC v6.0: Verifying C Programs Using k-Induction and Invariant Inference (Competition Contribution)
[37] ESBMC 5.0: an industrial-strength C model checker
[38] ESBMC: Scalable and Precise Test Generation based on the Floating-Point Theory (Competition Contribution)
[39] How Many Bits Does it Take to Quantize Your Neural Network?
[40] Property Inference for Deep Neural Networks
[41] DeepSafe: A Data-Driven Approach for Assessing Robustness of Neural Networks
[42] Supervised Sequence Labelling with Recurrent Neural Networks
[43] Deep learning with Keras
[44] Incremental bounded software model checking
[45] A Survey on Methods and Theories of Quantized Neural Networks
[46] Gaussian Error Linear Units (GELUs)
[47] Scalable Verification of Quantized Neural Networks
[48] Divide and Slide: Layer-Wise Refinement for Output Range Analysis of Deep Neural Networks
[49] ReachNN: Reachability Analysis of Neural-Network Controlled Systems
[50] A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability
[51] Safety verification of deep neural networks
[52] Binarized Neural Networks
[53] Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
[54] Verifying the Safety of Autonomous Systems with Neural Network Controllers
[55] Exploiting Verified Neural Networks via Floating Point Numerical Error
[56] Verifying Low-dimensional Input Neural Networks via Input Quantization
[57] Certified Robustness to Adversarial Word Substitutions
[58] Policy compression for aircraft collision avoidance systems
[59] Verification of Loop and Arithmetic Transformations of Array-Intensive Behaviors
[60] LaVAN: Localized and Visible Adversarial Noise
[61] Reluplex: An efficient SMT solver for verifying deep neural networks
[62] The Marabou Framework for Verification and Analysis of Deep Neural Networks
[63] Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
[64] CBMC - C Bounded Model Checker
[65] Gradient-based learning applied to document recognition
[66] Fixed Point Quantization of Deep Convolutional Networks
[67] Algorithms for Verifying Deep Neural Networks
[68] A Unified Approach to Interpreting Model Predictions
[69] ESBMC-GPU: A context-bounded model checking tool to verify CUDA programs
[70] Introduction to Interval Analysis
[71] Expressive and efficient bounded model checking of concurrent software
[72] ESBMC 1.22
[73] ESBMC 1.22 (Competition Contribution)
[74] Rectified Linear Units Improve Restricted Boltzmann Machines
[75] Verifying Properties of Binarized Deep Neural Networks
[76] A Novel Medical Diagnosis model for COVID-19 infection detection based on Deep Features and Bayesian Optimization
[77] GPU implementation of neural networks
[78] Taming the waves: sine as activation function in deep neural networks
[79] PyTorch: An imperative style, high-performance deep learning library
[80] Challenging SMT solvers to verify neural networks
[81] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
[82] Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
[83] DepthK: A k-Induction Verifier Based on Invariant Inference for C Programs (Competition Contribution)
[84] Advances in verification of ReLU neural networks
[85] Metric Spaces
[86] Incremental Bounded Model Checking of Artificial Neural Networks in CUDA
[87] Fast and Effective Robustness Certification
[88] Implicit neural representations with periodic activation functions
[89] Structural Test Coverage Criteria for Deep Neural Networks
[90] Intriguing properties of neural networks
[91] Evaluating Robustness of Neural Networks with Mixed Integer Programming
[92] NNV: The Neural Network Verification Tool for Deep Neural Networks and Learning-Enabled Cyber-Physical Systems
[93] Verification of Neural Network Behaviour: Formal Guarantees for Power System Applications
[94] Boolean Satisfiability Solvers and Their Applications in Model Checking
[95] Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going
[96] Efficient Formal Safety Analysis of Neural Networks
[97] Formal Security Analysis of Neural Networks Using Symbolic Intervals
[98] SDLV: Verification of Steering Angle Safety for Self-Driving Cars
[99] A Comprehensive Survey on Graph Neural Networks
[100] Output Reachable Set Estimation and Verification for Multilayer Neural Networks
[101] FPGA Implementations of Neural Networks - A Survey of a Decade of Progress
[102] Using Algebraic Transformations to Optimize Expression Evaluation in Scientific Code