key: cord-0058621-mb9n7dfe
authors: ElGhondakly, Roaa; Moussa, Sherin; Badr, Nagwa
title: Handling Faults in Service Oriented Computing: A Comprehensive Study
date: 2020-08-19
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58811-3_67
sha: ac147372c7af79fa1719d56ce69c499e3e240d49
doc_id: 58621
cord_uid: mb9n7dfe

Recently, service-oriented computing has become a prominent development direction, in which software systems are built from a set of loosely coupled services distributed over multiple locations through a service-oriented architecture. Such systems face challenges in integration, performance, reliability and availability, among others, which makes the associated testing activities, needed to avoid faults and system failures, a major challenge in their own right. Services are the fundamental element of service-oriented computing. Thus, the quality of services and service dependability in a web service composition have become essential to managing faults within these software systems. Many studies have addressed web service faults from diverse perspectives. In this paper, a comprehensive study is conducted to investigate the different perspectives on handling web service faults, including fault tolerance, fault injection, fault prediction and fault localization. An extensive comparison is provided, highlighting the main research gaps, challenges and limitations of each perspective for web services. An analytical discussion then follows, suggesting future research directions that can be adopted to overcome these obstacles by improving fault handling capabilities for efficient testing in service-oriented computing systems.

Service-oriented systems are built by combining a set of distributed services that were previously developed for reuse. These loosely coupled services communicate with each other through a predefined protocol, forming a service composition to achieve a complex task [1, 2]. However, the performance and quality of the individual services might be higher than those of the resultant service composition, due to integration issues, unavailability of services, dependencies between services, etc., which lead to fault occurrence. Four main mechanisms have been investigated to achieve dependable service compositions with respect to quality, performance and reliability: fault avoidance, fault removal, fault prediction and fault tolerance. Fault avoidance methods minimize the occurrence of faults, whereas fault removal methods aim to remove faults as soon as possible. Fault prediction techniques predict the number of faults before they occur, while fault tolerance detects faults and tries to resolve them early [3]. In this paper, a comprehensive study is conducted to investigate the studies handling web service faults with respect to the different perspectives shown in Fig. 1. A thorough analytical evaluation discusses the challenges and limitations encountered in each perspective in the state of the art, revealing the main research gaps. In addition, a set of future directions is proposed for effective testing in service-oriented computing (SOC) systems. The rest of the paper is organized as follows: Sect. 2 discusses the main research studies that tackle faults in service-oriented systems according to the given hierarchy. Section 3 analyzes the revealed challenges and limitations, whereas Sect. 4 proposes various future directions for maximizing testing benefits in service orientation. Finally, Sect. 5 concludes the study.

Many studies have addressed diverse perspectives of fault handling in service-oriented computing (SOC) systems, such as fault tolerance, fault prediction and fault injection. Following the hierarchy in Fig. 1, each perspective of fault handling is discussed in the following sub-sections.

Fault tolerance allows a system to keep operating while one or more of its components stop working or tend to fail [4]. Thus, fault tolerance should consist of a detection method and a recovery method. Most of the approaches proposed in this perspective had to combine other fault mechanisms, such as fault localization, fault injection and fault recovery, in order to achieve fault tolerance. Gupta et al. [5] considered fault tolerance from the web service composition point of view, taking into consideration some quality of service (QoS) attributes. A fault injection mechanism was used to inject two types of faults: network faults, by examining the unavailability of a service, and logical faults, by observing incorrect values. The authors then tried to detect the injected faults and recover from them, without trying to find faults that already existed. In addition, only these two types of faults were considered.

In [6], the authors studied two existing fault tolerance mechanisms, time redundancy and space redundancy, with their three strategies: Retry, Active replication and Passive replication. In the Retry strategy, a failed service is re-invoked repeatedly until it either works correctly or reaches the limit of trials. Active and Passive replication depend on using replicas to overcome failures. The Active strategy invokes all replica services and uses the first suitable response returned, whereas the Passive strategy invokes the primary service first and, if it fails, uses a backup replica instead. They proposed integrated simulation algorithms combining all three strategies to control services and modify the ones that did not meet the desired requirements in terms of time, space, etc. However, although the simulation showed high reliability, they did not evaluate reliability on actual data. Fekih et al. [7] used a sensor-based method to detect and locate faults, and then used two repair algorithms to correct faults before a system crash: single service reconfiguration (SSR) and multiple service reconfiguration (MSR) with a metaheuristic search algorithm. However, they did not evaluate the efficiency of their results in comparison with other approaches. The authors in [8] used two ranking algorithms, the HITS-based Service Component Ranking Algorithm (HSCR) and the PageRank-based Service Component Ranking Algorithm (PSCR), to rank services, where the top services could be used later since their fault tolerance ability would be high. However, the approach had critical drawbacks, as it considered neither fault detection nor fault localization methods, whereas fault tolerance should include detection and recovery methods. In addition, localization methods help in finding the location of the faulty service, which eases the recovery process. They only proposed the optimum set of services to use instead of the faulty ones, and only simple services were investigated, neglecting QoS attributes.
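The Retry, Passive replication and Active replication strategies described for [6] can be illustrated with a minimal Python sketch. It is not the simulation algorithm of [6]; the names `ServiceError`, `retry`, `passive_replication` and `active_replication` are hypothetical, and a service is assumed to be any callable that either returns a value or raises `ServiceError`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class ServiceError(Exception):
    """Raised by a (hypothetical) service call that fails."""


def retry(service, request, max_trials=3):
    """Time redundancy: re-invoke the same service up to max_trials times."""
    last_error = None
    for _ in range(max_trials):
        try:
            return service(request)
        except ServiceError as err:
            last_error = err
    raise ServiceError(f"gave up after {max_trials} trials") from last_error


def passive_replication(primary, backups, request):
    """Space redundancy: try the primary first, then each backup in order."""
    for replica in [primary, *backups]:
        try:
            return replica(request)
        except ServiceError:
            continue
    raise ServiceError("primary and all backups failed")


def active_replication(replicas, request):
    """Space redundancy: invoke all replicas, use the first successful reply.

    Leaving the `with` block waits for the slower replicas to finish,
    which keeps the sketch simple at the cost of extra latency.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except ServiceError:
                continue
    raise ServiceError("all replicas failed")
```

In this sketch, Retry spends time on a single service, whereas the two replication strategies spend additional service instances; Active replication trades extra invocations for lower expected latency, while Passive replication only invokes a backup after the primary has failed.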

A fault recovery approach was utilized in [9] through an error handling procedure, but it suffered from a poor response time that affected the overall performance. Chen et al. [10] proposed a 3-phase mechanism consisting of invocation, synchronization and exception phases, with the aid of Petri nets to show the behavior of services. The invocation phase selected the services for composition, the synchronization phase obtained matching services for the selected ones, while the exception phase excluded the services that may cause faults as well as their connected ones. This mechanism became inefficient as the services' resources increased, which affected the reliability of the proposed approach. The authors in [11] proposed a fault tolerance approach for mobile ad hoc networks based on a checkpointing technique, in which the faulty services were recovered by substituting them with ones that satisfied QoS constraints using a fuzzy-rated system. However, their evaluation was based on simulation rather than real-world problems. Besides, they encountered a single point of failure that would affect the reliability and availability of their approach.

A rejuvenation-based fault tolerance scheme was introduced in [12], consisting of 3 sub-phases: failure detection, age evaluation of components, and a checkpointing method for rejuvenation. However, the results lacked accuracy as a fault detection metric and did not address maintaining availability. Miltiadis et al. in [13] proposed an approach for predicting the average execution time of faulty applications based on a checkpoint mechanism. Although this approach improved the performance of faulty software, it did not propose a recovery method to overcome the existing faults. The authors in [14] investigated the impact of checkpointing interval selection on cloud application performance, as well as the relation between the checkpointing interval and the failure probability. Yet, this study did not consider any fault recovery techniques. In [15], a fault injection method was used to inject faults into an application. It detected such faults using a logging mechanism in order to assess the performance of the applied fault tolerance mechanism, without discussing how faults were handled or recovered. Moreover, Jhawar and Piuri in [16] analyzed the characteristics and causes of failure of 2 fault types, crash and Byzantine faults, as well as their corresponding fault tolerance mechanisms. The studies in [17] and [18] considered fault tolerance approaches only, discarding other fault handling perspectives: one study addressed fault tolerance with load balancing algorithms, while the other addressed checkpointing approaches. In [19], Angarita et al. proposed a dynamic fault tolerance approach that could select the best recovery strategy to correct faults based on analyzing some QoS parameters, the execution state and the environment state. However, in practice, depending on the execution state and environment state criteria might not be possible [20]. Xu et al. in [21] discussed the impact of applying a fault recovery mechanism on time and cost, while addressing the problem of workflow scheduling in cloud computing by developing a heuristic-based algorithm. Nevertheless, they ignored that tasks might also fail, not only components.
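The trade-off behind checkpointing interval selection, as studied in [14], can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions (unit-cost steps, free checkpoints and rollbacks, failures injected at fixed step indices); it does not reproduce the models of [13] or [14], and the function name `run_with_checkpoints` is hypothetical.

```python
def run_with_checkpoints(n_steps, interval, fail_at):
    """Run n_steps units of work, checkpointing every `interval` steps.

    fail_at: set of step indices at which a crash is injected the first
    time that step is reached. After a crash, execution rolls back to the
    last checkpoint. Returns (steps executed including re-executed ones,
    number of checkpoints written).
    """
    pending_failures = set(fail_at)
    checkpoint = 0            # progress saved by the last checkpoint
    step = 0                  # current progress
    executed = 0
    checkpoints_written = 0
    while step < n_steps:
        if step in pending_failures:
            pending_failures.discard(step)   # each failure fires once
            step = checkpoint                # roll back to saved progress
            continue
        step += 1
        executed += 1
        if step % interval == 0:
            checkpoint = step
            checkpoints_written += 1
    return executed, checkpoints_written
```

With a single failure at step 7 of 10, an interval of 5 re-executes two steps (12 executions in total), while an interval of 10 re-executes seven (17 in total). A shorter interval reduces the re-executed work but writes more checkpoints, which in a real system carry their own cost.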

Thus, most of the related studies applied fault detection and recovery to fulfill the fault tolerance process, although a few ignored them. Some took QoS into consideration because of its great impact on fault existence. Moreover, some studies applied their proposed approaches to simulated data rather than real data, which weakens the validity of their results.

Fault prediction is the process of predicting whether a software system is faulty or not, as well as the number of faults that may appear in it [22]. Only a few studies have investigated fault prediction in the service-oriented computing paradigm. In [23], the authors proposed a framework to study the ability of source code metrics to predict faults in web services. They managed to predict whether a web service is faulty by applying 5 feature selection techniques to select such metrics. In addition, they considered 6 machine learning algorithms to predict faults: Naïve Bayes, Artificial Neural Networks (ANN), Adaptive Boosting (AdaBoost), Decision Tree, Random Forests and Support Vector Machine (SVM). However, the main drawback was that they could not predict the number of faults. Ding et al. in [24] predicted the service composition reliability and located the faulty services using a spectrum-based fault localization technique, but the proposed method was likewise unable to predict the number of faults. The authors in [25, 26] investigated some machine learning techniques for fault prediction, such as classification algorithms and the multi-layer perceptron, by developing a web service on the Azure cloud platform to predict faulty services, but neglected predicting the number of faults. Chatterjee et al. [27] proposed a prediction algorithm that integrates clustering and fuzzy algorithms to predict more than one fault at a time within one run, whereas in [28], a clustering method was proposed to detect faults using k-means++. A new set of metrics was introduced in [29] to make the fault detection process more efficient and to increase its performance, but the metrics were only validated on PHP-based projects. Overall, the approaches proposed in [27, 28] and [29] focused on predicting faulty components but neglected to predict the type and number of faults.

Consequently, most of the previous studies focused on predicting the presence of faults in service-oriented computing, ignoring the prediction of the number of faults, which would help in deciding how faulty a component is. In addition, it would allow faults to be controlled in order to increase the reliability and performance of services.

Fault localization is the process of locating the faulty code or component [30, 31]. Despite the existence of many software fault localization techniques, only a few studies have been directed to fault localization in SOC paradigms [32]. Sun et al. in [33] proposed a fault localization approach for WS-BPEL programs, in which switching and slicing methods were used. This study addressed fault localization for the interaction section of WS-BPEL programs only, neglecting the other sections. An automatic modeling mechanism was proposed in [34] using the "Policy View" of each service to identify the fault location by building a belief network per service, but it ignored the types of the located faults. Some studies have combined fault localization techniques with other fault handling perspectives, such as fault tolerance in [7] and fault prediction in [24]. These combinations improved the reliability and availability of services, since locating faults facilitated their timely and effective correction.
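Spectrum-based localization, the family of techniques used for service compositions in [24], ranks components by how strongly their execution correlates with failing runs. The following Python sketch uses the widely known Ochiai coefficient as one example of such a ranking; it is an assumption for illustration and not necessarily the formula applied in [24]. The function name `ochiai_ranking` and the input layout are hypothetical.

```python
import math


def ochiai_ranking(coverage, outcomes):
    """Rank services by Ochiai suspiciousness, most suspicious first.

    coverage: list of sets; coverage[i] holds the services invoked in run i.
    outcomes: list of bools; outcomes[i] is True if run i failed.
    """
    total_failed = sum(outcomes)
    services = set().union(*coverage)
    scores = {}
    for service in services:
        # Failed and passed runs in which this service was invoked.
        failed_with = sum(
            1 for cov, failed in zip(coverage, outcomes)
            if failed and service in cov
        )
        passed_with = sum(
            1 for cov, failed in zip(coverage, outcomes)
            if not failed and service in cov
        )
        # Ochiai: ef / sqrt(total_failed * (ef + ep)); 0 if undefined.
        denominator = math.sqrt(total_failed * (failed_with + passed_with))
        scores[service] = failed_with / denominator if denominator else 0.0
    # Sort by descending score, then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, if a hypothetical `pay` service is invoked in every failing run and in no passing run, it receives the maximum score of 1.0 and is ranked first, while a service invoked in all runs receives a lower score.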

Hence, there is still a pressing need to investigate this perspective, with the aim of increasing the reliability and performance of web services. This includes considering different types of located faults, investigating different fault localization techniques, and defining the optimum techniques to apply in SOC systems. In addition, combining fault localization techniques with more fault handling perspectives is an encouraging trend that would promote high-performance and efficient fault handling.

Fault injection is the process of injecting faults into an application to test its reliability and performance by trying to detect and remove such faults [35]. Qian et al. in [36] presented a fault injection process that depended on where faults can be injected and what types of faults can be injected. However, only the sub-service stability coverage criterion was considered, whereas more coverage criteria are required to evaluate the efficiency and reliability of the proposed approach. In [37], an approach was proposed that uses fault injection techniques to assist failure diagnosis when the same failure takes place again. In addition, it used fault localization to detect the root cause of the failure and to recover the system by fixing such faults. Some limitations were encountered, as the proposed approach could neither diagnose all kinds of failures nor support multiple distributed system paradigms. Another combined method was proposed in [38] to estimate the dependability of distributed systems. Contract-based and model-based cyber-physical systems (CPS) were combined, applying a fault injection method to check the system dependability of the resulting model.
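As a hedged illustration of how faults can be injected at the service boundary, the following Python sketch wraps a service call and injects the two fault types reported for [5]: a network fault (the service is unavailable) and a logical fault (an incorrect value is returned). It is not the injection mechanism of [5] or [36]; the names `ServiceUnavailable` and `inject_faults`, the rate parameters and the `corrupt` callback are all assumptions introduced here.

```python
import random


class ServiceUnavailable(Exception):
    """Stands in for a network fault: the service cannot be reached."""


def inject_faults(service, unavailable_rate, corrupt_rate, corrupt, rng):
    """Wrap `service` so that some calls fail or return corrupted values.

    unavailable_rate: probability of raising ServiceUnavailable.
    corrupt_rate: probability of returning corrupt(result) instead of result.
    corrupt: function mapping a correct result to an incorrect one.
    rng: a random.Random instance, so that experiments are repeatable.
    """
    def wrapped(request):
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < unavailable_rate:
            raise ServiceUnavailable("injected network fault")
        result = service(request)
        if roll < unavailable_rate + corrupt_rate:
            return corrupt(result)   # injected logical fault
        return result
    return wrapped
```

A test harness could wrap each service of a composition with `inject_faults(service, 0.1, 0.1, corrupt, random.Random(42))` and then count how many injected faults the fault tolerance mechanism under test detects and recovers from. Passing a seeded `random.Random` keeps the injected fault sequence reproducible across runs.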

Thus, many studies have used fault injection methods for multiple purposes, such as evaluating and testing a system [36, 38], generating representative failure data [39], checking performance and reliability [36, 37], assessing dependability [40] and managing security [41, 42]. Furthermore, fault injection techniques were combined with other fault handling perspectives, such as fault tolerance in [5], to increase the reliability and performance of the system under test. This indicates that there is still a growing interest in how and where to inject faults. A closer investigation is needed to study fault injection mechanisms with respect to SOC paradigms, considering their distributed nature and interoperability.

Although software faults are a broad research field with diverse contributions, fault handling in SOC is still insufficient. This is due to the nature of web services, which differs from that of regular software systems: extra testing is required, and faults can occur easily for various reasons. According to the related state of the art, most of the previous studies considered fault tolerance as the main perspective for fault handling in service-oriented computing paradigms. Some combined it with other perspectives to boost the reliability of the system under test. There is a serious lack of research effort on the other fault handling perspectives. This raises many challenges and limitations, as follows:

• Detecting multiple types of faults and their count remains a persistent challenge: the proposed studies did not cover multiple fault types and neglected the number of faults. They only determined whether a fault exists by injecting faults and trying to detect or localize them, ignoring the detection of actual faults as opposed to synthetic ones.

• Recovery strategies and fault prediction techniques require substantial advancement with respect to the fault localization studies, which considered only synthetic fault injection processes that fail to represent the diversity of faults that SOC systems can face. Such advancement would increase the performance and reliability of these systems.

• Language-independent approaches are needed, since some studies proposed approaches only for specific types of applications written in certain languages, such as WS-BPEL programs and PHP projects.

• Semantic faults need to be considered: most of the current studies detected faults that occur due to integration issues between components, unavailability of one of the composition services, etc., while ignoring faults that occur due to semantic issues between services.

• Handling faults in SOC systems should consider faults leading to task failure, not only faults leading to service failure. This would require new fault localization and injection approaches to address this different level of granularity.

• Investigating how and where to inject faults to improve the performance.

• Most of the current approaches showed high efficiency using simulated services, whereas in practice, these approaches could be inapplicable to real-world applications.

• Addressing different evaluation measurements, metrics and coverage criteria to ensure the quality of the system under test.

• Considering other fault handling perspectives, such as fault prevention.

• The proposed approaches need to be more generic and applicable to different distributed systems.

Table 1 summarizes all deduced drawbacks and challenges of the main fault handling studies in the SOC paradigm, together with the corresponding fault handling perspective considered, the applied mechanisms, the measured evaluation criteria, and the datasets used to examine the applied mechanisms.

New fault handling approaches in SOC systems should address the diversity of fault sources with respect to the interoperability and loose coupling of services, which generate emergent types of faults specific to the nature of SOC systems. Future research should consider innovative approaches to anticipate the number of faults in a tested system, as well as to discover more fault types, such as functionality faults, syntactic faults, control flow faults, semantic faults and service dependency faults in service compositions. In addition, fault localization approaches should investigate hybrid methods that combine static, dynamic and execution slicing-based methods with program spectrum-based methods, etc. [35].

Moreover, system recovery after finding faults is a very important process to consider. Therefore, new recovery strategies should be developed that fit the distributed architecture, increased federation, expanded intrinsic interoperability and vendor diversification of the SOC paradigm. Besides, further fault handling perspectives, such as fault prevention/avoidance techniques, should be tackled in order to improve the performance and quality of SOC systems. In addition, combining multiple fault handling perspectives should be investigated to increase the dependability and reliability of the tested system. Consequently, fault handling in SOC is still an emerging field of research that calls for more investigation.

Although software faults occur frequently in general, faults are even more likely in service-oriented computing (SOC) systems due to the nature of web services, in which the availability of some services is not guaranteed. In this study, we thoroughly investigated the main fault handling approaches that have been presented for SOC paradigms from different perspectives, including fault tolerance, fault prediction, fault localization, fault recovery and fault injection. A detailed comparison was conducted to analyze the challenges and limitations of the current fault handling approaches directed to SOC systems. Moreover, a set of future directions is proposed for efficient fault handling capabilities in SOC systems, for which current research efforts are still insufficient.

Transaction policies for service-oriented computing

A survey of automated web service composition methods

Agent-Based Service-Oriented Computing

A comprehensive survey of fault tolerance techniques in cloud computing

A QoS-supported approach using fault detection and tolerance for achieving reliability in dynamic orchestration of web services

A simulation-based reliability analysis approach of the fault-tolerant web services

The dynamic reconfiguration approach for fault-tolerance web service composition based on multi-level VCSOP

Fault tolerance for web service based on component importance in service networks

Fault tolerance in automatic semantic web service composition based on QoS-awareness using BTSC-DFS algorithm

A formal method to model and analyse QoS-aware fault tolerant service composition

Reliable fault tolerance system for service composition in mobile Ad Hoc network

Software rejuvenation based fault tolerance scheme for cloud applications

Optimum checkpoints for programs with loops

The impact of checkpointing interval selection on the scheduling performance of real-time fine-grained parallel applications in SaaS clouds under various failure probabilities

A methodology for evaluating fault tolerance in web service applications

Fault tolerance and resilience in cloud computing environments

Fault tolerance and load balancing algorithm in cloud computing: A survey

Survey on web services fault tolerance approaches based on checkpointing mechanisms

Modeling dynamic recovery strategy for composite web services execution

Self-adaptation of service compositions through product line reconfiguration

A multi-objective optimization approach to workflow scheduling in clouds considering fault recovery

A study on software fault prediction techniques

An approach for fault prediction in SOA-based systems using machine learning techniques

Online prediction and improvement of reliability for service oriented systems

A systematic review of machine learning techniques for software fault prediction

Development of a software vulnerability prediction web service based on artificial neural networks

Novel algorithms for web software fault prediction

A novel defect prediction method for web pages using k-means++

Predicting defect prone modules in web applications

A survey on software fault localization

An empirical study of fault localization families and their combinations

Survey of software fault localization for web application

Fault localisation for WS-BPEL programs based on predicate switching and program slicing

Automatic belief network modeling via policy inference for SDN fault localization

A survey of software fault localization

Fault injection for performance testing of composite web services

Failure diagnosis for distributed systems using targeted fault injection

Dependability assessment of SOA-based CPS with contracts and model-based fault injection

Towards assessing representativeness of fault injection-generated failure data for online failure prediction

Experimental assessment of cloud software dependability using fault injection

Security testing methodology for evaluation of web services robustness-case: XML injection

Analysis of web application security mechanism and attack detection using vulnerability injection technique

Towards dynamic reconfiguration for QoS consistent services based applications