key: cord-0464103-a4narryz authors: T.Nguyen, Phuong; Rocco, Juri Di; Sipio, Claudio Di; Ruscio, Davide Di; Penta, Massimiliano Di title: Recommending API Function Calls and Code Snippets to Support Software Development date: 2021-02-15 journal: nan DOI: nan sha: c86fccc4725d246d11611374ceaa3125ea6c3d5f doc_id: 464103 cord_uid: a4narryz

Software development activity has reached a high degree of complexity, driven by the heterogeneity of the components, data sources, and tasks. The proliferation of open-source software (OSS) repositories has stressed the need to reuse available software artifacts efficiently. To this aim, it is necessary to explore approaches to mine data from software repositories and leverage it to produce helpful recommendations. We designed and implemented FOCUS as a novel approach to provide developers with API calls and source code while they are programming. The system works on the basis of a context-aware collaborative filtering technique to extract API usages from OSS projects. In this work, we show the suitability of FOCUS for Android programming by evaluating it on a dataset of 2,600 mobile apps. The empirical evaluation results show that our approach outperforms two state-of-the-art API recommenders, UP-Miner and PAM, in terms of prediction accuracy. We also point out that there is no significant relationship between the categories for apps defined in Google Play and their API usages. Finally, we show that participants of a user study positively perceive the API and source code recommended by FOCUS as relevant to the current development context.

When dealing with certain programming tasks, rather than implementing new systems from scratch, developers often make use of third-party libraries that provide the desired functionalities. Such libraries expose their functionality through Application Programming Interfaces (APIs), which govern the interaction between a client project and its incorporated libraries. To use a library properly, it is necessary to use the correct sequence of API calls, known as API usage patterns. The knowledge needed to manipulate an API can be extracted from various sources: the API source code itself, the official website and documentation, Q&A websites such as StackOverflow, forums and mailing lists, bug trackers, other projects using the same API, etc. However, official documentation often merely reports the API description without providing non-trivial example usages. Besides, querying informal sources such as StackOverflow might become time-consuming and error-prone [37]. Also, API documentation may be ambiguous, incomplete, or erroneous [51], while API examples found on Q&A websites may be of poor quality [24]. In this respect, we come across the following motivating question: "Which API calls should this piece of code invoke, given that it has already invoked these API calls?" The problem of recommending API function calls and usage patterns has garnered considerable effort and attention from the research community in recent years [23], [58]. Several techniques have been developed to automate the extraction of API usage patterns [38], aiming to reduce developers' burden when manually searching these sources and to provide them with high-quality code examples. However, these techniques, based on clustering [54], [58] or predictive modeling [11], still suffer from high redundancy and poor run-time performance.
Moreover, most of the existing approaches are based on clustering data from code snippets to recommend API usages, which still suffers from redundancy. In an attempt to overcome these limitations, we proposed FOCUS [26], a novel approach to mining open-source software repositories to provide developers with API FunctiOn Calls and USage patterns. We aim to suggest to developers highly relevant API usages that ease the development process. Our tool distinguishes itself from other tools that recommend API usages as it can provide both function calls and real code snippets that match well with the developer's context. FOCUS has been built on top of a collaborative-filtering recommender system [8], whose fundamental principle is to recommend to users the items that have been bought by similar users in similar contexts. By considering API methods as products and client code as customers, we reformulate the problem of usage pattern recommendation in terms of a collaborative-filtering recommender system. We modeled the mutual relationships among projects using a tensor and mined API usages from the most similar projects. Implementing a collaborative-filtering recommender system requires assessing the similarity of two customers, i.e., two projects. Existing approaches assume that any two projects using an API of interest are equally valuable sources of knowledge. On the contrary, we postulate that not all projects are equal when it comes to recommending usage patterns: a project that is highly similar to the project currently being developed should provide higher quality patterns than a highly dissimilar one does. Our recommender system attempts to narrow down the search scope by considering only the projects that are the most similar to the active project. Thus, methods that are typically used conjointly by similar projects in similar contexts tend to be recommended first. The first prototype of FOCUS [26] has been successfully realized and integrated into the Eclipse IDE, and it is available for download (https://mdegroup.github.io/FOCUS-Appendix/). An empirical evaluation has been conducted on a large number of Java projects extracted from GITHUB and the Maven Central repository to study FOCUS's performance and to compare it with a state-of-the-art tool for API usage pattern mining, i.e., PAM [11]. We simulated different stages of a development process by removing portions of client code and assessing how FOCUS can recommend snippets with API invocations to complete the code being developed. The experiments showed that FOCUS outperforms PAM with regard to success rate, accuracy, and execution time. In this paper, we further extend the evaluation to study if FOCUS can assist mobile developers in finding relevant API function calls as well as real code snippets by means of an Android dataset.
In this sense, the main contributions of our work are summarized as follows: (i) We propose FOCUS as a practical solution to API recommendation, employing a context-aware collaborative-filtering recommender system; (ii) Through a comprehensive evaluation on a dataset collected by mining GITHUB and Google Play, we show that FOCUS is applicable to Android programming, outperforming two state-of-the-art API recommenders; (iii) We investigate how the calibration of the various FOCUS parameters can influence its performance; (iv) We show that the system is capable of achieving good performance regardless of the availability of an extensive training set within the same app category; (v) By means of a clone detection evaluation and a user study, we show that FOCUS can be used to recommend real code snippets relevant for the program artifact being developed; (vi) Finally, we make available both the FOCUS tool and the datasets related to our paper to allow for future replication [27].

The paper is organized as follows. Section 2 introduces a motivating example and background notions. Section 3 brings in FOCUS, our proposed solution to API recommendation. The materials and methods used to evaluate the approach are presented in Section 4, while Section 5 analyzes the key results. The lessons learned and the threats to validity are discussed in Section 6. We present related work and conclude the paper in Section 7 and Section 8, respectively.

We describe a motivating example to introduce the problem addressed by FOCUS in Section 2.1. Section 2.2 gives an overview of the main components of the proposed solution. Afterwards, we introduce the main notions underpinning our approach by reviewing the previous work by Schafer et al. [45] and Chen [8]. The typical setting considered in the paper is shown in Fig. 1(a): a programmer is implementing some methods to satisfy the requirements of the system being developed. The development is at an early stage, and the developer has already used some methods of the chosen API to realize the required functionality. However, she is not sure how to proceed from this point. Under such circumstances, the programmer may browse different sources of information, including Stack Overflow, video tutorials, official API documentation, etc. Figure 1(b) depicts the final version of the snippet in Fig. 1(a). In the framed code, the findBoekrekeningen method queries the available entities and retrieves those of type Boekrekening. To this end, the Criteria API library (https://docs.oracle.com/javaee/6/tutorial/doc/gjivm.html) is used, as it provides useful interfaces for querying system entities according to the defined criteria. FOCUS has been conceived to do exactly the same thing: it is able to suggest to developers recommendations consisting of a list of API method calls that should be used next. Furthermore, it also recommends real code snippets that can be used as a reference to support developers in finalizing the method definition under development.

"Collaborative Filtering (CF) is the process of filtering or evaluating items through the opinions of other people" [45]. In a CF system, a user who buys or uses an item attributes a rating to it based on her experience and perceived value. Therefore, a rating is the association of a user and an item through a value in a given unit (usually in scalar, binary, or unary form). The set of all ratings of a given user is also known as a user profile [8].
Moreover, the set of all ratings given in a system by existing users can be represented in a so-called rating matrix, where a row represents a user and a column represents an item. The expected outcome of a CF system is a set of predicted ratings (aka. recommendations) for a specific user and a subset of items [45]. The recommender system considers the users most similar to the active user (aka. neighbors) to suggest new ratings. A similarity function sim_usr(u_a, u_j) computes the weight of the active user profile u_a against each of the user profiles u_j in the system. Finally, to suggest a recommendation for an item i based on this subset of similar profiles, the CF system computes a weighted average r(u_a, i) of the existing ratings, where r(u_a, i) varies with the value of sim_usr(u_a, u_j) obtained for all neighbors [8], [45]. Context-aware CF systems compute recommendations based not only on neighbors' profiles but also on the context in which the recommendation is demanded. Each rating is associated with a context [8]. Therefore, for a tuple C modeling different contexts, a context similarity metric sim_ctx(c_a, c_i), for c_a, c_i ∈ C, is computed to identify relevant ratings according to a given context. Then, the weighted average is reformulated as r(u_a, i, c_a) [8].

To tackle the problem of recommending API function calls and usage patterns, we leverage the wisdom of the crowd and existing recommender system techniques. In particular, we hypothesize that API calls and usages can be mined from existing codebases, prioritizing the projects that are similar to the one from which the recommendation is demanded. We start with a definition of the main components of our approach in Section 3.1. Afterwards, Section 3.2 presents the conceived architecture in detail.

public List<Boekrekening> findBoekrekeningen() {
  CriteriaBuilder cb = entityManager.getCriteriaBuilder();
  CriteriaQuery<Boekrekening> criteriaQueryBoekrekening = cb.createQuery(Boekrekening.class);
  ??
}
(a) Initial version

public List<Boekrekening> findBoekrekeningen() {
  CriteriaBuilder cb = entityManager.getCriteriaBuilder();
  CriteriaQuery<Boekrekening> criteriaQueryBoekrekening = cb.createQuery(Boekrekening.class);
  Root<BoekrekeningPO> boekrekeningFrom = criteriaQueryBoekrekening.from(BoekrekeningPO.class);
  criteriaQueryBoekrekening.select(boekrekeningFrom);
  criteriaQueryBoekrekening.orderBy(cb.asc(boekrekeningFrom.get(BoekrekeningPO_.rekeningnr)));
  return entityManager.createQuery(criteriaQueryBoekrekening).getResultList();
}
(b) Final version
Fig. 1. The initial (a) and final (b) versions of the findBoekrekeningen method.

A software project is a standalone source code unit that performs a set of tasks. Furthermore, an API is like a black box, i.e., an interface that abstracts the piece of functionality offered by a project by hiding its implementation details. This interface is meant to support reuse and modularity [30], [37]. An API X built in an object-oriented programming language, e.g., the Criteria API in Fig. 1(a), consists of a set T_X of public types, e.g., CriteriaBuilder and CriteriaQuery.
Each type in T_X consists of a set of public methods and fields that are available to client projects, e.g., the method createQuery of the type CriteriaQuery. A method declaration consists of a name, a (possibly empty) list of parameter types, a return type, and a (possibly empty) body, for example the findBoekrekeningen method in Fig. 1(b). Given a set of declarations D in a project P, an API method invocation i is a call made from a declaration d ∈ D to another declaration m. Similarly, an API field access is an access to a field f ∈ F from a declaration d in P. API method invocations MI and field accesses FA in P form the set of API usages U = MI ∪ FA. Finally, an API usage pattern (or code snippet) is a sequence (u_1, u_2, ..., u_n), ∀u_k ∈ U. For the sake of presentation, in the scope of this paper the following terms are used interchangeably: method declaration vs. declaration and API vs. invocation. For each declaration, we extract its method name, the list of its parameter types, and the list of API function calls it performs. In this way, a project is represented as a set of declarations from its constituent classes.

Our tool makes use of a context-aware collaborative-filtering technique to search for invocations from highly relevant projects. This allows us to consider both project and declaration similarities to recommend APIs and code snippets. Following the terminology of recommender systems [8], we treat projects as the enclosing contexts, method declarations as users, and method invocations as items. Intuitively, we recommend a method invocation for a declaration in a given project, which is analogous to recommending an item to a customer in a specific context. For instance, the set of method invocations and the usage pattern (cf. the framed code in Fig. 1(b)) recommended for the declaration findBoekrekeningen can be obtained from a set of similar projects and declarations in a codebase. The collaborative aspect of the approach makes it possible to extract recommendations from the most similar projects, while the context-awareness aspect allows narrowing down the search space further to similar declarations.

The architecture of FOCUS is depicted in Fig. 2. To provide its recommendations, FOCUS considers a set of OSS Repositories (1). The Code Parser (2) component extracts method declarations and invocations from the source code or bytecode of these projects. Project Comparator, a subcomponent of Similarity Calculator (3), measures the similarity between the projects in the repositories and the project under development. Using the set of projects and the information extracted by Code Parser, the Data Encoder (4) component computes the rating matrices introduced later in this section. Afterwards, Declaration Comparator computes the similarities between declarations. From the similarity scores, Recommendation Engine (5) generates recommendations, either as a ranked list of API function calls using API Generator, or as usage patterns using Code Builder, which are presented to the developer. In the remainder of this section, we present each of these components in greater detail. FOCUS relies on Rascal M3 [4] to function. Rascal M3 is an intermediate model obtained by performing static analysis on source code to extract method declarations and invocations from a set of projects. This model is an extensible and composable algebraic data type that captures both language-agnostic and Java-specific facts in immutable binary relations.
These relations represent program information such as existing declarations, method invocations, field accesses, interface implementations, and class extensions, among others [4]. To gather relevant data, Rascal M3 leverages the Eclipse JDT Core Component to build and traverse the abstract syntax trees of the target Java projects. We consider the data provided by the declarations and methodInvocation relations of the M3 model [4]. Both of them contain a set of pairs (v_1, v_2), where v_1 and v_2 are values representing locations. These locations are uniform resource identifiers that represent artifact identities (aka. logical locations) or physical pointers on the file system to the corresponding artifacts (aka. physical locations). The declarations relation maps the logical location of an artifact (e.g., a method) to its physical location. The methodInvocation relation maps the logical location of a caller to that of a callee. Listing 1 depicts an excerpt of the M3 model extracted from the code presented in Fig. 1(a). The declarations relation links the logical location of the method findBoekrekeningen to its corresponding physical location in the file system. The methodInvocation relation states that the getCriteriaBuilder method of the EntityManager type is invoked by the findBoekrekeningen method in the current project.

Once all the method declarations and invocations have been parsed with Rascal, FOCUS represents the relationships among them using a rating matrix. Given a project, each row in the matrix corresponds to a declaration, and each column corresponds to an API call. A cell is set to 1 if the declaration in the corresponding row contains the invocation in the column; otherwise it is set to 0. In Fig. 3, we show an example of the rating matrix for an explanatory project p_1 with four declarations (d_1, d_2, d_3, d_4) and four invocations (i_1, i_2, i_3, i_4). In practice, a matrix is generally large, as it has to accommodate a large number of methods and invocations. We conceptualized a 3D context-based rating matrix to model the intrinsic relationships among various projects, declarations, and invocations. The third dimension of this matrix represents a project, which is analogous to the so-called context in context-aware CF systems. For example, Fig. 4(a) depicts three projects P = (p_a, p_1, p_2) represented by three slices with four method declarations and four method invocations. Project p_1 has already been introduced in Fig. 3 and, for the sake of readability, the column and row labels are omitted from all the slices in Fig. 4(a). There, p_a is the active project and it has an active declaration d_a. Active here means the artifact (project or declaration) being considered or developed. Both p_1 and p_2 are complete projects similar to the active project p_a. The former projects, i.e., p_1 and p_2, are also called background data, since they are already available and serve as a base for the recommendation process. In practice, the more background projects we have, the better the chance that we recommend relevant API invocations. By exploiting the context-aware CF technique, the presence of additional invocations is deduced from similar declarations and projects. Given an active declaration in an active project, it is essential to find the subset of the most similar projects, and then the most similar declarations in that set of projects.
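To make the encoding performed by Data Encoder concrete, the following minimal sketch (hypothetical types and names, not the actual FOCUS implementation, which operates on Rascal M3 data) shows how a single project slice of the rating matrix can be represented as a mapping from declarations to the invocations they contain.

import java.util.*;

// One slice of the 3D context-based matrix: declaration -> set of invocations it performs.
// A cell of the binary rating matrix is 1 iff the declaration contains the invocation.
public class RatingMatrixSketch {

    public static Map<String, Set<String>> projectSlice(Map<String, List<String>> declToInvocations) {
        Map<String, Set<String>> slice = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : declToInvocations.entrySet()) {
            slice.put(e.getKey(), new LinkedHashSet<>(e.getValue()));
        }
        return slice;
    }

    // Binary rating: 1 if declaration d invokes API i in this slice, 0 otherwise.
    public static int rating(Map<String, Set<String>> slice, String d, String i) {
        Set<String> invocations = slice.get(d);
        return invocations != null && invocations.contains(i) ? 1 : 0;
    }

    public static void main(String[] args) {
        Map<String, List<String>> p1 = new LinkedHashMap<>();
        p1.put("d1", Arrays.asList("i1", "i4"));
        p1.put("d2", Arrays.asList("i2", "i3", "i4"));
        Map<String, Set<String>> slice = projectSlice(p1);
        System.out.println(rating(slice, "d1", "i4")); // 1
        System.out.println(rating(slice, "d1", "i2")); // 0
    }
}

Stacking one such slice per background project, plus the slice of the active project, yields the 3D context-based matrix described above.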
To compute similarities, we devised a weighted directed graph that models the relationships among projects and invocations. Each node in the graph represents either a project or an invocation. If project p contains invocation i, then there is a directed edge from p to i. The weight of an edge p → i represents the number of times the project p performs the invocation i. Fig. 4(b) depicts the graph for the set of projects in Fig. 4(a). For instance, p_a has four declarations and all of them invoke i_4. As a result, the edge p_a → i_4 has a weight of 4. In the graph, a question mark represents missing information: for the active declaration in p_a, it is not known yet whether invocations i_1 and i_2 should be included. Considering (i_1, i_2, .., i_l) as the set of neighbor nodes of p, the feature set of p is the vector φ = (φ_1, φ_2, .., φ_l), with φ_k being the weight of node i_k. Each constituent weight is computed as the term frequency-inverse document frequency value, i.e., φ_k = f_{i_k} × log(|P| / a_{i_k}), where f_{i_k} is the weight of the edge p → i_k; |P| is the number of all considered projects; and a_{i_k} is the number of projects connected to i_k. Eventually, the similarity between p and q is computed as the cosine between their corresponding feature vectors φ = {φ_t} and ω = {ω_t}:

    sim_α(p, q) = [ Σ_t φ_t · ω_t ] / [ √(Σ_t φ_t²) · √(Σ_t ω_t²) ]        (1)

where the sums run over the invocations performed by p and q (missing entries are treated as 0). Given that F(d) and F(e) are the sets of invocations of declarations d and e, respectively, the similarity between d and e is calculated using the Jaccard similarity index as follows:

    sim_β(d, e) = |F(d) ∩ F(e)| / |F(d) ∪ F(e)|                            (2)

The API Generator component is a part of Recommendation Engine, and it is used to generate a ranked list of API function calls. As shown in Fig. 4(a), the active project p_a already includes three declarations and, at the time of consideration, the developer is working on the fourth declaration, corresponding to the last row of the matrix. This declaration has only two invocations, represented in the last two columns of the matrix, i.e., the cells marked with 1. The first two cells are filled with a question mark (?), implying that it is not clear whether these two invocations should also be integrated into p_a. API Generator predicts additional invocations for the active declaration by computing the missing ratings with the following collaborative-filtering formula [8]:

    r(d, i, p) = r̄_d + [ Σ_e sim_β(d, e) · (R_{e,i,p} − r̄_e) ] / [ Σ_e sim_β(d, e) ]        (3)

where the sums run over the declarations e most similar to d, whose similarity sim_β(d, e) is computed using Eq. 2; r̄_d and r̄_e are the mean ratings of d and e, respectively; and R_{e,i,p} is the combined rating of e for i in all the similar projects, computed as follows [8]:

    R_{e,i,p} = [ Σ_{q ∈ topsim(p)} sim_α(p, q) · r(e, i, q) ] / k                            (4)

where topsim(p) is the set of top similar projects of p, k = |topsim(p)| is the number of neighbor projects, and sim_α(p, q) is the similarity between p and a project q, computed using Eq. 1. Equation 4 implies that a higher weight is given to projects with higher similarity. In practice, this is reasonable since, given a project, its similar projects contain more relevant API calls than less similar ones. Using Eq. 3 we compute all the missing ratings in the active declaration and obtain a ranked list of invocations with scores in descending order, which is then suggested to the developer. In Eq. 4, a set of k projects is used to compute the ranking, and no matter how large k is, we eventually obtain a real-valued score for each API. Therefore, the final list always contains N items, regardless of k. In our implementation, we employed a sparse matrix to store the 3D tensor. This allows us to optimize both storage and computation, thus increasing the number of neighbor projects that can be used for the recommendation.
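For illustration, the following minimal Java sketch (hypothetical data structures, not the actual FOCUS code) shows how the TF-IDF feature vectors, the cosine similarity of Eq. 1, and the Jaccard similarity of Eq. 2 can be computed; these scores are the inputs used to select the neighbors of Eqs. 3 and 4.

import java.util.*;

public class SimilaritySketch {

    // TF-IDF weight of each invocation for a project: f(p,i) * log(|P| / a(i)),
    // where f(p,i) is how often p performs i and a(i) is the number of projects performing i.
    public static Map<String, Double> tfIdf(Map<String, Integer> invocationCounts,
                                            Map<String, Integer> projectsPerInvocation,
                                            int totalProjects) {
        Map<String, Double> features = new HashMap<>();
        for (Map.Entry<String, Integer> e : invocationCounts.entrySet()) {
            int a = projectsPerInvocation.getOrDefault(e.getKey(), 1);
            features.put(e.getKey(), e.getValue() * Math.log((double) totalProjects / a));
        }
        return features;
    }

    // Cosine similarity between two TF-IDF vectors (Eq. 1); absent entries count as 0.
    public static double cosine(Map<String, Double> phi, Map<String, Double> omega) {
        double dot = 0, normPhi = 0, normOmega = 0;
        Set<String> keys = new HashSet<>(phi.keySet());
        keys.addAll(omega.keySet());
        for (String k : keys) {
            double a = phi.getOrDefault(k, 0.0), b = omega.getOrDefault(k, 0.0);
            dot += a * b;
            normPhi += a * a;
            normOmega += b * b;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normPhi) * Math.sqrt(normOmega));
    }

    // Jaccard similarity between the invocation sets of two declarations (Eq. 2).
    public static double jaccard(Set<String> fd, Set<String> fe) {
        Set<String> inter = new HashSet<>(fd); inter.retainAll(fe);
        Set<String> union = new HashSet<>(fd); union.addAll(fe);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }
}

In FOCUS, these two similarities are used to restrict the computation of the missing ratings to the most similar projects and declarations.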
As of its current version, FOCUS is able to efficiently compute the recommendations and to maintain a trade-off between computational complexity and effectiveness. The Code Builder sub-component is responsible for recommending real code snippets to developers. From the ranked list, the top-N invocations are selected as a query to search the corpus for relevant declarations. To limit the search scope, we consider only the most similar projects. Using the Jaccard index as the similarity metric, for each query we search for declarations that contain as many invocations of the query as possible. Once the corresponding declarations are identified, their source code is retrieved using the declarations relation of the Rascal M3 model. Thanks to its modularity, Rascal is able to decompile and analyze projects written in different programming languages [5], e.g., Java [4], C/C++ [2], and PHP [16]. Rascal also allows us to compute the M3 model from both source code folders and binaries, e.g., JAR files, independently. Thus, we implemented a dedicated function that extracts the real source code of a method declaration by means of the computed M3 model and the project location. Finally, the resulting code snippet is suggested to the developer.

This section describes two use cases that illustrate how FOCUS works in practice. Section 3.3.1 presents the final result produced by FOCUS for the motivating example in Section 2.1, while Section 3.3.2 describes the FOCUS IDE through a real development scenario, where we recommend both a list of API function calls and real source code. In Fig. 1(a), given that findBoekrekeningen is the active declaration, the invocations it contains are used, together with the other declarations in the current project, as the query to feed the recommendation engine. The produced outcome is a ranked list of real code snippets, and we show the top one, named findByIdentifier, in Listing 2.

public List findByIdentifier(String identifier) {
  log.fine("getting Session instance by identifier: " + identifier);
  try {
    CriteriaBuilder cb = entityManager.getCriteriaBuilder();
    CriteriaQuery criteria = cb.createQuery(QuestionsStaged.class);
    Root qs = criteria.from(QuestionsStaged.class);
    criteria.select(qs).where(cb.equal(qs.get("identifier"), identifier));
    log.fine("get identifier successful");
    return entityManager.createQuery(criteria).getResultList();
  } catch (RuntimeException re) {
    log.severe("get identifier failed" + re);
    throw re;
  }
}
Listing 2. Recommended source code for the snippet in Fig. 1(a).

By comparing the recommended code and the original one in Fig. 1(b), we realize that, though they are not the same, they indeed share several method calls and a common intent: both snippets exploit a CriteriaBuilder object to build and perform a query, and eventually retrieve some results. Furthermore, the outcome of both declarations is of the List type. More importantly, compared to the original code in Fig. 1(b), the recommended snippet appears to be of higher quality and robustness. We conclude that, for the motivating example, FOCUS is helpful, since the recommended code, together with the corresponding list of function calls, i.e., get, equal, where, select, etc., provides the developer with practical instructions on how to use the API at hand to implement the desired functionality.
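A snippet such as the one in Listing 2 is located by the Code Builder step described above. As a rough illustration (assumed data structures, not the actual implementation), candidate declarations coming from the most similar projects can be ranked by their Jaccard overlap with the query invocations as follows.

import java.util.*;

public class SnippetRetrievalSketch {

    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    // candidates: declaration identifier -> set of invocations it contains.
    // Returns the candidate identifiers sorted by decreasing overlap with the query,
    // so that the source code of the top candidates can then be retrieved from the M3 model.
    public static List<String> rank(Set<String> queryInvocations, Map<String, Set<String>> candidates) {
        List<String> ranked = new ArrayList<>(candidates.keySet());
        ranked.sort(Comparator.comparingDouble(
                (String d) -> jaccard(queryInvocations, candidates.get(d))).reversed());
        return ranked;
    }
}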
As shown in Fig. 5, FOCUS has been integrated into the Eclipse IDE (instructions to install the IDE: https://bit.ly/3joJpnT). The figure depicts a real development scenario where a developer is implementing the SQLDump project (https://github.com/aparsons/SQLDump) by improving the existing code with recommendations provided by FOCUS. SQLDump is a simple command-line utility that exploits the apache-cli library (http://commons.apache.org/proper/commons-cli/) to execute an SQL query and export the results as a CSV file. The first implementation of the main method prints parameter errors to the console by using Java I/O facilities, i.e., System.out.println (1). FOCUS suggests to the developer both code snippets (2) and (3), and a ranked list of predicted APIs (4) that are relevant to the code being developed. Furthermore, it recommends a possible improvement that includes the usage of the HelpFormatter class (5): the catch statement block is completely defined and the System.out.println invocation is replaced by the HelpFormatter provided by apache-cli. In particular, printHelp is a method of HelpFormatter that prints both the possible parameter errors and an introduction on how to run SQLDump from the command line. As a result, with the help of FOCUS, the developer can learn how to use the method both from the code snippets (3) and from the list of API calls (4).

The goal of this study is to evaluate FOCUS and compare it with two state-of-the-art tools, i.e., UP-Miner [54] and PAM [11], with the purpose of determining the extent to which it can provide a developer with accurate and useful recommendations, featuring code snippets containing API usage patterns relevant for the developer's context. The quality focus relates to the API recommendation accuracy and completeness, the time required to provide a recommendation, and the extent to which developers perceive the recommendations as useful. PAM has been chosen as a baseline for comparison, since it is among the state-of-the-art tools in API recommendation: it has been shown [11] to outperform other similar tools such as MAPO [58] and UP-Miner [54]. To conduct the comparison with PAM, we exploited its original source code, which has been made available online by its authors (https://github.com/mast-group/api-mining). Furthermore, to facilitate future replications, we published all the artifacts together with the tools used in our evaluation on GITHUB [27]. After formulating the research questions in Section 4.1, the following subsections describe the datasets, the analysis methodology, and the evaluation metrics used to evaluate FOCUS. Our study aims to address the following research questions:

RQ1: How does FOCUS compare with UP-Miner and PAM? Both UP-Miner [54] and PAM [11] are well-founded API recommendation tools. UP-Miner has been shown to outperform MAPO [58], while PAM achieves a superior performance compared to both UP-Miner and MAPO. In our previous work [26], we showed that FOCUS outperforms PAM on different datasets collected from GITHUB and MVN. In this work, we compare FOCUS with UP-Miner and PAM on an Android dataset to further study their performance on a new application domain.

RQ2: How successful is FOCUS at providing recommendations at different stages of a development process? For a recommender system, it is essential to return relevant recommendations, indicated by a high number of true positives as well as a low number of both false positives and false negatives. This research question evaluates to which extent our tool can provide accurate and complete results.
RQ3: Is there a significant correlation between the cardinality of a category and the recommendation accuracy? We examine whether, given a testing app, having more apps of the same category is beneficial to the recommendation outcome.

RQ4: We study if the recommended code snippets provided by FOCUS are relevant to support developers in fulfilling their tasks.

RQ5: How are FOCUS recommendations perceived by software engineers during a development task? Finally, we are interested in investigating whether FOCUS is useful from a developer's point of view. To this end, we conducted a user study to evaluate the relevance of the API calls and code snippets provided by FOCUS to support a particular development context. A group of 16 Master's students in Computer Engineering has been involved to assess two real-world development scenarios.

In the following, we describe the dataset used to address RQ1-RQ4, as well as the data extraction method. As explained in Section 4.3, for RQ5 we rely on different datasets, because the aim is to let developers leverage FOCUS recommendations, and the tasks should be simple enough for an experimental setting. While FOCUS is able to work with different data sources as well as programs written in various languages, the evaluation context of this paper focuses on the applicability to a specific domain, i.e., Android programming. Although Android development is per se not very different from the development of other kinds of applications, after the evaluation reported in our previous paper featuring heterogeneous Java programs [26], the aim of this evaluation is to show how, by learning from a training set belonging to applications from the same ecosystem, FOCUS is capable of providing accurate recommendations. We have chosen Android not only because of the large availability of data needed to perform an empirical evaluation, but also because recommending API calls and usage patterns is deemed to be important in Android programming [10]. Since FOCUS accepts as input data extracted by Rascal, which in turn requires a specific format, we devised our own method to acquire an Android dataset eligible for the evaluation. The extraction process needs to comply with certain requirements, and it is illustrated in Fig. 6. First, we exploited the AndroidTimeMachine platform [12] to crawl open source projects. The platform fetches apps from the Google Play store and associates them with their open source counterparts hosted on GITHUB. The crawling process resulted in a set of 7,968 open source Android apps. Most of the apps (82%) in the dataset are written in Java; 4% in Kotlin, 4% in JavaScript, 2% in C++, and 1% in C#. The remaining 7% belong to other languages. As Rascal can parse only certain programming languages, from the initial dataset we filtered out irrelevant projects to select only the Java and Kotlin ones, which account for the majority of the apps. Afterwards, we retrieved the corresponding compiled APK files by querying the Apkpure platform (https://apkpure.com/) using tailored Python scripts [46]. The process culminated in the final corpus consisting of 2,600 APK binary files (mined from Apkpure) together with additional metadata (mined from Google Play), including authors, categories, star rating, price, and the number of downloads. By carefully inspecting the data, we realized that most of the apps are highly rated and have a high number of downloads. We decompiled the APKs into the JAR format by means of the dex2jar tool [1].
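As an illustration of this decompilation step (not the authors' actual pipeline scripts), an APK can be converted to a JAR by invoking dex2jar from Java. The script name and flag below follow common dex2jar distributions and may differ across versions.

import java.io.File;
import java.io.IOException;

public class ApkToJarSketch {

    // Converts one APK to a JAR by running the dex2jar command-line tool.
    // Assumes d2j-dex2jar.sh is on the PATH; "-o" specifies the output file.
    public static void decompile(File apk, File outputJar) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("d2j-dex2jar.sh", "-o", outputJar.getAbsolutePath(),
                apk.getAbsolutePath())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("dex2jar failed for " + apk.getName());
        }
    }
}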
The JAR files were then fed as input to Rascal to convert them into the M3 format, which can eventually be consumed by FOCUS. In total, there are 26,854 API functions in the whole dataset, and most of them are invoked by a small number of declarations, and thus projects (for the sake of presentation, from now on the two terms "app" and "project" are used interchangeably): 15,731 APIs are called in only one project. Only a tiny fraction of the APIs is extremely popular, being included in a large number of projects: ten APIs are called in more than 1,900 projects and 15,000 declarations. The most popular API call is java/lang/StringBuilder/append(java.lang.String), and it appears in 2,512 projects and 54,828 declarations. Altogether, this reflects the long-tail effect which has already been encountered in third-party library recommendation [25]. Such an effect can be expressed as follows: for many outcomes, about 80% of the consequences originate from 20% of the causes [19]. When we apply this to API recommendation, it is interpreted as: "About 80% of the APIs come from 20% of the apps." As has been shown in various studies [25], [52], providing products in the long tail is beneficial to the final recommendations. In a similar fashion, we suppose that the ability to suggest APIs rarely included by apps is of particular importance, as this may help discover useful APIs that are normally obscured from search engines. A summary of the categories and their corresponding number of items in the considered dataset is also provided. Due to the space limit, we cannot show and discuss all the figures here; please refer to the online appendix for more details (https://mdegroup.github.io/FOCUS-Appendix/). With this dataset, we aim at evaluating if the proposed approach is able to support mobile developers in diverse application domains as well as with various levels of app maturity, thereby attempting to resemble real-world development scenarios. We use the collected dataset in RQ1, RQ2, RQ3, and RQ4 to evaluate FOCUS as well as to compare it with the two baselines.

Finally, the following main steps are conducted to create the required metadata, which can then be used to feed FOCUS:
• the corresponding Rascal M3 model is generated for every project in the dataset;
• the corresponding ARFF representation (https://www.cs.waikato.ac.nz/ml/weka/arff.html) of each M3 model is generated in order to be used as input for applying FOCUS and PAM during the actual evaluation steps discussed in the next sections.

To evaluate FOCUS in RQ1-RQ4, we simulate the behavior of a developer who is programming a project and needs practical recommendations to complete it. Figure 7 provides an intuition on how the extraction of an active/testing project p_a is done. The project consists of a set of declarations, and they are divided into three parts, namely P1, P2, and P3, which are explained as follows.
• P1: A set of complete declarations, e.g., Declaration 1, Declaration 2, etc.;
• P2: A testing declaration: for this declaration, only a portion of code is available to feed the recommendation engine, while the rest is removed and saved as ground-truth data. This corresponds to the scenario in Fig. 1(a), where the developer is implementing the active declaration d_a and needs recommendations on the next APIs to be added;
• P3: Removed declarations: a certain part consisting of some declarations is removed.
This aims to simulate the scenario in which the developer is only at an early stage of the project. Correspondingly, there are the following parameters:
• ∆ is the number of declarations in p_a (∆ > 0);
• only δ declarations (δ < ∆) are used as input for the recommendation, and the rest is discarded;
• in total, d_a has Π invocations; however, only the first π invocations (π < Π) are provided as query, and the rest is ground-truth data;
• k is the number of neighbor projects (cf. Section 3.2.4);
• given a ranked list of APIs, the developer typically pays attention to the top-N items only, i.e., N is the cut-off value for the list.

For d_a, only half of the code lines of the method's body is selected to feed the recommendation engine. In fact, Rascal can parse only compilable code; thus, there might be compilation errors at some points where the code is incomplete. As a result, in practice, we suppose that FOCUS can provide recommendations only when the developer temporarily stops at a certain point where the whole declaration becomes compilable. Thus, to increase the applicability of FOCUS, a developer should try to make the code compilable as soon as possible by closing open loops, try/catch blocks, return statements, etc. This is supported quite well by IDEs such as Eclipse, which automatically recommend and insert closed loops and try/catch blocks. In this respect, we suppose that in most cases the code is executable, though it is not yet complete.

Table 1 shows four configurations, i.e., C1.1, C1.2, C2.1, and C2.2, corresponding to different combinations of δ and π:
• C1.1 (δ ≈ ∆/2, π = 1): nearly the first half of the declarations is used and the second half is discarded. The last declaration of the first half is selected as the active declaration d_a. For d_a, only the first invocation is provided as query, and the rest is used as ground-truth data, i.e., GT(p). This configuration represents an early stage of the development process and, therefore, only limited context data is available to feed the recommendation engine.
• C1.2 (δ ≈ ∆/2, π = 4): similarly to C1.1, almost the first half of the declarations is retained and the second half is removed. d_a is the last declaration of the first half. For d_a, the first four invocations are provided as query, and the rest is GT(p).
• C2.1 (δ = ∆ − 1, π = 1): the last method declaration is selected as testing, i.e., d_a, and all the remaining declarations are used as training data. In d_a, the first invocation is kept and all the others are taken out as ground-truth data GT(p). This mimics a scenario where the developer has almost finished implementing p.
• C2.2 (δ = ∆ − 1, π = 4): similar to C2.1, d_a is selected as the last method declaration, and all the remaining declarations are used as training data. The only difference with C2.1 is that in d_a the first four invocations are used as query and all the remaining ones are used as GT(p).
Table 1. The configurations used in the evaluation.

Furthermore, C1.1 and C1.2, as well as C2.1 and C2.2, are pairwise related. For example, both C1.1 and C1.2 have the same number of method declarations (δ), but they differ in the number of invocations in the testing declaration (π). For the purpose of validation, the original dataset (cf. Section 4.2.1) was split into two independent parts, namely a training set and a testing set. In practice, the training set represents the OSS projects that have been collected ex ante, and they are available at the developer's disposal, ready to be exploited for any mining purposes. The testing set represents the project being developed, i.e., the active project. We opted for k-fold cross-validation [20], as it has been widely chosen to study machine learning models. Depending on the availability of input data, the dataset with n elements is divided into f equal parts, so-called folds. For each validation round, one fold is used as testing data and the remaining f−1 folds are used as training data. In our evaluation, two values were selected, i.e., f=10 and f=n. The former corresponds to ten-fold cross-validation, while the latter corresponds to leave-one-out cross-validation [56], and they are exploited depending on the purpose as well as the availability of data. With ten-fold cross-validation, we shuffle the list of the apps considered in the evaluation and then randomly split them into ten equal parts. In the evaluation, we attempt to equally distribute the projects into the folds, so as to maintain a balance among the folds with respect to the projects' size. For every experiment, the execution is done ten times: each time, one fold is used for testing, and the remaining nine folds are used as training data. Eventually, we averaged out the metrics obtained from the ten folds to get the final results.
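To make the simulation concrete, the following minimal sketch (hypothetical types, not the actual evaluation scripts) shows how a testing project can be turned into a query and ground-truth data for a given configuration of Table 1, i.e., for chosen values of δ and π.

import java.util.*;

public class TestSplitSketch {

    public static class Split {
        public final List<List<String>> contextDeclarations; // complete declarations kept as context
        public final List<String> query;                     // first pi invocations of d_a
        public final List<String> groundTruth;               // remaining invocations of d_a, i.e., GT(p)
        Split(List<List<String>> c, List<String> q, List<String> g) {
            contextDeclarations = c; query = q; groundTruth = g;
        }
    }

    // declarations: each inner list is the ordered invocation sequence of one declaration.
    // Assumes 1 <= delta <= declarations.size(); the last of the kept declarations is d_a.
    public static Split split(List<List<String>> declarations, int delta, int pi) {
        List<List<String>> kept = new ArrayList<>(declarations.subList(0, delta));
        List<String> active = kept.remove(kept.size() - 1);      // the active declaration d_a
        int cut = Math.min(pi, active.size());
        List<String> query = new ArrayList<>(active.subList(0, cut));
        List<String> gt = new ArrayList<>(active.subList(cut, active.size()));
        return new Split(kept, query, gt);
    }
}

For instance, configuration C1.2 corresponds to calling split(declarations, declarations.size() / 2, 4) on a testing project.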
For a testing project p, the outcome of a recommendation process is a ranked list of invocations, i.e., REC(p). We believe that the ability to provide accurate invocations is important in the context of software development. Thus, we are interested in how well a system can recommend API invocations that eventually match those stored in GT(p). To measure the performance of UP-Miner, PAM, and FOCUS, we utilize success rate, precision, recall, Levenshtein distance, and time. Given that REC_N(p) is the set of top-N items and match_N(p) = GT(p) ∩ REC_N(p) is the set of items in the top-N list that match those in the ground-truth data, the metrics are defined as follows.

Success rate. For a set of P testing projects, this metric measures the rate at which a recommender can return at least a match among the top-N items for every project p ∈ P:

    success rate@N = |{ p ∈ P : |match_N(p)| > 0 }| / |P|        (5)

Precision and recall. Precision P@N is the ratio of the top-N recommended items found in the ground-truth data to N, while recall R@N is the ratio of the ground-truth items found in the top-N items:

    P@N = |match_N(p)| / N                                        (6)

    R@N = |match_N(p)| / |GT(p)|                                  (7)

Levenshtein distance. Given two strings s_1 and s_2, the Levenshtein edit distance between them corresponds to the minimum number of single-character edits (insertions, deletions, and substitutions) needed to transform s_1 into s_2 (https://dzone.com/articles/the-levenshtein-algorithm-1). The metric is defined by the following recurrence:

    lev_{s1,s2}(i, j) = max(i, j)                                                 if min(i, j) = 0
    lev_{s1,s2}(i, j) = min( lev(i−1, j) + 1,
                             lev(i, j−1) + 1,
                             lev(i−1, j−1) + [s1_i ≠ s2_j] )                      otherwise        (8)

where i and j are the terminal character positions of strings s_1 and s_2, respectively.

Recommendation time. The time needed for the systems to generate predictions is measured using a laptop with an Intel Core i5-7200U CPU @ 2.50GHz×4, 8GB RAM, and Ubuntu 16.04.
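As an illustration (simplified, hypothetical types; not the actual evaluation code), the accuracy metrics defined above can be computed for the ranked lists of a set of testing projects as follows.

import java.util.*;

public class MetricsSketch {

    // match_N(p): intersection of the ground truth with the top-N recommended items.
    public static Set<String> matchN(Set<String> groundTruth, List<String> recommendations, int n) {
        Set<String> topN = new LinkedHashSet<>(recommendations.subList(0, Math.min(n, recommendations.size())));
        topN.retainAll(groundTruth);
        return topN;
    }

    public static double precisionAtN(Set<String> gt, List<String> rec, int n) {
        return (double) matchN(gt, rec, n).size() / n;
    }

    public static double recallAtN(Set<String> gt, List<String> rec, int n) {
        return gt.isEmpty() ? 0 : (double) matchN(gt, rec, n).size() / gt.size();
    }

    // Success rate over a set of projects: fraction with at least one match in the top-N.
    public static double successRate(List<Set<String>> gts, List<List<String>> recs, int n) {
        int hits = 0;
        for (int i = 0; i < gts.size(); i++) {
            if (!matchN(gts.get(i), recs.get(i), n).isEmpty()) hits++;
        }
        return gts.isEmpty() ? 0 : (double) hits / gts.size();
    }
}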
RQ1. To address RQ1, we compare the performance of FOCUS with that of UP-Miner and PAM. Our experience [26] reveals that PAM cannot scale well with large datasets, i.e., it suffers from a high computational complexity. Meanwhile, FOCUS is more efficient, as it is capable of incorporating a large number of background projects and swiftly producing recommendations. In particular, both systems were run on a mainstream laptop using a set of 549 training projects of 80MB in size to measure the execution time [26]. On average, PAM requires 320 seconds to provide a recommendation, while FOCUS needs just 1.80 seconds. Through a careful observation of the Android dataset (cf. Section 4.2.1), we realized that many of the apps are big in size, and a training set of 2,360 apps may add up to more than 2.0GB. This essentially means that it is infeasible to run PAM on the entire dataset, since the execution time may soar exponentially. Thus, for RQ1 we can leverage only a portion of the original corpus. To be more precise, we selected 500 apps of average size. There are 39 categories in total and most of them contain a small number of apps, while Tools is still the biggest category with 151 apps, accounting for 30.20% of the total amount. We opted for leave-one-out cross-validation [56], aiming to exhaustively exploit the background data. We study the performance of FOCUS by considering all four configurations listed in Table 1, i.e., C1.1, C1.2, C2.1, and C2.2. The cut-off value N is used to investigate how accurately the system is able to provide recommendations with respect to different lengths of the ranked list. In RQ1, we set N to 30, attempting to study the three systems on a long list of recommendations. We also consider, as can be seen in Eq. 4, different values of the number of neighbor apps, i.e., k={1, 2, 3, 4}. The evaluation was executed 500 times: in each validation round, one app is used as testing and all the remaining 499 apps are used for training. To allow for a reliable comparison, we ran UP-Miner and PAM using their original settings in our evaluation.

RQ2. For this research question, we made use of the whole corpus introduced in Section 4.2.1, which contains all the 2,600 collected apps. Moreover, since we have a larger amount of data compared to RQ1, we employ ten-fold cross-validation in this research question. We analyze the performance of FOCUS for combinations of: (i) different configurations, i.e., C1.1, C1.2, C2.1, and C2.2; (ii) different values of N, i.e., N={1, 5, 10, 15, 20}; and (iii) different values of k, i.e., k={1, 2, 3, 4, 6, 10}. The rationale behind the selection of these specific values is as follows. We should incorporate only a certain number of neighbor projects k when computing recommendations; otherwise, the matrix would become too big (cf. Fig. 4(a)), which possibly induces an expensive computational cost. While such large values of N seem unrealistic, in the scope of our evaluation we have to consider them to ensure the generalizability of our final conclusions. In practice, a small enough number of N items should be presented to the developers, so as to avoid overwhelming them. We report, for different configurations and values of N and k, the success rate and the performance gain. Also, we plot the precision/recall curves for different configurations and values of k.

RQ3. To address RQ3, we perform controlled experiments on the whole dataset described in Section 4.2.1. Similar to RQ2, we conducted the experiments following the ten-fold cross-validation methodology. The apps collected in the corpus span a total of 47 categories, such as Productivity, Communication, Music & Audio, or Business. The cardinality (i.e., the number of apps within a category) of the categories varies considerably: most of them contain a small number of apps, i.e., ranging from 1 to 20 items for almost half of the topics. The biggest category, with 659 apps, is Tools, while there are three categories with only two apps, i.e., Trivia, Music, and Parenting.
With this research question, we aim at examining if there is a strong positive correlation between two variables, i.e., the cardinality of a category and the corresponding precision. In other words, we hypothesize that apps belonging to populous categories might possibly get better recommendations, since they have more, presumably relevant, background data, i.e., projects coming from the same domains. This would have an impact in practice as follows: once the developer specifies one or more domains for her app, we can search for recommendations just by looking for apps within the same domains, aiming to narrow down the search scope. This is useful since it contributes to a reduction in the overall execution time. However, this is a pure assumption, which needs to be carefully studied through concrete experiments. For each category, we computed the precision for all of its constituent apps following Eq. 6, and the precision of a category was averaged out over the apps. Eventually, the correlation between the cardinality and the precision is computed using the Spearman's rank and Kendall's rank correlation coefficients, i.e., ρ and τ, respectively. The coefficients range from −1 (perfect negative correlation) to +1 (perfect positive correlation), while ρ=0 or τ=0 implies that the variables are not correlated at all. The reason why we compute both Spearman's and Kendall's correlation is that the number of categories is relatively small, and Spearman's correlation may be more suitable in this case. We do not use Pearson's correlation, as we cannot assume the presence of a linear relationship between categories and precision.

RQ4. In this research question, we study if FOCUS is able to recommend source code relevant to the method declaration under development, exploiting the ten-fold cross-validation technique. As an example, we assume that the developer is working on the incomplete code snippet depicted in Fig. 1(a), and FOCUS is expected to suggest real code such as the one in Fig. 1(b), or the one in Listing 2. To evaluate the similarity between two declarations, we compare their constituent APIs. This comparison is based on the observation, coming from existing work [22], that if projects or declarations share API calls implementing the same requirements, then they are considered to be more similar than those that do not have similar API usage. Following the same line of reasoning, we evaluate the similarity/relevance between two snippets by examining if they share common API function calls and have the same sequence of these calls. To address this research question, we leverage the dataset of 500 apps also used to address RQ1. We deliberately make use of such a small dataset for the following reason: with this dataset, we analyze the ability of FOCUS to recommend relevant code snippets, given that there is a fairly small amount of training data. We conjecture that, as confirmed later in the paper, if FOCUS works effectively on a small dataset, it will perform well on bigger ones. To evaluate if a recommended snippet is relevant to the query, we measure the level of similarity between them using the Levenshtein edit distance [21], which has been used by prior work for similar purposes, e.g., tracking source code clones [49]. Given the source code of a declaration d_1, we parse it using Rascal to get the API invocations. Afterwards, we encode each of the invocations using a unique character, resulting in a string s_1 (a minimal sketch of this encoding and of the distance computation is given below).
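The following sketch (illustrative names, not the actual evaluation code) maps each distinct API invocation to a unique character and compares the resulting strings with the standard dynamic-programming Levenshtein distance of Eq. 8.

import java.util.*;

public class ApiSequenceDistanceSketch {

    // Encode a sequence of invocations as a string, assigning one character per distinct API.
    public static String encode(List<String> invocations, Map<String, Character> alphabet) {
        StringBuilder sb = new StringBuilder();
        for (String inv : invocations) {
            alphabet.putIfAbsent(inv, (char) ('A' + alphabet.size()));
            sb.append(alphabet.get(inv));
        }
        return sb.toString();
    }

    // Classic dynamic-programming Levenshtein distance between two strings (cf. Eq. 8).
    public static int levenshtein(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        Map<String, Character> alphabet = new HashMap<>();
        String s1 = encode(Arrays.asList("createQuery", "from", "select"), alphabet);
        String s2 = encode(Arrays.asList("createQuery", "select", "where"), alphabet);
        System.out.println(levenshtein(s1, s2)); // small distance = similar API sequences
    }
}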
Thus, the evaluation of the similarity between two declarations d_1 and d_2 boils down to comparing the corresponding strings s_1 and s_2, by counting the number of edit operations needed to convert s_1 into s_2 using Eq. 8. Such a metric takes into account not only the common characters between s_1 and s_2, but also the order in which they appear. Correspondingly, this means that two code snippets are similar/relevant if they share common API function calls and have the same sequence of calls. In this sense, the smaller the distance, the more similar the two snippets are, and vice versa. To simplify the comparison performed in RQ4, we only used configuration C1.2 (cf. Table 1). The rationale behind the selection of this configuration is as follows: it represents a more authentic development scenario, corresponding to the situation where the developer has already finished a part of the declaration and expects to get recommendations. To be more concrete, given a testing project, we kept the first half of the declarations and removed the second half; the last declaration of the first half is selected as the testing one, d_a. For d_a, the first four invocations are provided as query, and the rest is GT(p). Using the Code Builder subcomponent (cf. Section 3.2.5), we extracted the real source code of a declaration by means of the computed M3 model and the project location. In fact, APK files do not contain source code, thus it is not possible to directly mine real code snippets from the apps. However, FOCUS allows us to extract the canonical method name of a recommended code snippet within the project scope. Moreover, since the dataset is extracted from AndroidTimeMachine, there is a mapping between open-source Google Play apps and their corresponding repositories. To locate the right pair of APK file and GITHUB repository, we check the snapshot date when the mapping was created. In this way, we are able to trace back to the original source code for those apps that have a counterpart in GITHUB. Eventually, FOCUS is able to recommend source code, as long as the corresponding app is associated with a source project rooted in GITHUB.

In this section, we study the usefulness of FOCUS's code and API recommendations by means of a task-based user study to address RQ5. The goal of this study is to evaluate FOCUS with the purpose of understanding whether it could help developers with their implementation tasks. The quality target of the study is the perceived usefulness that developers have of the recommendations (code snippets and APIs) provided by FOCUS. The context consists of participants, i.e., 16 Master's students in Computer Engineering, and objects, i.e., programs involving command-line argument parsing and HTML download/parsing. As shown in Table 2, the experimental design is a crossover design in which participants were split into four groups. The tasks rely on the commons-cli (https://commons.apache.org/proper/commons-cli/) and jsoup (https://jsoup.org/) libraries and require the completion of three partially implemented methods. commons-cli provides APIs for parsing command-line options passed to programs, while jsoup is a library for parsing and manipulating HTML pages using the best of DOM, CSS, and jquery-like methods.

@Test
public void parseOKTest() throws Exception {
  String[] arguments = new String[]{"-url", "a", "-pass", "pass", "-user", "user", "-sql", "sql"};
  assertEquals(4, Launcher.parse(arguments).size());
}

@Test
public void printUsageTest() throws IOException {
  assertNotEquals("", Launcher.printUsage());
}
Listing 4. The unit tests for checking the correctness of the task.
For the tasks with commons-cli, the participants completed three methods by: (i) implementing a method for specifying the command-line options (we provided the evaluator with the parameter list); (ii) parsing the command-line parameters and throwing an exception if the mandatory ones are missing; and (iii) handling the parsing exception by printing the possible options to the console. Listing 3 shows an example of the partial implementation and the method requirements for specifying options. For a detailed description of the two performed tasks, due to the space limit, interested readers are kindly referred to our online appendix (https://mdegroup.github.io/FOCUS-Appendix/tasks.html). For each method to be completed, we provided (for treatments having the availability of FOCUS) each evaluator with the top-5 snippets and top-20 method invocations recommended by FOCUS, giving the initial and partial method implementation as input. Under the circumstances in which the experiment was conducted, it was neither possible to perform the experiment in a laboratory nor to ask participants to return the results immediately. Instead, each participant could perform the tasks offline and return them to us. Before the study, we held an introductory session in video conference, in which we introduced to the participants the goals and tasks of the study (without details about our research question, to avoid biasing them) and left them a detailed instruction document. During the tasks, participants could access any resource available on the Internet, besides the FOCUS recommendations when available, based on the study design. Once a participant finished the tasks, s/he had to complete a questionnaire consisting of the following questions: (i) three general questions asking about their experience in programming and with code search engines; (ii) four questions, on a 5-level Likert scale [29], related to the understandability and complexity of the assigned tasks; and (iii) four questions to evaluate the relevance and usefulness of the recommendations provided by FOCUS. Moreover, we asked the participants to submit their implementations. Such implementations have been used to assess the correctness of the resulting code. For each method to be completed, we defined a specific JUnit unit test for checking its correctness. We did not provide the evaluators with the test methods to avoid biasing the experiment. Listing 4 reports the simple testing methods used to check the correctness of the submitted task. Although the unit tests are rather simple, they have been able to effectively catch possible implementation failures. Then, we involved a senior developer experienced with Java programming and the jsoup and commons-cli libraries to further investigate the method implementations for which the unit tests fail. The senior developer checked the severity of the identified errors and discarded those that are not related to the usage of the involved library. For instance, some evaluators named the parameters differently, e.g., they used password instead of pass or username instead of user. Consequently, the dedicated parseOKTest test fails because of a wrong parameter naming. We marked this type of failure as a minor one, and we considered the implementation as correct for the evaluation scope.
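To give a concrete flavour of the three steps of the commons-cli task described above, a possible solution is sketched below. The class name is hypothetical, the option names follow the unit test in Listing 4, and the parse method returns a map only so that its shape matches the size() call in that test; this is an illustrative sketch, not the reference solution distributed to participants.

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class LauncherSketch {

    // (i) Specify the command-line options; all four parameters are marked as required.
    static Options buildOptions() {
        Options options = new Options();
        for (String name : new String[]{"url", "user", "pass", "sql"}) {
            options.addOption(Option.builder(name).hasArg().required()
                    .desc("value of the " + name + " parameter").build());
        }
        return options;
    }

    // (ii) Parse the parameters; a ParseException is thrown if a mandatory option is missing.
    static Map<String, String> parse(String[] args) throws ParseException {
        CommandLine cmd = new DefaultParser().parse(buildOptions(), args);
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : new String[]{"url", "user", "pass", "sql"}) {
            values.put(name, cmd.getOptionValue(name));
        }
        return values;
    }

    // (iii) Handle parsing errors by printing the possible options to the console.
    public static void main(String[] args) {
        try {
            Map<String, String> values = parse(args);
            System.out.println("Connecting to " + values.get("url"));
        } catch (ParseException e) {
            new HelpFormatter().printHelp("SQLDump", buildOptions());
        }
    }
}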
To address RQ5, we perform the following analyses:
• We run a Wilcoxon signed-rank test [55] to check whether there is a statistically significant difference between the percentage of tests passed with and without the availability of FOCUS. Also, we compute the Cliff's delta effect size [13].
• As for the questionnaire results, we report them using diverging stacked bar charts and discuss them.

This section analyzes the experimental results obtained through the evaluation by referring to the research questions introduced in Section 4.1.

Table 3 reports the success rate for PAM and FOCUS, considering different configurations and values of k, the number of neighbor apps. The cut-off value N was set to 30, so as to investigate the systems' performance on a long list of recommendations. The table shows an evident outcome: FOCUS always achieves a much better success rate than PAM and UP-Miner under all configurations. For instance, with C1.2, FOCUS obtains a success rate of 89.81%, 91.10%, 92.80%, and 93.22% for k=1, k=2, k=3, and k=4, respectively, while PAM and UP-Miner obtain 52.20% and 37.33%, respectively. With C2.2, FOCUS reaches a maximum success rate of 92.10%, which is superior to the 58.40% and 40.66% obtained by PAM and UP-Miner, respectively. We also confirm, in our setting, the claim by Fowkes and Sutton [11] that PAM outperforms UP-Miner.

Concerning the execution time, for the given dataset, both FOCUS and UP-Miner provide a recommendation in less than 0.01 seconds. Specifically, the time is 3.8×10^-4 seconds for UP-Miner (the fastest), 8×10^-3 seconds for FOCUS, and 1.6 seconds for PAM.

The performance gain obtained by FOCUS is understandable in the light of the following arguments. UP-Miner works on the basis of clustering techniques and depends on the similarity among groups of APIs. In other words, UP-Miner computes similarity at the sequence level, i.e., among invocations that are usually found together. PAM is a complex system consisting of six building blocks, i.e., probabilistic model, inference, learning, inferring new patterns, candidate generation, and mining interesting patterns. The system defines a probability distribution over all possible API patterns present in client code, based on a set of API patterns. It also employs a generative model to infer the most probable patterns from ARFF files. Finally, the system generates candidate patterns by relying on the highest-support-first rule, i.e., the most promising candidates are examined first. As a result of these design choices, both UP-Miner and PAM tend to recommend APIs that commonly appear across different code snippets. In contrast, FOCUS considers similarity both at the project level and at the declaration level. Therefore, given an active project, FOCUS mines API calls from the most similar declarations in the most similar projects. This allows FOCUS to outperform both UP-Miner and PAM in finding invocations that fit well to a given context.

It is worth noting that FOCUS obtains a considerably high performance, given that the dataset is fairly small. The maximum success rate obtained by C1.2 and C2.2 is 93.22% and 92.10%, respectively. Compared to our previous work [26], where a set of 200 GITHUB projects was considered to compare FOCUS with PAM, we see that FOCUS substantially improves its recommendations when more data is incorporated into the training.
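For clarity, the success-rate figures discussed above can be computed as in the following sketch, assuming the usual definition in which a testing declaration counts as a match when at least one of its top-N recommended invocations appears in the ground truth GT(p); the data structures are illustrative.

    import java.util.List;
    import java.util.Set;

    public class SuccessRate {

        /**
         * recommendations: for each testing declaration, the ranked list of API calls
         * produced by the recommender; groundTruth: the invocations removed from that
         * declaration, i.e., GT(p). A query is a hit if its top-N list shares at
         * least one invocation with the ground truth.
         */
        public static double successRateAtN(List<List<String>> recommendations,
                                            List<Set<String>> groundTruth, int n) {
            if (recommendations.isEmpty()) {
                return 0.0;
            }
            int hits = 0;
            for (int i = 0; i < recommendations.size(); i++) {
                List<String> ranked = recommendations.get(i);
                List<String> topN = ranked.subList(0, Math.min(n, ranked.size()));
                Set<String> gt = groundTruth.get(i);
                if (topN.stream().anyMatch(gt::contains)) {
                    hits++;
                }
            }
            // Percentage of testing declarations with at least one matched invocation.
            return 100.0 * hits / recommendations.size();
        }
    }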
A feature of the considered datasets which may affect the results obtained by FOCUS is the level of dependencies in Android apps compared to that of the GITHUB projects. In particular, by counting the number of unique APIs in each app/project for both the Android dataset and the GITHUB dataset, we see that the former contains more APIs than the latter: many apps have more than 400 unique APIs, whereas most of the GITHUB projects have fewer than 200. This is further supported by previous work [39], [53], which gives evidence that Android projects make heavy use of third-party libraries as well as native libraries.

Answer to RQ1. While UP-Miner is the fastest tool in terms of recommendation time, FOCUS is the most effective one, as it substantially outperforms both UP-Miner and PAM with respect to prediction performance while keeping the recommendation time below 0.01 seconds. Moreover, FOCUS mines API usage more effectively from Android apps than from GITHUB projects.

In this research question, we are interested in understanding the completeness and accuracy of FOCUS's recommendations at different stages of a project's development. For the former, we analyze the corresponding success rate and performance gain, while for the latter, we consider the obtained precision and recall values. Furthermore, we investigate the system's ability to recommend APIs in the long tail.

Success rate. Table 4 compares the success rates obtained by the considered experimental settings. For the smallest cut-off value, i.e., N=1, FOCUS is still able to provide matches. For instance, with C1.1 and k=2, the system obtains a success rate of 67.46%, and this score increases with N: FOCUS reaches a success rate of 76.84% and 82.80% when N=5 and N=20, respectively. With Configuration C1.2, compared to C1.1, we see a sharp increase in performance for all the cut-off values. For example, with k=2, we get a success rate of 91.11% for N=5, and the score goes up to 94.07% when N=20. This demonstrates that FOCUS is capable of providing good matches even when the developer wants to see a fairly short ranked list. Similarly, with C2.1 and C2.2, the success rate of FOCUS improves as k and N grow.

Next, we investigate the effect on the final outcome of changing the number of neighbor apps used in computing recommendations, i.e., k, by comparing the results column-wise. It is evident that incorporating more neighbors yields a better success rate. C1.1 and C1.2 share the same number of method declarations (δ) and only differ in the number of invocations used in the testing declaration (π). Thus, to investigate the effect of changing π on the recommendations, we consider each pair of related configurations. The results in Table 4 indicate a sharp rise in performance when the configuration changes from C1.1 to C1.2. For example, when k=2 and N=1, i.e., considering only the first item in the ranked list, FOCUS with C1.2 obtains a success rate of 85.69%, which is much better than the 67.46% yielded by C1.1. When k=10 and N=20, the maximum success rate for C2.1 and C2.2 is 90.11% and 96.92%, respectively. This suggests that incorporating more invocations, e.g., four instead of one, helps FOCUS significantly enhance its overall performance. In practice, this means that, given a declaration, the system is able to provide increasingly accurate recommendations as the project matures. Given the results in Table 4, we analyze the performance gain in percentage (%) and report it in Table 5.
The green color and various levels of density are employed in Table 5 to represent the corresponding magnitude of the gain. From the table, it is evident that the color gradually fades when we move from left to right and from top to bottom, implying that the enhancement decreases as we increase k and N. For example, the comparison between C1.1 and C1.2 is as follows: for N=1 the gain is 27.02% with k=2, and it decreases to 26.67% and 23.62% with k=3 and k=4, respectively; when k=10, the gain drops to 18.74%. The same trend can be seen for the other values of k and N. Likewise, the improvement obtained by C2.2 in comparison to C2.1 follows a similar pattern: it is large for low k and N, and small for higher k and N. For instance, it reaches 25.14% for N=1 and k=2 and shrinks to 7.95% for N=20 and k=10. Overall, this means that while we obtain a performance gain by incorporating more neighbors, at a certain point the gain becomes saturated and there is no further improvement.

Accuracy. We report the accuracy achieved by all configurations using the precision-recall curves (PRCs) depicted in Fig. 8(a), Fig. 8(b), Fig. 8(c), and Fig. 8(d). The cut-off value N has been varied from 1 to 30, aiming to study FOCUS's performance further down in the ranked list. First, we examine the effect of changing k on the precision-recall curves. A system achieves a good performance if its precision and recall are high at the same time, which corresponds to a PRC close to the upper-right corner of the diagram. From the figures, it is clear that incorporating more neighbor apps in computing recommendations results in a better accuracy for all configurations. For instance, with C1.1, we see a performance gain when increasing the number of neighbor apps: the best precision and recall are 0.75 and 0.63, respectively, obtained when k=10, while for the other values of k, i.e., k = {2, 3, 4, 6}, the system gets lower precision and recall. Similarly, for the other configurations, k=10 is also the number of neighbor apps used for computing ratings that yields the best accuracy: with C1.2, FOCUS achieves a precision of 0.92 and a recall of 0.84. With C2.1 and C2.2, the gain obtained when using 10 apps for computing recommendations becomes even more evident in comparison to the other values of k, i.e., k = {2, 3, 4, 6}. This is consistent with the outcomes obtained through the success rate scores presented in Table 4: the system achieves a better performance if it incorporates more similar apps for computing recommendations.

In conclusion, we see that the performance of FOCUS using C1.2 is superior to that obtained using C1.1. Similarly, compared to C2.1, the accuracy obtained by FOCUS using C2.2 improves substantially, i.e., by equipping the query with more invocations. These facts further confirm that FOCUS is able to recommend more relevant invocations when the developer enriches the declaration by adding more code. As can be seen in Eq. 4, when more invocations are available, the similarity among declarations can be better determined, resulting in a gain in performance.

The long tail. We counted the APIs that are recommended most often by FOCUS. By carefully checking the top 20 recommended items, we realized that most of them reside in the long tail. For example, the java/lang/StringBuilder/toString() API has been provided 190 times by FOCUS, being the most frequently recommended item. However, this invocation is only ranked 646th in the popularity list of all the APIs in the dataset.
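As an illustration of this long-tail analysis, the sketch below contrasts how often an API is recommended with its rank in the overall popularity list; the data structures and counting logic are illustrative of the analysis and are not part of the FOCUS implementation.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class LongTailCheck {

        /**
         * Returns the 1-based rank of an API in the popularity list, where rank 1 is
         * the API occurring most often in the training corpus (0 if the API does not
         * occur at all).
         */
        public static int popularityRank(Map<String, Integer> occurrenceCount, String api) {
            List<String> byPopularity = occurrenceCount.entrySet().stream()
                    .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
            return byPopularity.indexOf(api) + 1;
        }

        /** Counts how many times each API appears in the recommended lists. */
        public static Map<String, Long> recommendationFrequency(List<List<String>> recommendedLists) {
            return recommendedLists.stream()
                    .flatMap(List::stream)
                    .collect(Collectors.groupingBy(api -> api, Collectors.counting()));
        }
    }

An API such as java/lang/StringBuilder/toString(), recommended 190 times yet ranked 646th by popularity, would show up here as a frequently recommended item with a large popularity rank, i.e., a long-tail recommendation.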
Altogether, this is to show that while recommending very popular APIs may make sense, FOCUS goes far beyond that by also recommending items in the long tail. This is achieved because FOCUS mines APIs from projects that are highly similar to the active project.

Answer to RQ2. FOCUS provides more accurate predictions when more similar projects are used for recommendation. It is capable of suggesting APIs in the long tail. The system improves its accuracy while the developer keeps coding.

Is there a significant correlation between the cardinality of a category and accuracy? Table 6 depicts the Spearman coefficients for all configurations, with respect to different values of N. The Kendall coefficients (τ) are comparable to the Spearman ones (ρ), so we omitted them from the table for the sake of clarity. By examining the results in Table 6 we see that, despite some fluctuations, mainly with C2.1 and C2.2, ρ is considerably small, i.e., the maximum value is ρ=0.160 for C1.2 and N=25. More importantly, most of the scores are close to 0, indicating an extremely low correlation (e.g., for N = {5, 20, 25} with C1.1) or almost no correlation at all (e.g., for N = {15} with C1.1 and C2.2).

As an example, Fig. 9 depicts precision and cardinality as well as their correlation for N=25. The variables are shown both on the x-axis and on the y-axis, although at different parts of the axes. This allows us to comprehensively represent the relationship between the two variables for all four configurations. In particular, the top-left corner contains the histogram of precision with respect to cardinality, while the other bar charts at the bottom show the histogram of each variable individually. The middle frame in the top row specifies the correlation coefficients between precision and cardinality for all the configurations. The results show that there is a very weak correlation between the two variables. For instance, the coefficient is 0.032 for C1.1, and 0.036 for C2.2. As a whole, this unfortunately contradicts our initial conjecture: apps belonging to major categories do not get better recommendations, although they have, in principle, more background data. This means that searching for recommendations just by looking at apps of the same domain(s) does not guarantee a benefit. We attempt to ascertain the possible causes in the following.

According to previous work [22], if projects share API calls implementing the same requirements, then they are considered to be more similar than projects that do not have similar API usage. We computed similarity among apps using the Similarity Calculator component presented in Section 3.2.3. Such a similarity is measured based on the constituent API function calls of an app (cf. Fig. 4(b) and Eq. 1). By carefully examining the final results, we realized that, in general, similar apps do not originate from the same domain. Concretely, considering a ranked list with five items for all 2,600 apps, i.e., N=5, the percentage of apps whose similar apps come from 1, 2, 3, 4, and 5 distinct categories is 1.14%, 6.6%, 21.42%, 41.8%, and 29.0%, respectively. For instance, machinekit.appdiscover (https://bit.ly/3pGKKIL) belongs to Libraries & Demo, but its highly similar apps come from Education, Books & Reference, Health & Fitness, and Tools. Since FOCUS relies on the similarity function (cf. Section 3.2.3), it may retrieve invocations from projects in completely different domains to generate recommendations.
This explains why projects of a category with a low number of items can still obtain a good accuracy, resulting in a weak correlation between the cardinality of a category and accuracy. In a nutshell, there is no correlation because even apps belonging to different categories may contain similar API usage. Though the experiment suggests that we cannot save time by looking only into certain categories, on the bright side, it reveals an interesting feature of FOCUS: the tool is able to discover API calls from a wide range of apps, regardless of their origins.

Answer to RQ3. There is no direct correlation between the cardinality of a category and prediction accuracy. Moreover, FOCUS is capable of mining API calls from apps belonging to various application domains.

As shown in Section 3.3.1, by using the incomplete code in Fig. 1(a) together with other testing declarations as a query to feed FOCUS, we obtained the relevant snippet depicted in Listing 2, and this is just one of many good matches we got. To provide a concrete analysis, Fig. 10 depicts the distribution of the 500-app dataset with respect to the number of projects (x-axis) and the Levenshtein distance between the testing declaration and the recommended snippet (y-axis). To facilitate a better view, we mark the apps as four separate clusters. Almost a quarter of the projects, i.e., 24% or 120 projects, get zero as the final result, i.e., the distance between the recommended snippet and the original one is zero. This means that for each of these projects, the recommended declaration perfectly matches the original one. Among the remaining ones, 23 projects, accounting for 4.6%, have a distance of one, which also indicates a high level of code similarity. Almost half of the dataset, i.e., 233 apps corresponding to 46.60%, have a distance larger than nine.

Figure 10 shows that, while FOCUS achieves a good recommendation performance for a considerably large number of apps, it fails to retrieve matches for some others, i.e., the corresponding Levenshtein distance is large, meaning that the recommended snippets are not relevant to the ground-truth ones. For instance, one project has a distance of 52, and another has a distance of 43. We attempt to find the rationale behind this outcome. Our main intuition is as follows: for the projects with a large Levenshtein distance, there is a lack of relevant training data. In other words, if there are not enough similar projects, FOCUS cannot discover API invocations that eventually fit the active declaration. To validate this hypothesis, we conducted the following test: we computed the precision scores for all projects and compared them with the Levenshtein distances using Spearman's rank correlation coefficient (similarly to RQ3). The resulting score is ρ=-0.514, with p-value < 2.2e-16. This can be interpreted as follows: the obtained precision is inversely related to the Levenshtein distance, i.e., the higher the precision, the shorter the distance, and vice versa. The finding consolidates our assumption: if FOCUS achieves a high precision, it is able to recommend more relevant code snippets. Furthermore, as already shown in RQ2, FOCUS gets a higher precision if we use more similar apps for computing recommendations. Altogether, we conclude that our proposed approach is able to return relevant code snippets if it is fed with more training data.
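The two measurements used in this analysis can be sketched as follows: a standard Levenshtein (edit) distance between the recommended and the ground-truth invocation sequences, and Spearman's rank correlation between per-project precision and distance, here computed with Apache Commons Math. Treating each invocation as one symbol of the compared strings, as well as the use of that library, are illustrative choices rather than details of the FOCUS implementation.

    import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

    public class SnippetDistance {

        /**
         * Classic dynamic-programming Levenshtein distance between two sequences of
         * API invocations: the minimum number of insertions, deletions, and
         * substitutions needed to turn the recommended sequence into the
         * ground-truth one.
         */
        public static int levenshtein(String[] recommended, String[] groundTruth) {
            int[][] d = new int[recommended.length + 1][groundTruth.length + 1];
            for (int i = 0; i <= recommended.length; i++) d[i][0] = i;
            for (int j = 0; j <= groundTruth.length; j++) d[0][j] = j;
            for (int i = 1; i <= recommended.length; i++) {
                for (int j = 1; j <= groundTruth.length; j++) {
                    int cost = recommended[i - 1].equals(groundTruth[j - 1]) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + cost);
                }
            }
            return d[recommended.length][groundTruth.length];
        }

        /** Spearman's rank correlation between per-project precision and distance. */
        public static double correlation(double[] precision, double[] distance) {
            return new SpearmansCorrelation().correlation(precision, distance);
        }
    }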
From the set of apps with a Levenshtein distance of 0, we enumerated the APIs and sorted them in descending order to see which invocations have been recommended most. We obtained an outcome similar to that of RQ2 in Section 5.2: FOCUS recommends several APIs which appear late in the ranked list of the most popular invocations.

Answer to RQ4. FOCUS can provide source code snippets relevant to a testing declaration, as long as we feed it with a rich training dataset, i.e., one containing enough projects similar to the one being considered.

As explained in Section 4, 16 participants took part in the user study. Among them, 30% have three years and 50% have more than four years of programming experience. Most of them use a code search engine on a daily basis. Moreover, 80% of the participants agree that the tasks are clear and easy. First, we analyzed whether the use of FOCUS could help participants produce more correct code. The median number of passed tests was 2 out of 3 both with and without FOCUS. We compared (pairwise, by participant) the percentage of passed test cases using a Wilcoxon signed-rank test. The test did not indicate a statistically significant difference (p-value=0.88), i.e., the correctness of the produced implementations did not change with and without FOCUS. Also, the Cliff's d effect size is negligible (d=0.01). We therefore looked at the perceived usefulness of the recommendations, in terms of API invocations and code snippets.

Results of the questionnaire related to the task assignments in Table 2 are shown in Fig. 11(a), Fig. 11(b), Fig. 11(c), and Fig. 11(d). Concerning the first question, "Q1: Does FOCUS retrieve code snippets relevant to the context?", according to Fig. 11(a), 69% of the participants agree or strongly agree that the snippets are relevant, while the remaining 31% have no concrete judgment on the results, i.e., they are neutral. This means that most of the developers find that the recommended code snippets fit their programming tasks. Regarding the second question, "Q2: Do the recommended code snippets help you complete the lab assignments?", as shown in Fig. 11(b), most of the participants find that the snippets recommended by FOCUS are helpful to solve the tasks; in particular, 73% of them agree or strongly agree. With the third question, "Q3: Does FOCUS retrieve invocations relevant to the context?", we are interested in understanding whether FOCUS can fetch invocations related to the given context. The results in Fig. 11(c) suggest that more than half of the participants, i.e., 56%, think that the provided APIs are relevant, while 44% of them have no concrete judgment. Finally, the results in Fig. 11(d), corresponding to the last question, "Q4: Do the recommended invocations help you complete the lab assignments?", show that 7% of the participants disagree that the APIs are useful, while 27% of them feel neutral about the results. Still, most of them, i.e., 67%, appreciate the recommended APIs, finding them helpful to solve their tasks. Altogether, the Likert scores indicate that FOCUS provides decent recommendations: both the suggested APIs and code snippets are meaningful in the given contexts.

Answer to RQ5. The majority of the study participants positively perceived the context-specific relevance and the usefulness of the recommendations (APIs and code snippets) provided by FOCUS.

In this section, we discuss the experience gained from the experiments (Section 6.1).
The threats that might hamper the validity of our findings are discussed in Section 6.2.

With FOCUS, by choosing a suitable number of similar projects for computing recommendations, we obtain a satisfactory prediction accuracy while still maintaining a reasonable computational complexity. We suppose that, for a training set with big projects, i.e., those that have a large number of declarations and APIs, resulting in a very large 3D tensor, it is necessary to apply matrix factorization techniques. This would allow us to represent the tensor in a lower-dimensional latent space, so as to efficiently handle high-dimensional data as well as to increase the prediction performance.

The recommended snippet in Listing 2 is of high quality, as it matches well with the developer's context. Nevertheless, there is no guarantee that such a good quality holds for all possible cases, as this largely depends on the training data. Thus, we plan to implement a module that asks developers to rate, or provide feedback on, the given recommendations. By doing this, we would be able to collect information about the quality of a recommended snippet, which can then be used to reinforce the learning of FOCUS.

Through the results obtained for RQ2, we confirm the importance of computing similarity among OSS projects [22], [25]. FOCUS improves its prediction performance substantially when we incorporate more similar projects to compute the missing ratings. This implies that, given a project for which we would like to obtain recommendations, if we cannot find any similar projects, then it is not possible to recommend relevant API calls. This, along with the results obtained for RQ4, leads us to conclude that FOCUS relies on the availability of enough training data from similar projects in order to provide relevant API calls as well as code snippets. The results of RQ2 also imply that it would become more difficult for FOCUS to recommend code samples when the context is not fully available or, worse, is missing. In such circumstances, we can apply code synthesis approaches [59] to generate code for a location where there is no concrete context. Moreover, we believe that it is important to improve the way FOCUS computes the similarity among projects and declarations, for instance by optimizing the global mapping using the Hungarian algorithm [57]. We consider all these issues as future improvements for FOCUS.

The RQ3 results reveal two findings about FOCUS. First, FOCUS can provide good recommendations regardless of the categories used for training. Second, given a project in a category, we can find very similar projects in different categories. This outcome provides valuable insight into the meaning of the categories extracted from Google Play: the categories specified by developers provide a rough, abstract description of an app rather than an informative summary of what the app does. This essentially means that these categories do not have much to do with similarity in API usage. Recently, attempts have been made to automatically assign a category to projects/apps [7], [43]. Among others, supervised learning techniques perform the classification by exploiting labeled data, e.g., the apps and their corresponding categories specified by developers. However, we suppose that classifying apps according to their API calls, using the categories specified in Google Play as labels, is likely to fail. Thus, FOCUS currently makes use of similarity techniques working at the API level, without considering any induced categorization.
Nevertheless, for a huge amount of training data, we anticipate that the categorization of apps may help increase efficiency as follows. We can perform preprocessing steps to group apps that are similar in terms of API usage into the same cluster by means of unsupervised clustering techniques [28]. Given such clusters, every time there is an active app, FOCUS looks only into the cluster(s) that contain(s) similar apps, by computing similarity with some of the most representative apps in the clusters. In this way, we can considerably narrow down the search scope for the active app, thereby speeding up the search.

In our empirical study, we focused on Java and Kotlin projects. However, FOCUS can be used to recommend APIs and source code for projects written in other languages, such as PHP and C/C++, since Rascal also supports them [5]. Moreover, there are various reverse engineering tools that can extract declarations and invocations from source code. For example, the Eclipse JDT parser (https://www.vogella.com/tutorials/EclipseJDT/article.html) has been widely used to parse Java source code in related studies [54], [58]. Similarly, dotPeek (https://www.jetbrains.com/decompiler/) uses a combination of debug information and web services to reconstruct C# source code. For Android programming, Kerberoid [18] has been proposed to decompile app source code. Altogether, this implies that our tool is not bound to Rascal, as it can work with data parsed by different tools.

FOCUS is a data-driven approach and therefore depends on the availability of decent training data. We believe that a lack of proper training data will lead to a reduction in accuracy. In other words, the system may fail, especially if the testing project is at an early stage, i.e., there are few declarations and APIs, or when there are not enough training projects. One of the most practical countermeasures is to collect as many projects as we can to create a rich knowledge base for our tool. In this way, we increase the chance of finding highly similar projects, given a project being developed.

Internal validity. Threats to internal validity are related to confounding factors, internal to our study, that could have influenced the results. One probable threat concerns the results obtained for the smaller dataset of 500 apps used in RQ1. As we already mentioned, this dataset was used to compare FOCUS with UP-Miner and PAM due to the limited scalability of PAM. Moreover, we also deliberately made use of such a small dataset in RQ4 to study the extent to which FOCUS is able to recommend relevant source code. The intuition is that if it performs well on a limited amount of training data, it will also be effective on a larger one. In the comparison between UP-Miner, PAM, and FOCUS, we used the implementation of PAM published online by its authors. Since the original implementation of UP-Miner is no longer available, we made use of the source code re-implemented by the authors of PAM. To mitigate threats to internal validity, we also evaluated the systems using exactly the same dataset and evaluation metrics. Furthermore, we ran several trials and counter-checks to validate the evaluation outcomes.
Concerning the user study, we (i) limited the extent to which results depend on personal skills by involving 16 Master's students having a similar development background and experience; and (ii) did not disclose the goals of the experiment, to avoid hypothesis guessing. Another threat is that, to simplify the setting, developers used FOCUS recommendations as HTML pages (see for instance https://bit.ly/3d3i2hY) instead of having them in the IDE. However, we evaluated the perceived usefulness of the recommendations, not of the tool.

External validity. The main threat to external validity is that our proposed approach is currently limited to Java and Kotlin programs. As stated in Section 3, however, FOCUS makes few assumptions on the underlying language and only requires information about method declarations and invocations to build the 3D matrix. This information could be extracted from programs written in any object-oriented programming language, and we wish to generalize FOCUS to other languages in the future. Also, in the future, FOCUS may benefit from an in-field evaluation in an industrial setting.

The main threat to construct validity concerns the simulated setting used to evaluate the approaches, as opposed to performing a user study. We mitigated this threat by introducing four configurations that simulate different stages of the development process. In a real development setting, however, the order in which one writes statements might not fully reflect our simulation. Also, in a realistic usage setting, there may be cases in which an API recommender turns out to be more useful (e.g., when recommending API usages for which the developer has limited skills/knowledge), and cases where it is less useful (obvious code completion, or recommending usage scenarios for commonly used APIs). Such a threat has been mitigated with the user study, in which participants evaluated the recommendations provided by FOCUS. In the user study, we evaluated the perceived usefulness of the recommendations, which may or may not correspond to the actual usefulness. The outcome of the performed tasks did not show any significant difference in terms of correctness of the produced artifacts. This could possibly depend on the study setting (i.e., offline), in which participants had the time to properly implement the task and, possibly, search for plausible solutions and/or API documentation.

The adoption of recommender systems in software engineering aims at supporting developers in navigating large information spaces and getting instant recommendations, which can provide guidance for the particular development task at hand [44]. In this section, we present an overview of some representative recommender systems, focusing on those specifically conceived to support software development activities.

API usage pattern recommendation. Acharya et al. [3] present a framework to extract API patterns as partial orders from client code. To this aim, control-flow-sensitive static API traces are extracted from source code and sequential patterns are computed. In contrast, our approach is able to recommend both a list of API calls and related source code. Zhong et al. implemented MAPO, a tool to retrieve API usage patterns from client projects [58]. MAPO extracts API usages from source files and groups API methods into clusters. Afterwards, it mines API usage patterns from the clusters, ranks them according to their similarity with the current development context, and eventually suggests code snippets to developers.
Similarly, UP-Miner [54] extracts API usage patterns by relying on SeqSim, a clustering strategy that reduces pattern redundancy and improves coverage. While these approaches are based on clustering techniques and consider all client projects in the mining phase regardless of their similarity with the current project, FOCUS narrows down the search scope by looking only into similar projects. In this work, we have seen that FOCUS clearly outperforms UP-Miner. PAM (Probabilistic API Miner) mines API usage patterns based on a parameter-free probabilistic algorithm [11]. The tool uses the structural Expectation-Maximization (EM) algorithm to infer the most probable API patterns from client code. PAM outperforms both MAPO and UP-Miner (lower redundancy and higher precision). Through a comparison of FOCUS with PAM, we have seen that our approach obtains a better performance with respect to various metrics. NCBUP-miner (Non Client-based Usage Patterns) [42] is a technique that identifies unordered API usage patterns from the API source code, based on both structural (methods that modify the same object) and semantic (methods that have the same vocabulary) relations. The same authors also propose MLUP [41], which is based on vector representation and clustering, but in this case client code is also considered.

XSnippet [40] suggests relevant code snippets starting from the developer's context. The system invokes different queries that consider both the parents of the class and the lexically visible types. The queries computed in such a way are then passed to a module that mines relevant paths by relying on a graph-based structure. The ranking module eventually ranks the obtained snippets by employing six different heuristics. DeepAPI [15] is a deep-learning method used to generate API usage sequences given a query in natural language. The learning problem is encoded as a machine translation problem, where queries are considered the source language and API sequences the target language. Only commented methods are considered during the search. The same authors [14] present CODEnn (COde-Description Embedding Neural Network), where, instead of API sequences, code snippets are retrieved for the developer based on semantic aspects such as API sequences, comments, method names, and tokens. With respect to the aforementioned approaches, FOCUS uses CF techniques to recommend API calls and usage patterns from a set of similar projects. In the end, not only are relevant API invocations recommended, but code snippets are also returned to the developer as usage examples.

Strathcona [17] is an Eclipse plug-in that extracts the structural context of code and uses it as a query to request API usages from a remote repository. The system performs the match by employing six heuristics associated with class inheritance, method calls, and field types. In a similar fashion, the technique proposed by Buse and Weimer [6] synthesizes API examples for a given data type. An algorithm based on data-flow analysis, k-Medoids clustering, and pattern abstraction is designed; its outcome is a set of syntactically correct and well-typed code snippets, where example length, exception handling, variable initialization and naming, and abstract uses are considered. MUSE (Method USage Examples) is an approach proposed by Moreno et al. [23] to recommend code examples for a given API method. MUSE extracts API usages from client code, simplifies code examples with static slicing, and detects clones to group similar snippets.
It also ranks examples according to certain properties, i.e., reusability, understandability, and popularity. SWIM (Synthesizing What I Mean) [33] searches for structured API call sequences (control and data flows are considered), and then synthesizes API-related code snippets according to a query in natural language. The underlying learning model is also built with the EM algorithm. Similarly, Raychev et al. [35] propose a code completion approach based on statistical language models, which receives as input a partial program and outputs a set of API call sequences filling the gaps of the input. Both invocations and invocation arguments are synthesized, considering multiple types of an API. Thummalapenta and Xie propose SpotWeb [48], an approach that provides starting points (hotspots) for understanding a framework, and highlights where finding examples could be more challenging (coldspots). Other tools exploit StackOverflow discussions to suggest code snippets and documentation [9], [31], [32], [34], [36], [47], [50].
We presented FOCUS, a recommender system to provide developers with suitable API function calls and code snippets while they are programming. A thorough evaluation has been conducted (i) on an Android dataset to study the approach's performance, and (ii) in a user study with 16 participants to assess the perceived usefulness of FOCUS recommendations. We succeeded in integrating FOCUS into the Eclipse IDE, and we made the developed tool available online together with the parsed metadata [27]. This aims at providing the research community at large with a sound replication package, which allows one to seamlessly reproduce the experiments presented in our paper. Future research in this area includes (i) replicating the empirical evaluation on further projects, possibly also supporting further programming languages, and (ii) updating the code base of the Eclipse Scava project (https://www.eclipse.org/scava/), which embraces all the development outcomes produced in the context of the EU CROSSMINER project, with the FOCUS tool as presented in this paper.

The research described has been carried out as part of the CROSSMINER Project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 732223. We thank Gian Luca Scoccia for providing us with the tool to extract APK files from Apkpure. We also thank the students who kindly participated in the user study, despite the difficulties caused by the unprecedented pandemic. Finally, we are grateful to the anonymous reviewers for their valuable comments and suggestions that helped us improve the paper.

REFERENCES
cwi-swat/clair: v0.1.0
Mining API Patterns as Partial Orders from Source Code: From Usage Scenarios to Specifications
M3: A General Model for Code Analytics in Rascal
Modular language implementation in Rascal -- experience report
Synthesizing API Usage Examples
Detecting Java software similarities by using different clustering techniques
Context-Aware Collaborative Filtering System: Predicting the User's Preference in the Ubiquitous Computing Environment
Context-Based Recommendation to Support Problem Solving in Software Development
Automated API-usage update for Android apps
Parameter-free Probabilistic API Mining Across GitHub
A graph-based dataset of commit history of real-world Android apps
Effect sizes for research: A broad practical approach
Deep Code Search
Deep API Learning
PHP AiR: Analyzing PHP systems with Rascal
Using Structural Context to Recommend Source Code Examples
Kerberoid: A practical Android app decompilation system with multiple decompilers
The 80/20 Principle: The Secret of Achieving More with Less
A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection
Binary Codes Capable of Correcting Deletions, Insertions and Reversals
Detecting similar software applications
How Can I Use This Method?
What Makes a Good Code Example?: A Study of Programming Q&A in StackOverflow
CrossRec: Supporting Software Developers by Recommending Third-party Libraries
FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns
TSE FOCUS replication package
Modification to K-Medoids and CLARA for Effective Document Clustering
Questionnaire Design, Interviewing and Attitude Measurement
Information Distribution Aspects of Design Methodology
Mining StackOverflow to Turn the IDE into a Self-confident Programming Prompter
Supporting Software Developers with a Holistic Recommender System
SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis
Towards a Context-Aware IDE-Based Meta Search Engine for Recommendation about Programming Errors and Exceptions
Code Completion with Statistical Language Models
Discovering Essential Code Elements in Informal Documentation
What Makes APIs Hard to Learn? Answers from Developers
Automated API Property Inference Techniques
Understanding reuse in the Android Market
XSnippet: Mining for sample code
Mining Multi-level API Usage Patterns
Could We Infer Unordered API Usage Patterns Only Using the Library Source Code
On the automatic categorisation of Android applications
Collaborative Filtering Recommender Systems, in The Adaptive Web: Methods and Strategies of Web Personalization
An investigation into Android run-time permissions from the end users' perspective
A Spontaneous Code Recommendation Tool Based on Associative Search
SpotWeb: Detecting Framework Hotspots and Coldspots via Mining Open Source Code on the Web
An empirical study on the maintenance of source code clones
Augmenting API Documentation with Insights from Stack Overflow
How API Documentation Fails
Improving sales diversity by recommending users to items
A measurement study of Google Play
Mining Succinct and High-coverage API Usage Patterns from Source Code
Individual comparisons by ranking methods
Performance Evaluation of Classification Algorithms by K-fold and Leave-one-out Cross Validation
Towards reusing hints from past fixes: An exploratory study on thousands of real samples
MAPO: Mining and Recommending API Usage Patterns
Lancer: Your code tell me what you need