نتائج البحث - "Bavota, Gabriele"

1

تقرير

How the Training Procedure Impacts the Performance of Deep Learning-based Vulnerability Patching

المؤلفون: Mastropaolo, Antonio, Nardone, Vittoria, Bavota, Gabriele, Di Penta, Massimiliano

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Generative deep learning (DL) models have been successfully adopted for vulnerability patching. However, such models require the availability of a large dataset of patches to learn from. To overcome this issue, researchers have proposed to start from models pre-trained with general knowledge, either on the programming language or on similar tasks such as bug fixing. Despite the efforts in the area of automated vulnerability patching, there is a lack of systematic studies on how these different training procedures impact the performance of DL models for such a task. This paper provides a manyfold contribution to bridge this gap, by (i) comparing existing solutions of self-supervised and supervised pre-training for vulnerability patching; and (ii) for the first time, experimenting with different kinds of prompt-tuning for this task. The study required to train/test 23 DL models. We found that a supervised pre-training focused on bug-fixing, while expensive in terms of data collection, substantially improves DL-based vulnerability patching. When applying prompt-tuning on top of this supervised pre-trained model, there is no significant gain in performance. Instead, prompt-tuning is an effective and cheap solution to substantially boost the performance of self-supervised pre-trained models, i.e., those not relying on the bug-fixing pre-training.

الوصول الحر: http://arxiv.org/abs/2404.17896Test

View record in Arxiv

2

تقرير

On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions

المؤلفون: Ciniselli, Matteo, Martin-Lopez, Alberto, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Code completion is a key feature of Integrated Development Environments (IDEs), aimed at predicting the next tokens a developer is likely to write, helping them write code faster and with less effort. Modern code completion approaches are often powered by deep learning (DL) models. However, the swift evolution of programming languages poses a critical challenge to the performance of DL-based code completion models: Can these models generalize across different language versions? This paper delves into such a question. In particular, we assess the capabilities of a state-of-the-art model, CodeT5, to generalize across nine different Java versions, ranging from Java 2 to Java 17, while being exclusively trained on Java 8 code. Our evaluation spans three completion scenarios, namely, predicting tokens, constructs (e.g., the condition of an if statement) and entire code blocks. The results of our study reveal a noticeable disparity among language versions, with the worst performance being obtained in Java 2 and 17 - the most far apart versions compared to Java 8. We investigate possible causes for the performance degradation and show that the adoption of a limited version-specific fine-tuning can partially alleviate the problem. Our work raises awareness on the importance of continuous model refinement, and it can inform the design of alternatives to make code completion models more robust to language evolution.

الوصول الحر: http://arxiv.org/abs/2403.15149Test

View record in Arxiv

3

تقرير

Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study

المؤلفون: Tufano, Rosalia, Mastropaolo, Antonio, Pepe, Federica, Dabić, Ozren, Di Penta, Massimiliano, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Large Language Models (LLMs) have gained significant attention in the software engineering community. Nowadays developers have the possibility to exploit these models through industrial-grade tools providing a handy interface toward LLMs, such as OpenAI's ChatGPT. While the potential of LLMs in assisting developers across several tasks has been documented in the literature, there is a lack of empirical evidence mapping the actual usage of LLMs in software projects. In this work, we aim at filling such a gap. First, we mine 1,501 commits, pull requests (PRs), and issues from open-source projects by matching regular expressions likely to indicate the usage of ChatGPT to accomplish the task. Then, we manually analyze these instances, discarding false positives (i.e., instances in which ChatGPT was mentioned but not actually used) and categorizing the task automated in the 467 true positive instances (165 commits, 159 PRs, 143 issues). This resulted in a taxonomy of 45 tasks which developers automate via ChatGPT. The taxonomy, accompanied with representative examples, provides (i) developers with valuable insights on how to exploit LLMs in their workflow and (ii) researchers with a clear overview of tasks that, according to developers, could benefit from automated solutions.
Comment: Paper accepted for publication at 21st International Conference on Mining Software Repositories (MASR'24)

الوصول الحر: http://arxiv.org/abs/2402.16480Test

View record in Arxiv

4

تقرير

Towards Summarizing Code Snippets Using Pre-Trained Transformers

المؤلفون: Mastropaolo, Antonio, Ciniselli, Matteo, Pascarella, Luca, Tufano, Rosalia, Aghajani, Emad, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: When comprehending code, a helping hand may come from the natural language comments documenting it that, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: For example, in the case of Java, it is easy to create datasets composed of pairs that can be fed to DL models to teach them how to summarize a method. Such a comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used such a dataset to train a multi-task DL model, taking as input a comment and being able to (i) classify whether it represents a "code summary" or not and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we run this model on 10k projects, identifying and linking code summaries to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets that have then been used to train a new DL model able to document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach.

الوصول الحر: http://arxiv.org/abs/2402.00519Test

View record in Arxiv

5

تقرير

Code Review Automation: Strengths and Weaknesses of the State of the Art

المؤلفون: Tufano, Rosalia, Dabić, Ozren, Mastropaolo, Antonio, Ciniselli, Matteo, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering pushed the automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would do or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions on the techniques' capabilities. For example, knowing that an approach is able to correctly address a reviewer's comment in 10% of cases is of little value without knowing what was asked by the reviewer: What if in all successful cases the code change required to address the comment was just the removal of an empty line? In this paper we aim at characterizing the cases in which three code review automation techniques tend to succeed or fail in the two above-described tasks. The study has a strong qualitative focus, with ~105 man-hours of manual inspection invested in manually analyzing correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis are two taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. A result of our manual analysis was also the identification of several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of researching in techniques specialized for code review automation by comparing their performance with ChatGPT, a general purpose large language model, finding that ChatGPT struggles in commenting code as a human reviewer would do.

الوصول الحر: http://arxiv.org/abs/2401.05136Test

View record in Arxiv

6

تقرير

Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization

المؤلفون: Mastropaolo, Antonio, Ciniselli, Matteo, Di Penta, Massimiliano, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. However, in most cases, researchers rely on automatic evaluation metrics such as BLEU, ROUGE, and METEOR. These metrics are all based on the same assumption: The higher the textual similarity between the generated summary and a reference summary written by developers, the higher its quality. However, there are two reasons for which this assumption falls short: (i) reference summaries, e.g., code comments collected by mining software repositories, may be of low quality or even outdated; (ii) generated summaries, while using a different wording than a reference one, could be semantically equivalent to it, thus still being suitable to document the code snippet. In this paper, we perform a thorough empirical investigation on the complementarity of different types of metrics in capturing the quality of a generated summary. Also, we propose to address the limitations of existing metrics by considering a new dimension, capturing the extent to which the generated summary aligns with the semantics of the documented code snippet, independently from the reference summary. To this end, we present a new metric based on contrastive learning to capture said aspect. We empirically show that the inclusion of this novel dimension enables a more effective representation of developers' evaluations regarding the quality of automatically generated summaries.

الوصول الحر: http://arxiv.org/abs/2312.15475Test

View record in Arxiv

7

تقرير

Log Statements Generation via Deep Learning: Widening the Support Provided to Developers

المؤلفون: Mastropaolo, Antonio, Ferrari, Valentina, Pascarella, Luca, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Logging assists in monitoring events that transpire during the execution of software. Previous research has highlighted the challenges confronted by developers when it comes to logging, including dilemmas such as where to log, what data to record, and which log level to employ (e.g., info, fatal). In this context, we introduced LANCE, an approach rooted in deep learning (DL) that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. Nevertheless, LANCE grapples with two primary constraints: (i) it presumes that a method necessitates the inclusion of logging statements and; (ii) it allows the injection of only a single (new) log statement, even in situations where the injection of multiple log statements might be essential. To address these limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).

الوصول الحر: http://arxiv.org/abs/2311.04587Test

View record in Arxiv

8

تقرير

Toward Automatically Completing GitHub Workflows

المؤلفون: Mastropaolo, Antonio, Zampetti, Fiorella, Bavota, Gabriele, Di Penta, Massimiliano

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Continuous integration and delivery (CI/CD) are nowadays at the core of software development. Their benefits come at the cost of setting up and maintaining the CI/CD pipeline, which requires knowledge and skills often orthogonal to those entailed in other software-related tasks. While several recommender systems have been proposed to support developers across a variety of tasks, little automated support is available when it comes to setting up and maintaining CI/CD pipelines. We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing a specific type of CI/CD pipelines, namely GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model's confidence is a reliable proxy for the recommendations' correctness likelihood.

الوصول الحر: http://arxiv.org/abs/2308.16774Test

View record in Arxiv

9

تقرير

Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?

المؤلفون: Mastropaolo, Antonio, Di Penta, Massimiliano, Bavota, Gabriele

مصطلحات موضوعية: Computer Science - Software Engineering, Computer Science - Artificial Intelligence

الوصف: Upon evolving their software, organizations and individual developers have to spend a substantial effort to pay back technical debt, i.e., the fact that software is released in a shape not as good as it should be, e.g., in terms of functionality, reliability, or maintainability. This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models, and in particular models exploiting different strategies for pre-training and fine-tuning. We start by extracting a dateset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects. SATD refers to technical debt instances documented (e.g., via code comments) by developers. We use this dataset to experiment with seven different generative deep learning (DL) model configurations. Specifically, we compare transformers pre-trained and fine-tuned with different combinations of training objectives, including the fixing of generic code changes, SATD removals, and SATD-comment prompt tuning. Also, we investigate the applicability in this context of a recently-available Large Language Model (LLM)-based chat bot. Results of our study indicate that the automated repayment of SATD is a challenging task, with the best model we experimented with able to automatically fix ~2% to 8% of test instances, depending on the number of attempts it is allowed to make. Given the limited size of the fine-tuning dataset (~5k instances), the model's pre-training plays a fundamental role in boosting performance. Also, the ability to remove SATD steadily drops if the comment documenting the SATD is not provided as input to the model. Finally, we found general-purpose LLMs to not be a competitive approach for addressing SATD.

الوصول الحر: http://arxiv.org/abs/2308.08943Test

View record in Arxiv

10

تقرير

Using Gameplay Videos for Detecting Issues in Video Games

المؤلفون: Guglielmi, Emanuela, Scalabrino, Simone, Bavota, Gabriele, Oliveto, Rocco

مصطلحات موضوعية: Computer Science - Software Engineering

الوصف: Context. The game industry is increasingly growing in recent years. Every day, millions of people play video games, not only as a hobby, but also for professional competitions (e.g., e-sports or speed-running) or for making business by entertaining others (e.g., streamers). The latter daily produce a large amount of gameplay videos in which they also comment live what they experience. But no software and, thus, no video game is perfect: Streamers may encounter several problems (such as bugs, glitches, or performance issues) while they play. Also, it is unlikely that they explicitly report such issues to developers. The identified problems may negatively impact the user's gaming experience and, in turn, can harm the reputation of the game and of the producer. Objective. In this paper, we propose and empirically evaluate GELID, an approach for automatically extracting relevant information from gameplay videos by (i) identifying video segments in which streamers experienced anomalies; (ii) categorizing them based on their type (e.g., logic or presentation); clustering them based on (iii) the context in which appear (e.g., level or game area) and (iv) on the specific issue type (e.g., game crashes). Method. We manually defined a training set for step 2 of GELID (categorization) and a test set for validating in isolation the four components of GELID. In total, we manually segmented, labeled, and clustered 170 videos related to 3 video games, defining a dataset containing 604 segments. Results. While in steps 1 (segmentation) and 4 (specific issue clustering) GELID achieves satisfactory results, it shows limitations on step 3 (game context clustering) and, above all, step 2 (categorization).
Comment: Accepted at Empirical Software Engineering journal (EMSE). arXiv admin note: text overlap with arXiv:2204.04182

الوصول الحر: http://arxiv.org/abs/2307.14749Test

View record in Arxiv

تنقيح النتائج