
Empirical Study

Agent Design

  • Can GPT-4 Replicate Empirical Software Engineering Research?, (FSE2024)

    • Abstract: Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodolo...
    • Labels: empirical study, agent design
  • Complementary explanations for effective in-context learning, (ACL2023)

    • Abstract: Large language models (LLMs) have exhibited remarkable capabilities in learning from explanations in prompts, but there has been limited understanding of exactly how these explanations function or why they are effective. This work aims to better understand the mechanisms by which explanations are used for in-context learning. We first study the impact of two different factors on the performance of prompts with explanations: the computation trace (the way the solution is decomposed) and the natur...
    • Labels: agent design, prompt strategy, reason with code, empirical study
  • Explanation selection using unlabeled data for chain-of-thought prompting, (EMNLP2023)

    • Abstract: Recent work has shown how to prompt large language models with explanations to obtain strong performance on textual reasoning tasks, i.e., the chain-of-thought paradigm. However, subtly different explanations can yield widely varying downstream task accuracy. Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance. This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fash...
    • Labels: agent design, prompt strategy, reason with code, empirical study
  • Self-Planning Code Generation with Large Language Models, (TOSEM2024)

    • Abstract: Although large language models (LLMs) have demonstrated impressive ability in code generation, they are still struggling to address the complicated intent provided by humans. It is widely acknowledged that humans typically employ planning to decompose complex problems and schedule solution steps prior to implementation. To this end, we introduce planning into code generation to help the model understand complex intent and reduce the difficulty of problem-solving. This paper proposes a self-plann...
    • Labels: code generation, program synthesis, agent design, planning, empirical study
  • When Do Program-of-Thoughts Work for Reasoning?, (AAAI2024)

    • Abstract: In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods like program-of-thought prompting for LLMs which uses programming language to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose complexity-impacted reasoning score CIRS, which combines structural and logical attribute...
    • Labels: agent design, prompt strategy, reason with code, empirical study
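
Several entries above (self-planning and program-of-thought prompting) share the same core move: instead of free-text chain-of-thought, the model emits an executable program whose result is taken as the answer. A minimal sketch of that loop follows; the LLM call is mocked with a canned reply, and the question and variable names are illustrative assumptions, not taken from any of the papers.

```python
# Program-of-thought (PoT) sketch: the model writes Python that computes
# a variable `answer`; executing that code yields the final result.

def mock_llm(prompt: str) -> str:
    # Stand-in for a real LLM call: the code a model might generate
    # for the arithmetic question below.
    return (
        "apples = 23\n"
        "used = 20\n"
        "bought = 6\n"
        "answer = apples - used + bought\n"
    )

def program_of_thought(question: str) -> int:
    prompt = f"# Question: {question}\n# Write Python that sets `answer`.\n"
    code = mock_llm(prompt)
    scope: dict = {}
    exec(code, scope)        # run the generated program (trusted input only)
    return scope["answer"]   # executed result = the model's answer

print(program_of_thought(
    "A cafeteria had 23 apples, used 20, then bought 6 more. How many now?"
))  # 9
```

In a real system the generated code would run in a sandbox; `exec` on untrusted model output is unsafe outside a toy setting.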

Code Generation

Code Model

General Coding Task

  • Automatic Programming: Large Language Models and Beyond, (arXiv2024)

    • Abstract: Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related issues of programmer responsibility. These are key issues for organizations while deciding on the usag...
    • Labels: general coding task, empirical study
  • Automating Code-Related Tasks Through Transformers: The Impact of Pre-Training, (ICSE2023)

    • Abstract: Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned...
    • Labels: general coding task, code model, code model training, source code model, empirical study
  • Codemind: A framework to challenge large language models for code reasoning, (arXiv2024)

    • Abstract: Solely relying on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or promoting models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three code reasoning tasks: Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR). The first two evaluate models to predict the execution output ...
    • Labels: general coding task, empirical study
  • Exploring Distributional Shifts in Large Language Models for Code Analysis, (EMNLP2023)

    • Abstract: We systematically study how three large language models with code capabilities - CodeT5, Codex, and ChatGPT - generalize to out-of-domain data. We consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. We establish that samples from each new domain present all the models with a significant challenge of distribution shift. We study ...
    • Labels: general coding task, code model, code model training, source code model, empirical study
  • Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation, (ACL2024)

    • Abstract: Self-consistency (SC), leveraging multiple samples from LLMs, shows significant gains on various reasoning tasks but struggles with free-form generation due to the difficulty of aggregating answers. Its variants, UCS and USC, rely on sample selection or voting mechanisms to improve output quality. These methods, however, face limitations due to their inability to fully utilize the nuanced consensus knowledge present within multiple candidate samples, often resulting in suboptimal outputs. We pro...
    • Labels: general coding task, empirical study
  • Mastering the Craft of Data Synthesis for CodeLLMs, (NAACL2025)

    • Abstract: Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore f...
    • Labels: general coding task, empirical study
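
The ACL 2024 entry above builds on vanilla self-consistency (SC): sample several completions for the same prompt and aggregate, which for short answers reduces to a majority vote. A minimal sketch under that assumption, with the stochastic sampler mocked by a fixed list:

```python
from collections import Counter

def mock_sample(prompt: str, n: int) -> list[str]:
    # Stand-in for n stochastic LLM completions of the same prompt.
    return ["42", "42", "41", "42", "40"]

def self_consistency(prompt: str, n: int = 5) -> str:
    samples = mock_sample(prompt, n)
    # Majority vote over the sampled final answers.
    return Counter(samples).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # 42
```

The paper's point is precisely that this vote breaks down for free-form generation, where candidate outputs rarely match exactly; its fine-grained variant aggregates at a sub-answer level instead.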

Hallucination In Reasoning

Program Testing

  • A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing, (ISSTA2025)

    • Abstract: Unit testing plays a pivotal role in software development, improving software quality and reliability. However, generating effective test cases manually is time-consuming, prompting interest in unit testing research. Recently, Large Language Models (LLMs) have shown potential in various unit testing tasks, including test generation, assertion generation, and test evolution, but existing studies are limited in scope and lack a systematic evaluation of the effectiveness of LLMs. To bridge thi...
    • Labels: program testing, unit testing, empirical study
  • An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation, (TSE2024)

    • Abstract: Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness...
    • Labels: program testing, unit testing, empirical study
  • ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation, (TSE2024)

    • Abstract: Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code...
    • Labels: program testing, unit testing, empirical study
  • Doc2OracLL: Investigating the Impact of Documentation on LLM-Based Test Oracle Generation, (FSE2025)

    • Abstract: Code documentation is a critical artifact of software development, bridging human understanding and machine-readable code. Beyond aiding developers in code comprehension and maintenance, documentation also plays a critical role in automating various software engineering tasks, such as test oracle generation (TOG). In Java, Javadoc comments offer structured, natural language documentation embedded directly within the source code, typically describing functionality, usage, parameters, ret...
    • Labels: program testing, general testing, empirical study
  • Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction, (TSE2024)

    • Abstract: Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompti...
    • Labels: program testing, bug reproduction, empirical study
  • Evaluating and Improving ChatGPT for Unit Test Generation, (FSE2024)

    • Abstract: Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although the traditional techniques are able to generate tests with reasonable coverage, they are shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained ...
    • Labels: program testing, unit testing, empirical study, code generation
  • Large Language Models for Equivalent Mutant Detection: How Far Are We?, (ISSTA2024)

    • Abstract: Mutation testing is vital for ensuring software quality. However, the presence of equivalent mutants is known to introduce redundant cost and bias issues, hindering the effectiveness of mutation testing in practical use. Although numerous equivalent mutant detection (EMD) techniques have been proposed, they exhibit limitations due to the scarcity of training data and challenges in generalizing to unseen mutants. Recently, large language models (LLMs) have been extensively adopted in various code...
    • Labels: program testing, mutation testing, empirical study
  • Less Is More: On the Importance of Data Quality for Unit Test Generation, (FSE2025)

    • Abstract: Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of da...
    • Labels: program testing, unit testing, empirical study
  • On the Evaluation of Large Language Models in Unit Test Generation, (ASE2024)

    • Abstract: Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation. Existing research primarily focuses on closed-source LLMs (e.g., ChatGPT and CodeX) with fixed prompting strategies, leaving the capabilities of advanced open-source LLMs with various prompting settings une...
    • Labels: program testing, unit testing, empirical study
  • Reasoning Runtime Behavior of a Program with LLM: How Far are We?, (ICSE2025)

    • Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs (i.e., predicting code execution behaviors such as program output and execution path), but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting th...
    • Labels: program testing, debugging, benchmark, empirical study
  • TOGLL: Correct and Strong Test Oracle Generation with LLMs, (ICSE2025)

    • Abstract: Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural methods for automated test oracle generation often result in a large number of false positives and weaker test oracles. While LLMs have shown impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation....
    • Labels: program testing, empirical study
  • Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation, (ASE2024)

    • Abstract: Automatic testing has garnered significant attention and success over the past few decades. Techniques such as unit testing and coverage-guided fuzzing have revealed numerous critical software bugs and vulnerabilities. However, a long-standing, formidable challenge for existing techniques is how to achieve higher testing coverage. Constraint-based techniques, such as symbolic execution and concolic testing, have been well-explored and integrated into the existing approaches. With the popularity ...
    • Labels: program testing, unit testing, empirical study
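
The equivalent-mutant-detection study above rests on the basic mechanics of mutation testing: a mutant is "killed" if some test input distinguishes it from the original program, and "equivalent" if no input ever can. A minimal sketch (the functions and inputs are illustrative, not from the paper):

```python
def original(x: int) -> int:
    return abs(x)

def mutant_killable(x: int) -> int:
    return x            # `abs` dropped: any negative input exposes it

def mutant_equivalent(x: int) -> int:
    return abs(abs(x))  # behaviorally identical: no test can kill it

def killed(orig, mut, inputs) -> bool:
    # A mutant is killed if it disagrees with the original on some input.
    return any(orig(i) != mut(i) for i in inputs)

tests = [-3, 0, 7]
print(killed(original, mutant_killable, tests))    # True  (killed)
print(killed(original, mutant_equivalent, tests))  # False (survives)
```

Equivalent mutants like the second one survive every possible test, which is why detecting them (the paper's subject) matters: they inflate cost and bias mutation scores without indicating any weakness in the test suite.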

Software Maintenance And Deployment

Static Analysis