Description
Recent advancements in large language models (LLMs) have paved the way for tools that can enhance the software development process for scientists. In this context, LLMs excel at two tasks -- code documentation in natural language and code generation in a given programming language. Commercially available tools are often restricted by their context window size, impose usage limits, or incur substantial costs for large-scale development and modernization of large scientific codebases; they also raise data privacy and security concerns. A programmatic framework that runs from the Linux terminal on a local server is therefore needed -- one that reduces manual effort by batching code documentation across large codebases and that can use the latest open models offline for code generation.
We present a retrieval-augmented generation (RAG)-based AI assistant for code documentation and generation that is entirely local and scalable. This setup offers a large context window, is free of recurring external costs, and ensures transparency. The code documentation assistant has three components: (a) Doxygen-style comment generation for all functions and classes, retrieving relevant information from RAG sources (papers, posters, presentations); (b) file-level summary generation; and (c) an interactive chatbot. These features improve code comprehension for new members of a research group and aid understanding of sparsely documented codebases.

The code generation assistant splits the code into self-contained chunks before embedding (sketched below) -- a strategy that improves code retrieval in large codebases. We compare different text and code embedding models for code retrieval. Retrieval is followed by AI-generated suggestions for performance optimization and accurate refactoring, employing call-graph knowledge to maintain comprehensiveness in large codebases. Additionally, we discuss the guardrails required to ensure code maintainability and correctness when using LLMs for code generation.
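To make the chunking step concrete, the sketch below splits a source file into one chunk per top-level function or class. It assumes Python sources for simplicity (for C/C++ codebases a parser such as libclang would play the same role), and embed_model stands in for whichever text or code embedding model is under evaluation; neither name comes from the framework itself.

    # Minimal sketch of the chunk-and-embed step. Python sources are
    # assumed for simplicity; embed_model is a placeholder for the
    # embedding model being evaluated.
    import ast
    from pathlib import Path

    def chunk_source(path: str) -> list[str]:
        """Split a source file into self-contained chunks, one per
        top-level function or class definition."""
        source = Path(path).read_text()
        chunks = []
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # Recover the exact source text of the definition.
                chunks.append(ast.get_source_segment(source, node))
        return chunks

    # Each chunk is embedded and stored in a local vector index, e.g.:
    #   vectors = [embed_model.encode(c) for c in chunk_source("solver.py")]

Chunking at definition boundaries keeps each embedded unit semantically self-contained, which is what improves retrieval precision over fixed-size windows in large codebases.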
Significance
The impact of text- and code-based LLMs on scientific software development needs to be critically investigated. Their rapid and continuing growth has made an overwhelming number of options available to domain scientists. To this end, we develop a local and secure framework and evaluate several LLMs for code documentation and generation. We also discuss the role of test-driven development and the guardrails required to ensure the quality of generated code.
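As one concrete guardrail (a hypothetical sketch, not the framework's actual interface), an LLM-generated refactoring can be applied to a scratch copy of the repository and accepted only if the existing test suite still passes:

    # Hypothetical guardrail: keep an LLM-generated refactoring only if
    # the project's existing tests still pass. The function name and the
    # use of pytest are illustrative assumptions.
    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    def accept_refactor(repo: str, rel_path: str, new_code: str) -> bool:
        """Apply new_code to rel_path in a scratch copy of repo and
        report whether the test suite still passes."""
        with tempfile.TemporaryDirectory() as scratch:
            work = Path(scratch) / "repo"
            shutil.copytree(repo, work)
            (work / rel_path).write_text(new_code)
            # A non-zero exit code from the tests rejects the edit.
            return subprocess.run(["pytest", "-q"], cwd=work).returncode == 0

Paired with test-driven development, a check of this kind turns the existing test suite into an automatic acceptance criterion for generated code.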