
Utilizing LLMs for Industrial Process Automation: A Case Study on Modifying RAPID Programs

Published: 11/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study explores using existing Large Language Models for industrial process automation, specifically modifying RAPID programs. It finds that few-shot prompting can effectively address simple problems without extensive model training, while on-premise deployment ensures the security of sensitive company data.

Abstract

How to best use Large Language Models (LLMs) for software engineering has been covered in many publications in recent years. However, most of this work focuses on widely-used general purpose programming languages. The utility of LLMs for software within the industrial process automation domain, with highly-specialized languages that are typically only used in proprietary contexts, is still underexplored. Within this paper, we study what enterprises can achieve on their own without investing large amounts of effort into the training of models specific to the domain-specific languages that are used. We show that few-shot prompting approaches are sufficient to solve simple problems in a language that is otherwise not well-supported by an LLM and that this is possible on-premise, thereby ensuring the protection of sensitive company data.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Utilizing LLMs for Industrial Process Automation: A Case Study on Modifying RAPID Programs

1.2. Authors

  • Salim Fares: University of Passau, Passau, Germany.

  • Steffen Herbold: University of Passau, Passau, Germany.

    The authors are affiliated with the University of Passau, indicating an academic research background in computer science or software engineering. Their work is conducted in collaboration with an industrial partner, AKE Technologies, grounding the research in a real-world application context.

1.3. Journal/Conference

The paper is available as a preprint on arXiv. arXiv is a well-known open-access repository for scholarly articles in fields like physics, mathematics, computer science, and quantitative biology. As a preprint, it has not yet undergone formal peer review for publication in a conference or journal. The publication date suggests it is intended for a future venue.

1.4. Publication Year

2025 (as indicated by the publication date on arXiv).

1.5. Abstract

The paper investigates the utility of Large Language Models (LLMs) for software engineering tasks within the specialized domain of industrial process automation. Unlike most research, which focuses on popular general-purpose programming languages, this study examines how LLMs can handle highly specialized, proprietary languages with limited public data, such as ABB's RAPID. The authors explore what enterprises can achieve with existing LLMs without the significant investment required for model training. They demonstrate that few-shot prompting is an effective approach for solving simple code modification tasks in RAPID. The study emphasizes the use of an on-premise model to protect sensitive company data.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The application and effectiveness of Large Language Models (LLMs) in software engineering have been extensively studied for common programming languages like Python and Java. However, their utility for niche, domain-specific languages (DSLs) used in industrial automation remains largely unexplored. These languages, such as ABB's RAPID, are often proprietary, have very little publicly available source code for training, and are used to develop sensitive, business-critical software (e.g., production line logic).
  • Importance and Gaps: Small and Medium-sized Enterprises (SMEs) in the industrial sector could benefit from AI-powered coding assistance, but training custom LLMs for their specific DSLs is prohibitively expensive and requires vast amounts of data they may not have or be willing to share. Existing research often focuses on generating code in general-purpose languages like Python for robot control, which is not directly applicable in industrial settings where proprietary DSLs are the norm.
  • Innovative Idea: The paper's entry point is to investigate whether a general-purpose, off-the-shelf LLM can be effectively prompted to perform useful code modification tasks in a niche language without any specialized training (fine-tuning). This "low-investment" approach, combined with on-premise deployment to ensure data privacy, offers a pragmatic path for SMEs to leverage LLM technology.

2.2. Main Contributions / Findings

  • Primary Contributions:

    1. Feasibility Demonstration: The paper demonstrates that a general-purpose LLM (Llama 3.1 70B) can perform simple-to-moderately complex modifications of RAPID code using only few-shot prompt engineering. This eliminates the need for expensive, task-specific model retraining.
    2. Systematic Evaluation: The study systematically evaluates the LLM's performance across three tasks of varying complexity: argument modification (low), adding offsets (medium), and reversing routines (high). This provides insights into the current capabilities and limitations of LLMs for such tasks.
    3. Practical Framework: It proposes a practical framework consisting of an on-premise LLM and a custom rule-based validator. This combination allows for iterative generation until a correct and compliant output is produced, increasing overall accuracy while maintaining data security.
    4. Comparative Discussion: The paper provides a qualitative comparison between the LLM-based solution and a traditional, non-AI approach (e.g., static code analysis), discussing the trade-offs in terms of expertise required, flexibility, and implementation effort.
  • Key Findings:

    • For low-complexity tasks like modifying arguments, the LLM achieves near-perfect accuracy (~99%).
    • For moderately complex tasks involving structural changes like adding new instructions, accuracy remains high (~91%) but often requires multiple generation attempts.
    • For high-complexity tasks that demand rewriting logic flows, such as reversing a movement routine, performance drops but is still significant (~77-83%).

    • Using English prompts yields comparable or better performance than German prompts, even when the source code contains German comments, with the advantage most pronounced for the more complex tasks.
    • A validation mechanism is crucial for reliably using LLMs in this context, as it allows for filtering incorrect outputs and retrying generation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Large Language Models (LLMs): LLMs are a type of artificial intelligence model built on deep learning architectures (typically Transformers). They are trained on massive amounts of text and code data, enabling them to understand, generate, and manipulate human language and programming languages. Their core capability is predicting the next word or token in a sequence, which allows them to perform tasks like code generation, translation, and summarization.
  • Few-Shot Prompting: This is a technique used to guide an LLM's output without retraining it. Instead of just giving the model a command (zero-shot), you provide it with a few examples (shots) of the desired input-output format within the prompt itself. The LLM then learns the pattern from these examples and applies it to a new, unseen input. This is a cost-effective way to adapt a general-purpose LLM to a specific task.
  • Domain-Specific Language (DSL): A DSL is a programming language created for a specific application domain. Unlike a general-purpose language (e.g., Python, Java) designed for a wide range of problems, a DSL offers specialized syntax and features tailored to one area. RAPID (Robotics Application Programming Interactive Dialogue) is a DSL developed by ABB for programming their industrial robots.
  • Programmable Logic Controllers (PLCs): PLCs are ruggedized industrial computers that are the backbone of automation in factories and manufacturing plants. They are programmed to control machinery, assembly lines, robotic devices, and any activity that requires high reliability and precise timing. The software for PLCs is often written in specialized languages defined by standards like IEC 61131-3.
  • On-Premise Deployment: This refers to running software, in this case, an LLM, on a company's own local servers and infrastructure, rather than using a cloud-based service provided by a third party (like OpenAI's API). For industrial companies, on-premise deployment is critical for protecting proprietary source code, production process data, and other sensitive intellectual property.

3.2. Previous Works

The paper positions its research by contrasting it with several related areas:

  • LLMs for PLC Code Generation (LLM4PLC): Fakih et al. [7] developed LLM4PLC, a pipeline that uses LLMs to generate PLC code from natural language. Their approach is more complex, involving user feedback loops, external verification tools, and model fine-tuning using techniques like Low-Rank Adaptations (LoRAs). This contrasts with the current paper's goal of using an off-the-shelf LLM with only prompt engineering.
  • Generator-Supervisor LLM Frameworks: Antero et al. [4] proposed a two-LLM system where a "Generator" LLM creates task plans for a robot using predefined software blocks, and a "Supervisory" LLM validates the plans. This is a higher-level planning task, whereas the current paper focuses on low-level source code modification.
  • AI-Assisted Cobot Programming: Morano-Okuno et al. [15] created a framework using digital twins and AI to help non-experts program collaborative robots (cobots) interactively in a simulation. Their system suggests and refines task plans but does not generate executable code directly, which is the primary focus of this paper.
  • LLMs for General-Purpose Robot Control (Instruct2Act): Huang et al. [9] demonstrated that LLMs can generate executable Python code to control robots for manipulation tasks. This work is significant but relies on Python, a general-purpose language with extensive training data and high-level libraries. The challenge in the current paper is handling a proprietary DSL (RAPID) with a much smaller data footprint.

3.3. Technological Evolution

The use of AI in robotics and automation has evolved from traditional rule-based systems and machine learning for specific tasks (like vision) to more general, language-driven approaches. Early research focused on generating code from formal specifications. With the rise of LLMs, the focus has shifted to generating code from natural language or modifying existing codebases. However, much of this progress has been concentrated in the open-source software world, leveraging massive datasets like GitHub. The industrial automation sector, with its proprietary systems and closed data ecosystems, has been slower to adopt these technologies. This paper represents a step towards bridging that gap by exploring low-cost, high-privacy methods for applying modern LLMs in these constrained environments.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach is differentiated by:

  • Focus on a Proprietary DSL: It specifically targets RAPID, a niche, proprietary language, unlike most studies that use Python or other mainstream languages.
  • No Fine-Tuning: The core innovation is achieving useful results without any model retraining or fine-tuning, relying solely on prompt engineering. This makes the approach accessible and affordable for SMEs.
  • Emphasis on Code Modification: The paper targets the practical, frequently occurring task of modifying existing code routines rather than generating entire programs from scratch.
  • On-Premise and Data Privacy: The methodology is explicitly designed around using a local, on-premise LLM to ensure that sensitive industrial source code never leaves the company's secure environment.
  • Pragmatism: The inclusion of a rule-based validator acknowledges the current unreliability of LLMs and provides a practical mechanism to ensure the quality and correctness of the final output.

4. Methodology

4.1. Principles

The core idea of the paper is to test the hypothesis that a large, general-purpose LLM, despite having limited or no explicit training on a niche DSL like RAPID, can perform meaningful code modification tasks. The intuition is that RAPID shares structural and syntactic similarities with other programming languages (e.g., Pascal) present in the LLM's vast pre-training data. By leveraging this latent knowledge through carefully crafted few-shot prompts, the model can recognize patterns and apply transformations correctly without needing to be an "expert" in the specific language. This approach is paired with a deterministic, rule-based validator to catch errors and ensure the generated code adheres to strict industrial standards.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology can be broken down into four main components: model selection, task design, prompt engineering, and output validation.

4.2.1. Model Selection

The authors chose the Llama 3.1 70B model. The rationale for this choice includes:

  • Large Parameter Size: A 70-billion parameter model is expected to have strong capabilities in understanding code syntax, semantics, and language patterns, which is crucial for the task.
  • On-Premise Feasibility: This model can be run for inference on a single high-end GPU (like an NVIDIA A100), making it feasible for an SME to set up their own local inference server without requiring a massive hardware cluster. This directly addresses the critical requirement of data privacy.
  • Open-Weight Model: As an open-weight model, it provides flexibility for local deployment and avoids reliance on proprietary, cloud-based APIs, which might have usage restrictions or privacy concerns.

4.2.2. Task Design

To systematically evaluate the LLM's capabilities, the authors designed three code modification tasks with increasing complexity. These tasks were chosen in collaboration with their industrial partner, AKE Technologies, to reflect common and recurring needs in their development workflow.

  1. Task 1: Modifying routine arguments (Low Complexity):

    • Description: This task involves changing a specific argument (e.g., velocity) to a new value across all instructions within a given RAPID routine.
    • Example (from Table 2): The user prompt is "Use velocity velocity_2.". The LLM must find all instances of velocity_1 in the input routine and replace them with velocity_2.
    • Why it's low complexity: This is primarily a pattern recognition and substitution task, similar to a "find and replace" operation that respects the code's structure (a minimal rule-based sketch of this substitution follows the task list).
  2. Task 2: Adding an Offset instruction (Medium Complexity):

    • Description: This task requires adding a new movement instruction that applies a positional offset to an existing position.
    • Example (from Table 3): The user prompt might be "Add an intermediate movement using Relative Tool after the start position, with 200 on the X-Axis". The LLM must insert a new MoveJ instruction with a RelTool() function call in the correct sequence within the routine.
    • Why it's medium complexity: This goes beyond simple substitution. The model must understand where to insert the new code, construct a valid new instruction using information from the prompt and the existing code, and not alter the original instructions.
  3. Task 3: Reversing a routine (High Complexity):

    • Description: This task involves generating the logical reverse of a given movement routine.
    • Example (from Table 4): A routine moving from position1 to position2 must be rewritten to move from position2 to position1.
    • Why it's high complexity: This is not a simple reordering of lines. The model must:
      • Reverse the order of all movement instructions.
      • Update the routine's header (e.g., PROC mvid1_id2 becomes PROC mvid2_id1).
      • Rewrite comments to reflect the new start and end positions.
      • Handle special cases defined by AKE Technologies, such as adding or removing intermediate instructions when moving to or from a "HOME" position. This requires deeper logical reasoning about the entire procedure.
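
To make the notion of task complexity concrete, the following is a minimal sketch of how Task 1 could be solved deterministically with plain string processing. This is illustrative only and not the authors' implementation; the helper name set_velocity and the assumption that the velocity is always the third comma-separated argument of a movement instruction are ours. The more complex tasks (inserting offsets, reversing with HOME-position special cases) resist this kind of hard-coded treatment, which motivates the LLM-based approach.

    import re

    def set_velocity(routine: str, new_velocity: str) -> str:
        """Replace the velocity argument of every movement instruction.

        Illustrative sketch: assumes instructions of the form
        'MoveJ id,position,velocity,zone,tool\\WObj:=...;' where the velocity
        is always the third comma-separated argument (the convention in the
        simple movement routines shown in Listing 1).
        """
        out = []
        for line in routine.splitlines():
            if re.match(r"\s*(MT_)?Move[JL]\b", line):
                parts = line.split(",")
                if len(parts) >= 3:
                    parts[2] = new_velocity  # third argument is the velocity
                line = ",".join(parts)
            out.append(line)
        return "\n".join(out)

    # Example: rewrite Listing 1 so that every instruction uses velocity_2.
    # print(set_velocity(listing_1_source, "velocity_2"))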

4.2.3. Prompt Engineering

The authors use a few-shot prompting strategy. Each prompt consists of two parts:

  • System Prompt: This prompt is provided to the LLM at the beginning of the conversation to set the context and rules. It includes:

    • A persona instruction (e.g., "You are an expert in robot programming.").
    • General rules and formatting guidelines provided by AKE Technologies.
    • Examples of correct syntax for RAPID instructions.
    • One or more examples of the complete task (input and expected output) to demonstrate the desired transformation pattern.

    The paper provides an excerpt of a system prompt, shown in Listing 3:

      - You are an expert in robot programming.
      - Give only the OUTPUT, no further explanations.
      - This is how to formulate a movement instruction: MOVEMENT_TYPE TARGET_POSITION, VELOCITY, ZONE, TOOL\WObj:=WORK_OBJECT; EXAMPLE: MoveJ pR7_400,vR7_rapid,z50,toR7_active\WObj:=woR7_Base;
      - Do not add any additional instructions.
      - If the movement type is Machine_Tending, you must add Machine_Tending_ID as follows: MOVEMENT_TYPE ID,TARGET_POSITION, VELOCITY, ZONE, TOOL\WObj:=WORK_OBJECT; EXAMPLE: MT_MoveJ 400,pR7_400,vR7_rapid,z50,toR7_active\WObj:=woR7_Base;
      - The first movement instruction in a movement routine always has rapid velocity, active tool, and base wObj.
      [EXAMPLES]
  • User Prompt: This contains the specific input for the task at hand, including the original RAPID code and the natural language instruction for the modification (e.g., "Use velocity velocity_2.").

    The study also experimented with using both English and German for the prompts, as the source code comments were in German, to see if the prompt language affected performance.
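
As a rough illustration of how such a few-shot prompt might be assembled and sent to an on-premise model, the sketch below posts a chat-style request to a local OpenAI-compatible endpoint. The endpoint URL, model identifier, sampling temperature, and the function name modify_routine are assumptions for illustration; the paper does not specify the serving stack, and the system prompt here is an abbreviation of Listing 3.

    import requests

    # Abbreviated system prompt in the spirit of Listing 3; the real prompt
    # also contains AKE's formatting rules and complete few-shot examples.
    SYSTEM_PROMPT = (
        "You are an expert in robot programming.\n"
        "Give only the OUTPUT, no further explanations.\n"
        "This is how to formulate a movement instruction:\n"
        "MOVEMENT_TYPE TARGET_POSITION, VELOCITY, ZONE, TOOL\\WObj:=WORK_OBJECT;\n"
        "[EXAMPLES]"
    )

    def modify_routine(routine: str, instruction: str,
                       endpoint: str = "http://localhost:8000/v1/chat/completions",
                       model: str = "llama-3.1-70b-instruct") -> str:
        """Send one few-shot code-modification request to a local LLM server."""
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{instruction}\n\n{routine}"},
        ]
        response = requests.post(
            endpoint,
            json={"model": model, "messages": messages, "temperature": 0.2},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]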

4.2.4. Output Validation

Recognizing that LLM outputs can be inconsistent or incorrect, the authors developed a custom rule-based validator. This validator automatically checks the generated RAPID code against a set of rules derived from AKE Technologies' coding standards and the specific task requirements. This step is critical for ensuring the reliability of the system in an industrial setting. The validator can detect a wide range of errors, as detailed in Table 5 of the paper, including:

  • Argument Errors: Checking if the correct argument was modified as requested.

  • Structural Errors: Verifying that no unintended code was added, removed, or changed.

  • Logical Errors: For the reversing task, ensuring the source/destination swap and instruction order are correct.

  • Formatting Errors: Making sure identifiers and default values adhere to company standards.

    This validation mechanism enables an important practical workflow: if an output fails validation, it is discarded, and the LLM can be prompted again until a valid output is generated.
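
A minimal sketch of this generate-validate-retry workflow is shown below. The toy validator implements only a small subset of the Table 5 checks for the argument-modification task; the actual rule set and its implementation are not published, so the helper names validate_argument_modification and generate_until_valid and their logic are illustrative assumptions.

    from typing import Callable, Optional

    def validate_argument_modification(original: str, generated: str,
                                       new_velocity: str) -> list[str]:
        """Toy subset of the Table 5 checks; returns the detected mistakes."""
        errors = []
        orig = [l for l in original.splitlines() if l.strip()]
        gen = [l for l in generated.splitlines() if l.strip()]
        if len(gen) > len(orig):
            errors.append("More instructions")   # unintended lines were added
        if len(gen) < len(orig):
            errors.append("Less instructions")   # lines were removed
        if orig and gen and orig[0] != gen[0]:
            errors.append("KEY was changed")     # routine definition altered
        for line in gen:
            stripped = line.lstrip()
            if stripped.startswith(("MoveJ", "MoveL", "MT_MoveJ")) \
                    and new_velocity not in line:
                errors.append("Wrong ARGUMENT")  # velocity was not updated
        return errors

    def generate_until_valid(generate: Callable[[], str],
                             validate: Callable[[str], list[str]],
                             max_attempts: int = 10) -> Optional[str]:
        """Retry generation until the validator reports no mistakes."""
        for _ in range(max_attempts):
            candidate = generate()
            if not validate(candidate):
                return candidate
        return None  # every attempt failed validation

The same mechanism underlies the evaluation: ten outputs are generated per input and each is checked by the validator.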

5. Experimental Setup

5.1. Datasets

  • Source: The data was sourced from 466 project backup files from 75 unique projects at AKE Technologies, an industrial partner specializing in production systems.

  • Characteristics: The dataset consists of RAPID source code for programming robotic arms. A notable characteristic is the mix of languages: RAPID keywords and identifiers are in English, while code comments are in German. The source code is highly structured and modular, following strict company guidelines.

  • Scale and Preprocessing: From the backups, the authors extracted 18,995 procedures, which were categorized as follows:

    • Simple movement routines: 12,196
    • Complex movement routines: 3,293
    • Other procedures: 3,506
    The study focuses on a subset of simple movement routines between two Pre-Positions (passing through positions), totaling 1,731 examples. This focus was chosen because these routines have a consistent structure, making it feasible to build a reliable rule-based validator. Of these, 11 were used for crafting prompts, and 1,720 were used for testing (see the extraction sketch after this list).
  • Data Sample Example: The paper provides examples of simple and complex routines.

    • Listing 1: Simple movement routine example:
      PROC mvid1_id2()
      !From: Start Position
      !To: End Position
      MoveJ id1,position1,velocity,zone,tool\wObj:=world_object\NoMove;
      MoveJ id2,position2,velocity,zone,tool\wObj:=world_object;
      ENDPROC
      
    • Listing 2: Complex movement routine example:
      PROC special_case_routine() 
      MoveJ Offs(position1,0,0,100), velocity, zone, tool; 
      MoveL position1, velocity, zone, tool; 
      Stop; 
      MoveL Offs(position1,0,0,50), velocity, zone, tool; 
      MoveJ Offs(position2,0,0,50), velocity, zone, tool; 
      MoveL position2, velocity, zone, tool; 
      Stop; 
      MoveL Offs(position2,0,0,50), velocity, zone, tool; 
      MoveJ Offs(position3,0,0,50), velocity, zone, tool; 
      MoveL position3, velocity, zone, tool; 
      Stop; 
      MoveL Offs(position3,0,0,50), velocity, zone, tool; 
      MoveJ Offs(position4,0,0,50), velocity, zone, tool; 
      MoveL position4, velocity, zone, tool;   
      ENDPROC
      
  • Reasoning for Dataset Choice: Using real-world industrial code from AKE Technologies ensures the practical relevance of the study and allows the evaluation to be conducted within the constraints of actual company coding standards.
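
The paper does not describe the extraction tooling, so the following is a hypothetical sketch of how procedures might be pulled from RAPID modules and heuristically classified as simple movement routines. The regular expression, the helper names extract_procedures and is_simple_movement_routine, and the classification heuristic are assumptions, not the authors' actual criteria.

    import re

    PROC_RE = re.compile(r"PROC\s+\w+\([^)]*\).*?ENDPROC", re.DOTALL)

    def extract_procedures(module_source: str) -> list[str]:
        """Return every PROC ... ENDPROC block found in a RAPID module."""
        return PROC_RE.findall(module_source)

    def is_simple_movement_routine(proc: str) -> bool:
        """Heuristic: the body contains only comments and movement instructions."""
        body = proc.splitlines()[1:-1]  # drop the PROC header and ENDPROC
        for line in body:
            line = line.strip()
            if not line or line.startswith("!"):
                continue  # blank lines and comments are allowed
            if not re.match(r"(MT_)?Move[JL]\b", line):
                return False  # anything else marks the routine as complex
        return True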

5.2. Evaluation Metrics

To quantify the LLM's performance, the authors generated ten outputs for each input and used the following metrics:

  1. SuccessRate(i)

    • Conceptual Definition: For a single input $i$, this metric measures the percentage of generated outputs (out of ten) that are deemed correct by the custom validator. It reflects the reliability or consistency of the LLM for a given task instance.
    • Mathematical Formula: $\mathrm{SuccessRate}(i) = \frac{|\text{Correct outputs for input } i|}{|\text{All outputs for input } i|} \times 100\%$
    • Symbol Explanation:
      • $i$: A single input example from the test set.
      • $|\text{Correct outputs for input } i|$: The number of generated outputs for input $i$ that passed the validation checks.
      • $|\text{All outputs for input } i|$: The total number of outputs generated for input $i$, which is 10 in this study.
  2. Frequency(x)

    • Conceptual Definition: This metric counts how many input examples in the entire dataset achieved a specific success rate $x$. It is used to create the histograms in Figure 2, showing the distribution of reliability across all test cases.
    • Mathematical Formula: $\mathrm{Frequency}(x) = |\{ i \in \text{Inputs} \mid \mathrm{SuccessRate}(i) = x \}|$
    • Symbol Explanation:
      • $x$: A specific success rate value (e.g., 100%, 90%, ..., 0%).
      • $\text{Inputs}$: The set of all input examples in the test data.
      • $|\{\ldots\}|$: The number of inputs $i$ for which $\mathrm{SuccessRate}(i)$ equals $x$.
  3. Accuracy

    • Conceptual Definition: This is the primary top-level metric. It is defined as the percentage of input examples for which the LLM was able to generate at least one correct output in the ten attempts. This metric reflects the overall capability of the system (LLM + validator + retry mechanism) to eventually solve the task.
    • Mathematical Formula: $\mathrm{Accuracy} = \frac{|\{ i \in \text{Inputs} \mid \mathrm{SuccessRate}(i) > 0 \}|}{|\text{Inputs}|} \times 100\%$
    • Symbol Explanation:
      • $\text{Inputs}$: The set of all input examples in the test data.
      • $|\{ i \in \text{Inputs} \mid \mathrm{SuccessRate}(i) > 0 \}|$: The count of input examples for which at least one of the ten generated outputs was correct.
      • $|\text{Inputs}|$: The total number of input examples in the test data.
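
Under these definitions, the three metrics can be computed directly from the per-output validator verdicts. The sketch below is a minimal worked example; the function names success_rate and evaluate and the variable names are illustrative, and the verdicts are assumed to come from the rule-based validator described above.

    from collections import Counter

    def success_rate(valid_flags: list[bool]) -> float:
        """SuccessRate(i): share of the ten outputs for one input that passed."""
        return 100.0 * sum(valid_flags) / len(valid_flags)

    def evaluate(results: dict[str, list[bool]]) -> tuple[Counter, float]:
        """results maps each input id to the validator verdicts of its ten outputs."""
        rates = {i: success_rate(flags) for i, flags in results.items()}
        frequency = Counter(rates.values())     # Frequency(x): histogram of rates
        solved = sum(1 for r in rates.values() if r > 0)
        accuracy = 100.0 * solved / len(rates)  # Accuracy: >=1 correct output in 10
        return frequency, accuracy

    # Worked toy example: three inputs, ten verdicts each.
    # Input "a": all ten correct, "b": 3/10 correct, "c": none correct.
    example = {"a": [True] * 10, "b": [True] * 3 + [False] * 7, "c": [False] * 10}
    freq, acc = evaluate(example)
    print(freq)  # Counter({100.0: 1, 30.0: 1, 0.0: 1})
    print(acc)   # 66.66... (inputs "a" and "b" had at least one correct output)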

5.3. Baselines

The paper does not use traditional baseline models for comparison (e.g., comparing Llama 3.1 to GPT-4 or other models). The experiment is designed as a feasibility study to answer whether a single, representative, off-the-shelf LLM can perform these tasks. The implicit baseline is the current manual process performed by human developers. The main comparisons drawn in the paper are:

  • Across Tasks: Comparing the LLM's performance on tasks of low, medium, and high complexity.
  • Across Prompt Languages: Comparing the results of using English prompts versus German prompts.
In the discussion, a non-LLM, rule-based static analysis approach is considered as a conceptual alternative.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate a clear relationship between task complexity and LLM performance. The system (LLM with a retry mechanism) is highly effective for simpler tasks but becomes less reliable as logical complexity increases. English prompts perform comparably to or better than German prompts, with the advantage most pronounced for the complex reversing task.

The following are the results from Table 6 of the original paper:

Task | Accuracy (German prompts) | Accuracy (English prompts)
Arguments Modification | 99.71% | 99.36%
Adding an Offset | 91.86% | 91.97%
Reversing | 77.27% | 83.72%
  • Argument Modification (Low Complexity): The LLM achieved near-perfect accuracy (>99%) with both English and German prompts. This confirms that the model can handle simple, syntax-aware "find and replace" operations with high reliability.
  • Adding an Offset (Medium Complexity): Accuracy was high at ~92%. This shows the model can perform structural modifications like inserting new code, but the drop from 99% indicates that it is a more challenging task. The need for multiple attempts becomes more apparent here.
  • Reversing a Routine (High Complexity): Accuracy dropped significantly to 83.72% with English prompts and even lower to 77.27% with German prompts. This highlights the LLM's limitations in handling tasks that require deeper logical reasoning, sequence manipulation, and adherence to complex, context-dependent rules (like the "HOME" position logic). The superior performance of English prompts suggests that the model's core reasoning capabilities are better aligned with English, the predominant language in its training data.

6.2. Data Presentation (Tables & Figures)

To provide concrete examples of the tasks, the paper includes Tables 2, 3, and 4.

The following is the content from Table 2: Modifying routine arguments example:

User prompt: "Use velocity velocity_2."
Input Output
PROC mvid1_id2()
!From: Start Position
!To: End Position
MoveJ id1,position1,velocity_1,zone,tool\wObj:=world_object\NoMove;
MoveJ id2,position2,velocity_1,zone,tool\wObj:=world_object;
ENDPROC

PROC mvid1_id2()
!From: Start Position
!To: End Position
MoveJ id1,position1,velocity_2,zone,tool\wObj :=world_object\NoMove;
MoveJ id2,position2,velocity_2,zone,tool\wObj :=world_object;
ENDPROC

The following is the content from Table 3: Adding an Offset instruction to a routine example:

User prompt: "Add an intermediate movement using Relative Tool after the start position, with 200 on the X-Axis"
Input Output
PROC mvid1_id2()
!From: Start Position
!To: End Position
MoveJ id1,position1,velocity,zone,tool\wObj:=world_object\NoMove;
MoveJ id2,position2,velocity,zone,tool\wObj:=world_object;
ENDPROC

PROC mvid1_id2()
!From: Start Position
!To:    End Position
MoveJ id1,position1,velocity,zone,tool\wObj:=world_object\NoMove;
MoveJ id_intermediate,RelTool(position1,0,100,0,...),velocity,zone,tool\wObj:=world_object\NoMove;
MoveJ id2,position2,velocity,zone,tool\wObj:=world_object;
ENDPROC

The following is the content from Table 4: Reversing a routine example:

Input Output
PROC mvid1_id2()
!From: Start Position
!To: End Position
MoveJ id1,position1,velocity,zone,tool\wObj:= world_object\NoMove;
MoveJ id2,position2,velocity,zone,tool\wObj:= world_object;
ENDPROC

PROC mvid2_id1()
!From: End Position
!To: Start Position
MoveJ id2,position2,velocity,zone,tool\wObj:= world_object\NoMove;
MoveJ id1,position1,velocity,zone,tool\wObj:= world_object;
ENDPROC

The following are the validation checks from Table 5: Possible mistakes which can be detected by the validator:

Task | Mistake | Description
Argument Modification | Wrong ARGUMENT | After the modification, the argument specified in the prompt was wrongly changed, or left unchanged, in one or more instructions.
Argument Modification | KEY was changed | Something changed in the definition of the routine.
Adding Offset | No Offset | No offset instruction was added.
Adding Offset | Instruction changed | One or more arguments in an existing instruction were changed.
Adding Offset | Wrong position | The offset was not applied to the specified position.
Adding Offset | Wrong function | There are two offset functions, Offs() and RelTool(), and the model did not apply the function specified by the user.
Adding Offset | KEY was changed | Something changed in the definition of the routine.
Reversing | Wrong reverse logic | Source and destination were not swapped, or the order of instructions was not correctly reversed.
Reversing | Leaving HOME wrongly | The routine should have an intermediate instruction when leaving HOME.
Reversing | Returning HOME wrongly | An unnecessary intermediate instruction before returning to HOME was added.
Reversing | Wrong movement type | The movement type of the instructions was not adjusted in special cases (e.g., cases involving a combination of linear and non-linear movements).
Reversing | Mismatching types | After reversing, the model wrongly changed the movement type of one or more instructions (in non-special cases).
Reversing | Wrong ID | Wrong ID for the intermediate instructions.
Reversing | Wrong position | Wrong position for the intermediate instructions.
All Tasks | NoMove instruction | The first instruction did not have a NoMove parameter.
All Tasks | Invalid IDENTIFIER | The identifier's formatting does not adhere to AKE's standard.
All Tasks | More instructions | More than one instruction was added.
All Tasks | Less instructions | One or more instructions were removed from the input routine.
All Tasks | More routines | More routines were generated than were in the input.
All Tasks | Less routines | Fewer routines were generated than were in the input.
All Tasks | Wrong default values | After modification, the first instruction did not have default values.

6.2.1. Analysis of Error Types (Figure 1)

The following figure (Figure 1 from the original paper) shows the types of mistakes detected for inputs where the LLM failed on all ten attempts.

Figure 1: The mistakes detected by the customized validator for the results of each task. The chart shows the type and count of mistakes detected for argument modification, adding an offset, and reversing, with counts compared between German and English prompts.

  • Argument Modification: The errors are minimal, mainly occurring in special cases where long position names cause the LLM to incorrectly alter other arguments.
  • Adding Offset: The most common errors were the LLM unnecessarily modifying arguments in the original instructions or adding extra, unwanted instructions. This suggests a tendency to over-edit when asked to insert code.
  • Reversing: The most frequent errors relate to incorrectly handling the special logic for the HOME position (either failing to add a required intermediate step or adding an unnecessary one). This confirms that the model struggles most with complex, conditional logic that deviates from a simple instruction reordering pattern.

6.2.2. Analysis of Success Rate Distribution (Figure 2)

The following figure (Figure 2 from the original paper) compares the frequency of different success rates for each task.

Figure 2: Comparison of success rates in English and German for each task. The chart shows, for argument modification (a), adding an offset (b), and reversing (c), how many inputs achieved each success rate, compared between German and English prompts.

This chart reveals the importance of the retry mechanism.

  • Argument Modification (a): With English prompts, the model achieved a 100% success rate (all 10 outputs correct) for ~74% of the inputs. Failures were extremely rare.
  • Adding Offset (b) and Reversing (c): The proportion of inputs with a 100% success rate drops significantly (to ~40% and ~53% respectively with English prompts). However, the number of inputs with a success rate between 10% and 90% is substantial. This means that while the model is not perfectly reliable on the first try for these tasks, it is often capable of producing a correct answer within a few attempts. The Accuracy metric (which only requires one success in ten) is therefore a good measure of the system's potential, while this frequency distribution shows its consistency.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully demonstrates that general-purpose LLMs can be adapted for simple programming tasks in niche, industrial domains without requiring expensive fine-tuning. Through few-shot prompting, an off-the-shelf model like Llama 3.1 70B can perform code modifications in the RAPID language with high accuracy for tasks involving syntactic changes. Performance degrades as the tasks require more complex logical reasoning and structural manipulation, but the overall accuracy remains high enough to be practical when combined with a validation and retry mechanism. The findings suggest that LLMs can serve as valuable, low-investment coding assistants in industrial workflows for well-defined, structured modification tasks.

7.2. Limitations & Future Work

  • Limitations (Threats to Validity): The authors acknowledge several limitations:

    • Generalizability: The study was conducted on RAPID code from a single company (AKE Technologies) with very strict coding guidelines. The results may not generalize to other DSLs or to companies with less structured codebases.
    • Task Scope: The study focuses on specific, relatively simple modification tasks. The approach may not be sufficient for more complex programming challenges that require deep program understanding or global code restructuring.
    • Experimental Setting: The evaluation was a controlled simulation. It did not measure the actual impact on developer productivity in a real-world setting.
  • Future Work: The authors plan to address these limitations by:

    • Integrating the LLM-based automation tool directly into the development process at their partner company.
    • Developing concrete use-case scenarios to measure the real-world impact on developer efficiency.
    • Exploring more complex scenarios, potentially by enabling the LLM to process higher-level documents like production schedules to automatically determine what code modifications are needed.

7.3. Personal Insights & Critique

  • Pragmatism and Accessibility: The most compelling aspect of this paper is its pragmatic approach. The combination of a general-purpose, on-premise LLM with a simple, rule-based validator is an elegant and highly practical solution for SMEs. It lowers the barrier to adopting AI by avoiding the need for deep ML expertise, large datasets for fine-tuning, and expensive cloud services, while simultaneously addressing critical data privacy concerns.
  • The Role of the Validator: The paper implicitly highlights a crucial pattern for applied AI: "LLM for generation, rules for verification." Given that current LLMs are not 100% reliable, a deterministic validator is essential for deploying them in any safety-critical or high-precision domain like industrial automation. This hybrid approach leverages the flexibility of the LLM while ensuring the robustness of a traditional system.
  • Comparison to Alternatives: The discussion comparing the LLM approach to a static analysis solution is insightful. While a static analysis tool might be more reliable once built, the authors correctly point out that it requires specialized expertise (e.g., in compiler design and AST manipulation) that an SME might lack. Prompt engineering, by contrast, is more accessible to domain experts. However, this comparison remains qualitative. An empirical study comparing the development effort and performance of both approaches would have been a valuable addition.
  • Beyond Code Snippets: The authors' reflection on integrating the system into the broader development process is key. The true value of such an assistant is not in rewriting single, short blocks of code, but in automating large-scale, repetitive changes across multiple files or even entire projects. The idea of extending the LLM's context to include planning documents (schedules) points towards a more powerful, next-generation "copilot" for industrial automation, moving from a code modifier to a true process automation assistant.
