# Understanding LLMs and their Application in Code Generation

In this exercise session, we will interact with different types of LLMs (general models and coders), adjust their parameters, and apply some of the prompt engineering techniques covered in the lecture. The only recommendation is to start with the "Model Inference" section to learn how to run inference on the available models. After completing this section, feel free to explore the five exercises of this notebook in the order you prefer:

- Coding Tasks.
- Tokenizer.
- Temperature.
- Self-consistency.
- Feedback Approaches.


## Model Inference

Thanks to the AccGPT team, we have API access to several LLMs for this lecture. The idea is to try out the different models available. You can select any model from the following list:

In [1]:
# Groq models
models = ["qwen-2.5-32b", "qwen-2.5-coder-32b", "deepseek-r1-distill-qwen-32b", "deepseek-r1-distill-llama-70b", "llama-3.1-8b-instant", "llama-3.3-70b-versatile"]

The following function implements an API call to the model of your choice, for a concrete question (or _prompt_):

In [2]:
import requests
import json

def model_inference(model: str, prompt: str, temperature: float = 0):
    base_url = "http://cs-513-ml003:3000"
    API_key = "AccGPT-API"
    
    # Health check
    health_response = requests.get(f"{base_url}/health")
    health_status = health_response.json().get("status")
    if health_status != "ok":
        print("Health check failed:", health_response.json())
        return None
    
    # Each model has its own chat template. For Qwen:
    # messages = [
    #     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    #     {"role": "user", "content": prompt}
    # ]

    # Chat request
    chat_payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16000,
        "model": model,
        "temperature": temperature,
        "n": 1 # Groq only supports n=1. Larger values will be set to n=1
    }
    
    chat_response = requests.post(
        f"{base_url}/chat",
        json=chat_payload,
        headers={"X-API-Key": API_key}
    )
    
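    response_json = chat_response.json()
    # Fall back to a placeholder if "choices" is missing or empty (avoids an IndexError)
    choices = response_json.get("choices") or [{}]
    completion = choices[0].get("message", {}).get("content", "No response")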
    
    return completion

Interacting with a model is therefore as simple as a function call. For example, with [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct):

In [3]:
response = model_inference("qwen-2.5-32b", "What is a Coder LLM?")

In [4]:
print(response)

A "Coder LLM" is not a standard term in the field of artificial intelligence or software development, but it can be inferred to refer to a specific application or specialization of a Large Language Model (LLM) in the context of coding or software development.

A Large Language Model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are trained on vast amounts of text data and can perform a wide range of tasks, such as language translation, text summarization, and answering questions.

When we talk about a "Coder LLM," we might be referring to an LLM that has been specifically trained or fine-tuned to understand and generate code. This could include:

1. **Understanding Code**: The model can comprehend programming languages, understand the structure and logic of code, and possibly even debug or suggest improvements.
2. **Generating Code**: The model can write code based on given specifications or descriptions, which can be

In [5]:
# TODO: Run some inferences on the models of your choice. Feel free to try some coding tasks too!

Now that we can interact with the models, feel free to jump to any of the exercises depending on your interests.

## Coding Tasks

As discussed in the lecture, there are many types of coding tasks. The goal of this exercise is to explore the various coding tasks that an LLM can perform, such as NL2Code, code completion, code explanation, code refactoring, adding type hints, docstring generation, bug fixing, and code translation.

In this exercise, you are free to define different coding tasks for the models and observe their behavior. Here are some examples:

* Code explanation: "Explain what this Python function does in simple terms."
* Code refactoring: "Rewrite this code to be more efficient."
* Type Hints: "Add type hints to this function."
* Bug Fixing: Introduce an error and ask the models to debug it.

Among the models available, we have the general model [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) and the coder variant [Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct). **Do you observe any differences in the generations between those models?**

If you need some inspiration, here are some examples:

### 1. Code Optimization

In [6]:
code_snippet = """
def process_data(data):
    result = []
    for i in range(len(data) - 1):
        if data[i] < data[i + 1]:
            result.append(data[i + 1] - data[i])
    return result
"""

In [7]:
prompt = f"""
Given the following code snippet, refactor the code to be more pythonic.

<code>
{code_snippet}
</code>
"""

In [8]:
print(model_inference("qwen-2.5-32b", prompt))

Certainly! The provided code can be made more Pythonic by using list comprehensions and leveraging Python's built-in functions for a more concise and readable approach. Here's a refactored version of the code:

```python
def process_data(data):
    return [data[i + 1] - data[i] for i in range(len(data) - 1) if data[i] < data[i + 1]]
```

This version uses a list comprehension to achieve the same functionality in a more concise manner.


In [9]:
print(model_inference("qwen-2.5-coder-32b", prompt))

To make the code more Pythonic, we can use list comprehensions and the `zip` function to simplify the logic. Here's the refactored version:

```python
def process_data(data):
    return [b - a for a, b in zip(data, data[1:]) if b > a]
```

### Explanation:
- **`zip(data, data[1:])`**: This pairs each element with the next one in the list, effectively creating tuples of consecutive elements.
- **List comprehension**: This is a concise way to create lists. It iterates over the pairs created by `zip` and includes the difference `b - a` in the result list only if `b > a`.
- **Readability**: This version is more concise and easier to read, as it directly expresses the intent of the code.


### 2. Bug Fixing

Let's introduce a bug in the function!

In [10]:
buggy_code_snippet = """
def process_data(data):
    result = []
    for i in range(len(data)):
        if data[i] < data[i + 1]:
            result.append(data[i + 1] - data[i])
    return result
"""

In [11]:
prompt = f"""
Given the following code snippet, is there any bug?

<code>
{code_snippet}
</code>
"""

In [12]:
print(model_inference("qwen-2.5-32b", prompt))

The code snippet provided does not contain any syntax errors, but whether it has a "bug" or not depends on the intended functionality and the context in which it is used. Let's analyze the function:

```python
def process_data(data):
    result = []
    for i in range(len(data) - 1):
        if data[i] < data[i + 1]:
            result.append(data[i + 1] - data[i])
    return result
```

This function processes a list of numbers (`data`) and returns a list of differences between consecutive elements where the latter is greater than the former. Here are some points to consider:

1. **Functionality**: The function calculates the difference between consecutive elements only if the next element is greater than the current one. If this is the intended behavior, then the function is correct. However, if the intention was to calculate the difference between all consecutive elements regardless of their order, then the function would need to be modified.

2. **Edge Cases**:
   - If the input li

In [13]:
print(model_inference("qwen-2.5-coder-32b", prompt))

The provided code snippet defines a function `process_data` that processes a list of numbers (`data`) and returns a new list containing the differences between consecutive elements where the next element is greater than the current one. The code itself does not contain any syntax errors, but there are a few considerations and potential issues to be aware of:

1. **Empty List or Single Element List**: If the input list `data` is empty or contains only one element, the function will return an empty list. This is not necessarily a bug, but it might not be the intended behavior depending on the requirements.

2. **Data Type Consistency**: The function assumes that the elements in the list `data` are numbers (integers or floats) that can be compared and subtracted. If the list contains non-numeric types, the function will raise a `TypeError`.

3. **Edge Cases**: The function does not handle edge cases explicitly, such as lists with all elements being the same or lists with decreasing values
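Both responses above analyse the original function, because the prompt interpolates `code_snippet` rather than `buggy_code_snippet`. The following is a minimal sketch of a prompt built from the buggy version, together with a quick local check that the bug is real. The name `bug_prompt` and the `exec`-based check are illustrative additions, not part of the original notebook.

```python
buggy_code_snippet = """
def process_data(data):
    result = []
    for i in range(len(data)):
        if data[i] < data[i + 1]:
            result.append(data[i + 1] - data[i])
    return result
"""

# Build the prompt from the buggy snippet this time.
bug_prompt = f"""
Given the following code snippet, is there any bug?

<code>
{buggy_code_snippet}
</code>
"""

# Confirm locally that the bug is real: on the last iteration the loop
# reads data[i + 1], which is one element past the end of the list.
namespace = {}
exec(buggy_code_snippet, namespace)
try:
    namespace["process_data"]([1, 2, 3])
    bug_triggered = False
except IndexError:
    bug_triggered = True

print("IndexError raised:", bug_triggered)
```

Passing `bug_prompt` to `model_inference()` for both models then shows whether each one spots the off-by-one error.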

Now it is your turn to try out other coding tasks. Feel free to modify the prompt as much as you want. **Can you come up with a way of comparing the output of the general model vs. the coder model?**

In [15]:
# TODO: Try different coding tasks and compare the model performance. 
# For easy tasks, models should perform equally, whereas for more complex tasks, coder models typically perform better.
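One possible way to compare two generations objectively is to run them against the original function on random inputs. The sketch below does this for the two refactorings shown in the Code Optimization example. The helper `agrees_with_reference` and the random-testing setup are illustrative assumptions, not part of the lecture material.

```python
import random

def process_data_reference(data):
    """Original implementation from the Code Optimization example."""
    result = []
    for i in range(len(data) - 1):
        if data[i] < data[i + 1]:
            result.append(data[i + 1] - data[i])
    return result

def process_data_general(data):
    """Refactoring proposed by the general model."""
    return [data[i + 1] - data[i] for i in range(len(data) - 1) if data[i] < data[i + 1]]

def process_data_coder(data):
    """Refactoring proposed by the coder model."""
    return [b - a for a, b in zip(data, data[1:]) if b > a]

def agrees_with_reference(candidate, reference, trials=200, seed=0):
    """Returns True if candidate and reference agree on random integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-10, 10) for _ in range(rng.randint(0, 8))]
        if candidate(data) != reference(data):
            return False
    return True

general_ok = agrees_with_reference(process_data_general, process_data_reference)
coder_ok = agrees_with_reference(process_data_coder, process_data_reference)
print("general model matches reference:", general_ok)
print("coder model matches reference:  ", coder_ok)
```

Behavioural equivalence is only one criterion. Readability, explanation quality and handling of edge cases are worth comparing by hand as well.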

## Tokenizer

In this exercise, the goal is to explore the tokenizer component.

The tokenizer plays a crucial role since it has a direct impact on the model input.

In fact, some completion errors are caused by the tokenizer and not by the model's capabilities. For example, the tokenizer is the reason why LLMs are generally not good at following character restrictions. LLMs do not see characters, they count in tokens!

The goal of this exercise is to try the different encodings on code snippets. **Do you observe any differences in the encoding used for coding models vs. the encodings used for natural language?**

Given a prompt and an encoding, we can observe the numerical IDs of the corresponding tokens using the `encode()` method. You can try any of the encodings available in the `tiktoken` library (check the slides if needed).

As an example, you can use the encoding for GPT-4 as follows:

In [16]:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")

In [17]:
prompt = " " # TODO: Define your desired prompt here
encoded_prompt = encoding.encode(prompt)
print(encoded_prompt)

[220]


We can use the `decode()` method to convert the prompt from numerical IDs to words:

In [18]:
encoding.decode(encoded_prompt)

' '

But what do the tokens look like?

In [19]:
[encoding.decode_single_token_bytes(token) for token in encoded_prompt]

[b' ']

### Comparing Encodings

You can use the following function to compare the three encodings listed in the table above. Try short code snippets and observe the differences. Codex is the coder model behind [GitHub Copilot](https://github.com/features/copilot). **Do you see any differences between the encodings of coder models and general models?**

In [20]:
def compare_encodings(example_string: str) -> None:
    """Prints a comparison of three string encodings."""
    print(f'\nExample string: "{example_string}"')
    # for each encoding, print the # of tokens, the token integers, and the token bytes
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")

**HINT:** Pay special attention to the encoding of white-spaces and function headers.

In [21]:
prompt = """

""" # TODO: Define your desired prompt here

In [22]:
compare_encodings(prompt)


Example string: "

"

gpt2: 1 tokens
token integers: [628]
token bytes: [b'\n\n']

p50k_base: 1 tokens
token integers: [628]
token bytes: [b'\n\n']

cl100k_base: 1 tokens
token integers: [271]
token bytes: [b'\n\n']


## Temperature

In this exercise, the goal is to experiment with the temperature parameter and, consequently, the randomness in the output. The default temperature value in `model_inference()` is 0, meaning the model always selects the token with the highest probability.

Try running inferences with different temperature values. **Do you notice any randomness in the output?** Try running the same prompt multiple times with a low temperature value.
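To build intuition for what the parameter does, the sketch below applies the standard temperature-scaled softmax, `p_i = exp(z_i / T) / sum_j exp(z_j / T)`, to a made-up set of logits. The logits and the function name are illustrative. The actual sampling happens on the server side and may involve further settings that are not shown here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Converts logits into probabilities; temperature must be > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.1, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As the temperature approaches 0, almost all probability mass concentrates on the highest-scoring token, which is why a temperature of exactly 0 is treated as greedy selection rather than plugged into the formula. Higher temperatures flatten the distribution and make less likely tokens more probable.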

In [23]:
code_snippet = """
def count_unique_words(text):
    words = text.split()
    unique_words = []
    for word in words:
        if word not in unique_words:
            unique_words.append(word)
    return len(unique_words)

"""

In [24]:
prompt = f"""
The following Python function is inefficient. 
Optimize it.

<code>
{code_snippet}
</code>
"""

In [25]:
response_low = model_inference("qwen-2.5-coder-32b", prompt, 0.1)

print("Low Temperature:\n\n", response_low)

Low Temperature:

 The current implementation of the `count_unique_words` function is inefficient because it uses a list to store unique words and checks for membership in the list using the `in` operator, which has a time complexity of O(n) for each check. This results in an overall time complexity of O(n^2) for the function.

To optimize this function, you can use a set to store unique words. Sets in Python provide average O(1) time complexity for membership checks and insertions, which will significantly improve the performance of the function. Here's the optimized version:

```python
def count_unique_words(text):
    words = text.split()
    unique_words = set(words)
    return len(unique_words)
```

This version of the function splits the text into words and directly adds them to a set, which automatically handles duplicates. Finally, it returns the number of unique words by getting the length of the set. This approach is more efficient with a time complexity of O(n).


In [26]:
response_high = model_inference("qwen-2.5-coder-32b", prompt, 0.9)

print("High Temperature Response:\n\n", response_high)

High Temperature Response:

 The given function can be optimized by using a set to keep track of unique words. Sets in Python are implemented as hash tables, which allows for average-case constant time complexity for insertions and lookups. This makes them more efficient for checking the presence of an element compared to a list.

Here's the optimized version of the function:

```python
def count_unique_words(text):
    words = text.split()
    unique_words = set(words)
    return len(unique_words)
```

In this optimized version:
- We split the text into words as before.
- We then convert the list of words into a set, which automatically filters out duplicates.
- Finally, we return the length of the set, which represents the count of unique words.

This approach is more efficient, especially for longer texts, because it reduces the time complexity from O(n^2) to O(n) on average, where n is the number of words in the text.


## Self-consistency

Self-consistency encourages the model to explore different reasoning paths by generating multiple responses to the same prompt. To achieve this, we need a new inference method that calls a different API behind the scenes.

The only model available for this purpose is [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), so there is no need to provide the model name as an input for the inference method anymore. The new inference method, `model_inference_multiple()`, accepts three parameters:

* `temperature`: Controls the randomness of the responses.
* `n`: Specifies the number of responses to generate.
* `prompt`: The input prompt for the model.

In [64]:
def model_inference_multiple(temperature: float, n: int, prompt: str):
    VLLM_ENDPOINT = "http://lbllm.cern.ch:8000/v1/chat/completions"
    VLLM_MODEL = "/models/Meta-Llama-3.1-8B-Instruct"

    headers = {"Content-Type": "application/json"}
    messages = [{"role": "user", "content": prompt}]
    data = {
        "model": VLLM_MODEL,
        "messages": messages,
        "max_tokens": 1000,
        "temperature": temperature,
        "n": n
    }

    response = requests.post(VLLM_ENDPOINT, json=data, headers=headers)
    response.raise_for_status()
    return response.json()

Let's generate 4 samples with `temperature = 0.9`:

In [56]:
response = model_inference_multiple(0.9, 4, "What is a Coder LLM? Keep it short.")

for i, choice in enumerate(response.get("choices", [])):
    print(f"Response {i+1}:", choice.get("message", {}).get("content", "No response received."), "\n")

Response 1: A Coder LLM (Large Language Model) is a type of AI model that combines natural language processing (NLP) with coding capabilities. It can understand and generate code in various programming languages, similar to how a human coder thinks and writes code. These models can be used for tasks like code completion, code reviews, and even generating code from natural language descriptions. 

Response 2: A Coder LLM (Large Language Model) is a type of artificial intelligence designed to write code (e.g., in languages like Python or Java) and assist with software development. It uses natural language processing and machine learning to generate code, debug, and offer suggestions to developers. 

Response 3: A Coder LLM, also known as a Code LLM or Code-centric LLM, is a type of Large Language Model (LLM) specifically designed to generate, complete, debug, or refactor code in a particular programming language or domain. Coder LLMs leverage AI to assist developers with writing, underst

As you can see, sampling several generations at a high temperature leads to noticeably different outputs.

### Majority Voting

We have seen that one way to process all the generated samples and converge on a single solution is the _majority voting_ approach. Implement majority voting for the following tasks:

1. Reverse a word:

In [106]:
prompt = """
Reverse the word "lollipop". Output the reversed word only.
"""

In [107]:
response = model_inference_multiple(0.4, 10, prompt)

result_list = []
for i, choice in enumerate(response.get("choices", [])):
    result = choice.get("message", {}).get("content", "No response received.")
    print(f"Response {i+1}:", result, "\n")
    result_list.append(result)

Response 1: pilolloL 

Response 2: pilolloL 

Response 3: pilolloL 

Response 4: pilopoll. 

Response 5: pilloplol 

Response 6: pilolloL 

Response 7: pilppol. 

Response 8: pippolol 

Response 9: pilolpol. 

Response 10: pilloplol 



In [27]:
# TODO: Implement the majority voting approach discussed during the lecture
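As a starting point, here is a minimal sketch of majority voting based on `collections.Counter`. The normalisation step (trimming whitespace and a trailing period) is an assumption motivated by the responses printed above, and the function name `majority_vote` is illustrative.

```python
from collections import Counter

def majority_vote(answers):
    """Returns the most frequent answer and its vote count after light normalisation."""
    if not answers:
        raise ValueError("majority_vote() needs at least one answer")
    # Trim whitespace and a trailing period so "abc." and "abc" count as the same answer
    normalised = [a.strip().rstrip(".").strip() for a in answers]
    winner, votes = Counter(normalised).most_common(1)[0]
    return winner, votes

# The ten responses printed above
samples = ["pilolloL", "pilolloL", "pilolloL", "pilopoll.", "pilloplol",
           "pilolloL", "pilppol.", "pippolol", "pilolpol.", "pilloplol"]

winner, votes = majority_vote(samples)
expected = "lollipop"[::-1]  # ground truth computed in Python

print(f"Majority answer: {winner} ({votes}/{len(samples)} votes)")
print(f"Correct answer:  {expected}")
print(f"Majority is correct: {winner == expected}")
```

In the notebook, `majority_vote(result_list)` applies the same function to fresh samples. Note that for this particular run the majority answer is not the correct one: majority voting only helps when the correct answer is also the most frequent one, which ties back to the tokenizer limitations discussed earlier.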

2. Compute the factorial of a number:

In [None]:
prompt = """
Provide the result of computing the factorial of 24. Output the final result only.
In python, it would be computed as math.factorial(24).
"""

In [None]:
# TODO: Run the majority voting approach on this new example.

You can check the correct result with the following Python code:

In [18]:
import math

result = math.factorial(24)
print(result)

620448401733239439360000


3. An example of your choice:

In [19]:
# TODO: Define a coding task, generate multiple samples and run the majority voting.

## Feedback Approaches

Finally, we will implement a feedback loop using two different models. This is also known as [model-as-a-judge](https://huggingface.co/learn/cookbook/llm_judge).

Let's go step by step. Afterwards, the goal will be to close the feedback loop. 

1. First, we define a coding task and prompt a general model to solve it:

In [2]:
task_prompt = """
Translate the given code snippet from C to Rust. Output the translated code only.

<code>
#include <stdio.h>
#include <string.h>

void reverse_string(char *str) {
    int left = 0, right = strlen(str) - 1;
    while (left < right) {
        char temp = str[left];
        str[left] = str[right];
        str[right] = temp;
        left++;
        right--;
    }
}

int main() {
    char str[] = "hello";
    reverse_string(str);
    printf("%s\\n", str);
    return 0;
}

</code>
"""

In [6]:
translation = model_inference("qwen-2.5-32b", task_prompt)

In [7]:
print(translation)

```rust
fn reverse_string(s: &mut [u8]) {
    let (mut left, mut right) = (0, s.len() - 1);
    while left < right {
        s.swap(left, right);
        left += 1;
        right -= 1;
    }
}

fn main() {
    let mut str = b"hello";
    reverse_string(&mut str);
    println!("{}", String::from_utf8_lossy(str));
}
```


2. Then, we can use an instance of the coder model to provide feedback on the task:

In [20]:
debug_prompt = f"""
You are an expert in translating from C-to-Rust.
Your task is to provide feedback on a translation task.

Do not provide the correct translation, only the feedback.

Given the following C code:
<task>
{task_prompt}
</task>

The proposed translation was:
<translation>
{translation}
</translation>
"""

In [21]:
feedback = model_inference("qwen-2.5-coder-32b", debug_prompt)

In [22]:
print(feedback)

The provided Rust translation is a good start, but there are a few points to consider for better adherence to Rust's safety and idiomatic practices. Here are some feedback points:

1. **String Initialization**:
   - The use of `String::from("hello").into_bytes()` is correct for creating a mutable vector of bytes. However, it's worth noting that this approach converts the string into a vector of bytes, which is necessary for in-place modification. If the string is guaranteed to be ASCII or UTF-8, this is fine.

2. **Function Signature**:
   - The function `reverse_string` takes a mutable slice of bytes (`&mut [u8]`). This is appropriate for modifying the string in place, but it's important to ensure that the input is indeed a valid UTF-8 byte sequence before and after the operation.

3. **Printing the String**:
   - The use of `String::from_utf8_lossy(&str)` is correct for converting the byte slice back to a string, handling any invalid UTF-8 sequences gracefully. However, if the input 

3. Finally, it is your turn: pass the coder model's feedback back to the general model so that it can improve the original translation.

In [23]:
# TODO: Close the feedback loop by providing the coder feedback to the general model.
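One way to close the loop is to assemble a follow-up prompt that contains the task, the first attempt and the feedback. The sketch below shows only the prompt construction. The function name `build_refinement_prompt`, the tag names and the wording of the instructions are assumptions, and the placeholder strings stand in for the real `task_prompt`, `translation` and `feedback` variables.

```python
def build_refinement_prompt(task_prompt: str, translation: str, feedback: str) -> str:
    """Combines the original task, the first attempt and the reviewer feedback into one prompt."""
    return f"""
You previously solved the following task:
<task>
{task_prompt}
</task>

Your translation was:
<translation>
{translation}
</translation>

A reviewer gave this feedback:
<feedback>
{feedback}
</feedback>

Revise the translation to address the feedback. Output the translated code only.
"""

# Placeholder strings so that the sketch runs on its own; in the notebook,
# pass the real task_prompt, translation and feedback variables instead.
refinement_prompt = build_refinement_prompt(
    "Translate the given code snippet from C to Rust.",
    "fn main() {}",
    "The translation is missing the reverse_string function.",
)
print(refinement_prompt)

# In the notebook, the loop could then be closed with something like:
# improved_translation = model_inference("qwen-2.5-32b", refinement_prompt)
```

Keeping the original task in the prompt matters because each call to `model_inference()` is a fresh single-turn request: the general model has no memory of its first attempt.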

There are other methods to provide feedback to the LLM, without relying on another model. As we have seen in the lectures, feedback can come from the compiler too.

In this last exercise, we will target C-to-Rust code transpilation. Given a set of C functions, the goal is to ask a coder model to transpile each function and then compile the result. If compilation fails, the compiler error message should be forwarded to the LLM so that it can correct its original transpilation.

We consider a transpilation unsuccessful if the LLM fails to produce compilable code within 5 iterations. **How many functions does the LLM manage to transpile correctly?**

Below you can find helper functions to interact with the Rust compiler and to create and compile Rust projects.



**Note:** Translating source code from one programming language into another is known as _transpilation_.

In [28]:
!cargo --version

cargo 1.78.0 (54d8815d0 2024-03-26)


In [34]:
function_dict = {
    "print_doors_status": """
void print_doors_status(char is_open[], int n_doors) {
    int door;
    for (door = 0; door < n_doors; ++door)
        printf("door #%d is %s.\\n", door + 1, (is_open[door] ? "open" : "closed"));
}
""",
    "count_greater_than": """
int count_greater_than(int arr[], int size, int value) {
    int count = 0, i;
    for (i = 0; i < size; i++) {
        if (arr[i] > value) count++;
    }
    return count;
}
""",
    "sum_array": """
int sum_array(int arr[], int size) {
    int sum = 0, i;
    for (i = 0; i < size; i++) {
        sum += arr[i];
    }
    return sum;
}
""",
    "find_max": """
int find_max(int arr[], int size) {
    int max = arr[0], i;
    for (i = 1; i < size; i++) {
        if (arr[i] > max) {
            max = arr[i];
        }
    }
    return max;
}
""",
    "reverse_array": """
void reverse_array(int arr[], int size) {
    int temp, start = 0, end = size - 1;
    while (start < end) {
        temp = arr[start];
        arr[start] = arr[end];
        arr[end] = temp;
        start++;
        end--;
    }
}
""",
    "is_sorted": """
int is_sorted(int arr[], int size) {
    int i;
    for (i = 1; i < size; i++) {
        if (arr[i] < arr[i - 1]) {
            return 0;  // False if not sorted
        }
    }
    return 1;  // True if sorted
}
""",
    "print_array": """
void print_array(int arr[], int size) {
    int i;
    for (i = 0; i < size; i++) {
        printf("%d ", arr[i]);
    }
    printf("\\n");
}
""",
    "multiply_array": """
int multiply_array(int arr[], int size) {
    int product = 1, i;
    for (i = 0; i < size; i++) {
        product *= arr[i];
    }
    return product;
}
""",
    "find_min": """
int find_min(int arr[], int size) {
    int min = arr[0], i;
    for (i = 1; i < size; i++) {
        if (arr[i] < min) {
            min = arr[i];
        }
    }
    return min;
}
""",
    "print_fibonacci": """
void print_fibonacci(int n) {
    if (n <= 0) return;
    printf("%d ", 0);
    if (n > 1) {
        int a = 0, b = 1, i;
        for (i = 2; i < n; i++) {
            int next = a + b;
            printf("%d ", next);
            a = b;
            b = next;
        }
    }
    printf("\\n");
}
""",
    "reverse_string_recursively": """
void reverse_string_recursively(char* str, int start, int end) {
    if (start >= end) return;
    char temp = str[start];
    str[start] = str[end];
    str[end] = temp;
    reverse_string_recursively(str, start + 1, end - 1);
}
""",
    "count_vowels": """
int count_vowels(char* str) {
    int count = 0;
    while (*str) {
        if (*str == 'a' || *str == 'e' || *str == 'i' || *str == 'o' || *str == 'u' || 
            *str == 'A' || *str == 'E' || *str == 'I' || *str == 'O' || *str == 'U') {
            count++;
        }
        str++;
    }
    return count;
}
""",
    "swap_using_pointers": """
void swap_using_pointers(int* a, int* b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}
""",
    "allocate_2d_array": """
int** allocate_2d_array(int rows, int cols) {
    int i;
    int** arr = (int**)malloc(rows * sizeof(int*));
    for (i = 0; i < rows; i++) {
        arr[i] = (int*)malloc(cols * sizeof(int));
        for (int j = 0; j < cols; j++) {
            arr[i][j] = 0;  // Initializing all elements to 0
        }
    }
    return arr;
}
""",
    "find_first_occurrence": """
char* find_first_occurrence(char* str, char ch) {
    while (*str) {
        if (*str == ch) {
            return str;
        }
        str++;
    }
    return NULL;
}
""",
    "merge_sorted_arrays": """
int* merge_sorted_arrays(int* arr1, int size1, int* arr2, int size2) {
    int* result = (int*)malloc((size1 + size2) * sizeof(int));
    int i = 0, j = 0, k = 0;
    
    while (i < size1 && j < size2) {
        if (arr1[i] < arr2[j]) {
            result[k++] = arr1[i++];
        } else {
            result[k++] = arr2[j++];
        }
    }
    
    while (i < size1) {
        result[k++] = arr1[i++];
    }
    
    while (j < size2) {
        result[k++] = arr2[j++];
    }
    
    return result;
}
""",
    "is_prime": """
int is_prime(int n) {
    if (n <= 1) return 0;
    for (int i = 2; i * i <= n; i++) {
        if (n % i == 0) return 0;
    }
    return 1;
}
""",
    "power_recursive": """
int power_recursive(int base, int exponent) {
    if (exponent == 0) return 1;
    return base * power_recursive(base, exponent - 1);
}
""",
    "remove_duplicates": """
int remove_duplicates(int arr[], int size) {
    if (size == 0) return 0;
    int unique_count = 1;
    
    for (int i = 1; i < size; i++) {
        if (arr[i] != arr[i - 1]) {
            arr[unique_count++] = arr[i];
        }
    }
    
    return unique_count;
}
""",    
    "toggle_doors": """
void toggle_doors(char is_open[], int n_doors) {
    int pass, door;
    for (pass = 0; pass < n_doors; ++pass)
        for (door = pass; door < n_doors; door += pass + 1)
            is_open[door] = !is_open[door];
}
"""
}

In [50]:
from pathlib import Path
import subprocess
import shutil
import os

def remove_existing_project(project_path: Path):
    """Removes the Cargo project directory if it already exists."""
    if project_path.exists():
        try:
            shutil.rmtree(project_path)
            print(f"Removed existing project at {project_path}")
        except Exception as e:
            print(f"Failed to remove {project_path}: {e}")
    else:
        print(f"Project path {project_path} does not exist.")

def create_cargo_project(project_path: Path, rs_file: Path):
    """Creates a new Cargo project as an executable and adds a .rs file to it."""
    subprocess.run(["cargo", "new", project_path.name, "--bin"], cwd=project_path.parent, check=True)
    print(f"Created new Cargo project at {project_path}")

    main_rs_path = project_path / "src" / "main.rs"  
    shutil.copy(rs_file, main_rs_path)
    print(f"Copied {rs_file} to {main_rs_path}")

def show_project_structure(project_path: Path):
    """Displays the structure of the generated Cargo project, excluding 'target'."""
    print("\nGenerated project structure:")
    for root, dirs, files in os.walk(project_path):
        dirs[:] = [d for d in dirs if d != "target"]
        level = root.replace(str(project_path), "").count(os.sep)
        indent = " " * (4 * level)
        print(f"{indent}{os.path.basename(root)}/")
        subindent = " " * (4 * (level + 1))
        for file in files:
            print(f"{subindent}{file}")

def run_cargo_command(project_path: Path, command: list):
    """Runs a cargo command and captures both stdout and stderr, returning the results."""
    print(f"\nRunning command: {' '.join(command)}")
    result = subprocess.run(command, cwd=project_path, text=True, capture_output=True)
    return result.stdout, result.stderr, result.returncode

def write_rust_translation(rust_code: str, file_path: str):
    """Writes Rust code to a file, dropping the surrounding Markdown code fence if present."""
    lines = rust_code.strip().splitlines()
    # Only trim when the model actually wrapped its answer in ``` fences
    if len(lines) > 2 and lines[0].startswith("```") and lines[-1].startswith("```"):
        lines = lines[1:-1]

    with open(file_path, "w") as file:
        file.write("\n".join(lines) + "\n")

Let's run the first iteration to showcase how to use the helper functions:

In [47]:
c_function = function_dict["print_doors_status"]
    
initial_prompt = f"""
    Translate the given code snippet from C to Rust. Output the code only.

    <code>
    {c_function}
    </code>
    """

In [36]:
print(c_function)


void print_doors_status(char is_open[], int n_doors) {
    int door;
    for (door = 0; door < n_doors; ++door)
        printf("door #%d is %s.\n", door + 1, (is_open[door] ? "open" : "closed"));
}



In [38]:
rust_translation = model_inference("qwen-2.5-coder-32b", initial_prompt).strip()

In [39]:
print(rust_translation)

```rust
fn print_doors_status(is_open: &[bool], n_doors: usize) {
    for door in 0..n_doors {
        println!("door #{} is {}.", door + 1, if is_open[door] { "open" } else { "closed" });
    }
}
```


In [55]:
rust_file_path = "translation.rs"
write_rust_translation(rust_translation, rust_file_path)
project_path = Path("./rust_project")
# Remove, if existing, and create a new Cargo project
remove_existing_project(project_path)

Removed existing project at rust_project


In [56]:
create_cargo_project(project_path, rust_file_path)
show_project_structure(project_path)

[1m[32m    Creating[0m binary (application) `rust_project` package


Created new Cargo project at rust_project
Copied translation.rs to rust_project/src/main.rs

Generated project structure:
rust_project/
    .gitignore
    Cargo.toml
    .git/
        HEAD
        config
        description
        hooks/
            applypatch-msg.sample
            commit-msg.sample
            fsmonitor-watchman.sample
            post-update.sample
            pre-applypatch.sample
            pre-commit.sample
            pre-merge-commit.sample
            pre-push.sample
            pre-rebase.sample
            pre-receive.sample
            prepare-commit-msg.sample
            push-to-checkout.sample
            sendemail-validate.sample
            update.sample
        info/
            exclude
        objects/
            info/
            pack/
        refs/
            heads/
            tags/
    src/
        main.rs


note: see more `Cargo.toml` keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html


In [59]:
!cat rust_project/src/main.rs

fn print_doors_status(is_open: &[bool], n_doors: usize) {
    for door in 0..n_doors {
        println!("door #{} is {}.", door + 1, if is_open[door] { "open" } else { "closed" });
    }
}

In [61]:
# Attempt to compile
stdout, stderr, returncode = run_cargo_command(project_path, ["cargo", "build"])


Running command: cargo build


In [65]:
if returncode != 0:
    print(f"Error occurred: {stderr}")
else:
    print(f"Command successful: {stdout}")

Error occurred:    Compiling rust_project v0.1.0 (/eos/home-i01/a/avalenzu/SWAN_projects/icsc2025/rust_project)
error[E0601]: `main` function not found in crate `rust_project`
 --> src/main.rs:5:2
  |
5 | }
  |  ^ consider adding a `main` function to `src/main.rs`

For more information about this error, try `rustc --explain E0601`.
error: could not compile `rust_project` (bin "rust_project") due to 1 previous error



It is now your turn to implement the feedback loop. Remember to cap the number of attempts at N trials. The ultimate goal is a loop that targets all the C functions provided. **How many can the LLM transpile within N iterations?**
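
Before feeding compiler messages back to the model, it may be worth cleaning them. Depending on the environment, Cargo can emit ANSI colour escape sequences in its diagnostics, which add noise to the prompt. The following is a minimal sketch, assuming only the common `ESC [ ... letter` sequences need to be removed; `strip_ansi` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches CSI escape sequences such as "\x1b[1m" or "\x1b[38;5;9m".
_ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(text: str) -> str:
    """Hypothetical helper: removes ANSI colour/style escape sequences from text."""
    return _ANSI_RE.sub("", text)
```

An alternative is to ask Cargo for plain output in the first place, for example with `["cargo", "build", "--color", "never"]`.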

In [None]:
# TODO: Implement feedback loop
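
As a starting point, the control flow of such a loop can be sketched independently of the model and of Cargo. The function below is only one possible shape, under stated assumptions: `feedback_loop`, `generate` and `check` are hypothetical names, `generate` maps a prompt to a candidate translation, and `check` returns a `(success, error_message)` pair. The wording of the retry prompt is also an assumption.

```python
from typing import Callable, Optional, Tuple

def feedback_loop(
    generate: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    initial_prompt: str,
    max_trials: int = 5,
) -> Tuple[Optional[str], int]:
    """Returns (accepted_code, trials_used); accepted_code is None if all trials failed."""
    prompt = initial_prompt
    for trial in range(1, max_trials + 1):
        candidate = generate(prompt)
        ok, error = check(candidate)
        if ok:
            return candidate, trial
        # Feed the failed attempt and its error message back to the model
        prompt = (
            f"{initial_prompt}\n"
            f"Your previous attempt was:\n{candidate}\n"
            f"It failed with this error:\n{error}\n"
            "Fix the code. Output the code only."
        )
    return None, max_trials
```

In this notebook, `generate` could wrap `model_inference` for a chosen model, and `check` could write the candidate with `write_rust_translation`, recreate the Cargo project, and call `run_cargo_command` with `["cargo", "build"]`, treating a zero return code as success.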

#### EXTRA: Unit tests

Feedback can also come from unit tests. You can ask the LLM to generate the translation with some unit tests and run them with the following command:

In [41]:
run_cargo_command(project_path, ["cargo", "test"])


Running command: cargo test


   Compiling rust_project v0.1.0 (/eos/home-i01/a/avalenzu/SWAN_projects/icsc2025/rust_project)



running 1 test
test tests::it_works ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Command completed successfully.


    Finished `test` profile [unoptimized + debuginfo] target(s) in 6.16s
     Running unittests src/lib.rs (target/debug/deps/rust_project-2e2d0f4e9ca5464c)
   Doc-tests rust_project


In [None]:
# TODO: Try the feedback loop with unit test execution.
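
To use test results as feedback, the loop needs to know whether any test failed. One option is to rely on the return code alone; another is to read the summary lines that `cargo test` prints. The following is a minimal sketch of the second option. It assumes the `test result: ...` line format seen in the output above, and `summarize_cargo_tests` is a hypothetical helper name.

```python
import re

# Assumes summary lines like "test result: ok. 1 passed; 0 failed; ..."
_SUMMARY_RE = re.compile(r"test result: (\w+)\. (\d+) passed; (\d+) failed")

def summarize_cargo_tests(stdout: str) -> dict:
    """Hypothetical helper: adds up passed/failed counts over all summary lines."""
    summaries = _SUMMARY_RE.findall(stdout)
    passed = sum(int(n_passed) for _, n_passed, _ in summaries)
    failed = sum(int(n_failed) for _, _, n_failed in summaries)
    # No summary line at all (e.g. a compilation error) is not counted as success
    ok = bool(summaries) and all(status == "ok" for status, _, _ in summaries)
    return {"passed": passed, "failed": failed, "ok": ok}
```

When `ok` is false, the captured output can be sent back to the model in the same way as a compiler error.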

#### EXTRA: More complex C functions

Once you are done, I encourage you to try the feedback loop on more complex C functions. [CodeTransOcean](https://arxiv.org/abs/2310.04951) is a code transpilation benchmark with data obtained from competitive programming websites. The goal is to access this data, available on [Hugging Face](https://huggingface.co/datasets/aandvalenzuela/CodeTransOcean), and run the feedback loop approach on it.
**Do you foresee any ways to augment the LLM prompt to aid in the transpilation?**

In [66]:
# !pip install --user datasets

In [67]:
from datasets import load_dataset

In [69]:
dataset = load_dataset("aandvalenzuela/CodeTransOcean")
data_dict = {i: item for i, item in enumerate(dataset['train'])}  # Convert train split to dict

print(data_dict[0].keys())

filtered_niche_train.jsonl:   0%|          | 0.00/1.15M [00:00<?, ?B/s]

filtered_niche_valid.jsonl:   0%|          | 0.00/295k [00:00<?, ?B/s]

filtered_niche_test.jsonl:   0%|          | 0.00/710k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/538 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/95 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/191 [00:00<?, ? examples/s]

dict_keys(['name', 'C', 'Rust'])


C functions can be accessed as follows:

In [70]:
print(data_dict[0]["C"])

#include <stdio.h>

int main()
{
  char is_open[100] = { 0 };
  int pass, door;

  
  for (pass = 0; pass < 100; ++pass)
    for (door = pass; door < 100; door += pass+1)
      is_open[door] = !is_open[door];

  
  for (door = 0; door < 100; ++door)
    printf("door #%d is %s.\n", door+1, (is_open[door]? "open" : "closed"));

  return 0;
}



In [None]:
# TODO: Run feedback loop on more complex functions.
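
One possible way to augment the prompt, offered only as an idea to explore, is to prepend a few solved C-to-Rust pairs taken from other dataset entries. The sketch below assumes entries are dictionaries with the `C` and `Rust` keys shown above; `build_few_shot_prompt` and the `<example>`, `<c>` and `<rust>` tags are hypothetical choices, not something the dataset or the models require.

```python
def build_few_shot_prompt(target_c: str, examples: list) -> str:
    """Hypothetical helper: builds a translation prompt with optional solved examples.

    examples: dicts with 'C' and 'Rust' keys, taken from entries other than the target.
    """
    parts = ["Translate the given code snippet from C to Rust. Output the code only.\n"]
    for example in examples:
        parts.append(
            "<example>\n"
            f"<c>\n{example['C']}\n</c>\n"
            f"<rust>\n{example['Rust']}\n</rust>\n"
            "</example>\n"
        )
    # The snippet to translate goes last, in the same <code> tags as the initial prompt
    parts.append(f"<code>\n{target_c}\n</code>")
    return "\n".join(parts)
```

For instance, `build_few_shot_prompt(data_dict[5]["C"], [data_dict[0], data_dict[1]])` would use the first two entries as examples. The reference Rust of the target entry itself should stay out of the prompt, otherwise the answer is leaked.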

## Thank you for participating!

Feel free to contact me at: andrea.valenzuela@bsc.es