
The Influence of Artificial Intelligence Tools on Learning Outcomes in Computer Programming: A Systematic Review and Meta-Analysis

发表:2025/05/09

TL;DR 精炼摘要

这项系统综述和荟萃分析探讨了 ChatGPT 和 GitHub Copilot 等人工智能工具对计算机编程课程学习成果的影响。研究分析了 2020 至 2024 年间发表的 35 项对照研究,结果显示使用 AI 工具的学生在任务完成时间和表现评分上显著优于未使用者,但在学习成功和理解难易度方面没有显著优势。学生普遍对 AI 工具持积极态度,研究强调需要量身定制的教学策略,以优化 AI 辅助学习的有效性。

摘要

This systematic review and meta-analysis investigates the impact of artificial intelligence (AI) tools, including ChatGPT 3.5 and GitHub Copilot, on learning outcomes in computer programming courses. A total of 35 controlled studies published between 2020 and 2024 were analysed to assess the effectiveness of AI-assisted learning. The results indicate that students using AI tools outperformed those without such aids. The meta-analysis findings revealed that AI-assisted learning significantly reduced task completion time (SMD = −0.69, 95% CI [−2.13, −0.74], I² = 95%, p = 0.34) and improved student performance scores (SMD = 0.86, 95% CI [0.36, 1.37], p = 0.0008, I² = 54%). However, AI tools did not provide a statistically significant advantage in learning success or ease of understanding (SMD = 0.16, 95% CI [−0.23, 0.55], p = 0.41, I² = 55%), with sensitivity analysis suggesting result variability. Student perceptions of AI tools were overwhelmingly positive, with a pooled estimate of 1.0 (95% CI [0.92, 1.00], I² = 0%). While AI tools enhance computer programming proficiency and efficiency, their effectiveness depends on factors such as tool functionality and course design. To maximise benefits and mitigate over-reliance, tailored pedagogical strategies are essential. This study underscores the transformative role of AI in computer programming education and provides evidence-based insights for optimising AI-assisted learning.

论文精读

中文精读

1. 论文基本信息

1.1. 标题

The Influence of Artificial Intelligence Tools on Learning Outcomes in Computer Programming: A Systematic Review and Meta-Analysis (人工智能工具对计算机编程学习成果的影响:一项系统综述与荟萃分析)

1.2. 作者

Manal Alanazi、Ben Soh(通讯作者)、Halima Samra、Alice Li

  • Manal Alanazi: 澳大利亚乐卓博大学 (La Trobe University) 计算机科学与信息技术系。
  • Ben Soh: 澳大利亚乐卓博大学 (La Trobe University) 计算机科学与信息技术系 (通讯作者)。
  • Halima Samra: 沙特阿拉伯阿卜杜勒阿齐兹国王大学 (King Abdulaziz University) 计算机科学与信息技术系。
  • Alice Li: 澳大利亚乐卓博大学 (La Trobe University) 乐卓博商学院。

1.3. 发表期刊/会议

Computers 2025, 14, 185 (期刊)

1.4. 发表年份

2025

1.5. 摘要

这篇系统综述 (systematic review) 和荟萃分析 (meta-analysis) 调查了包括 ChatGPT 3.5 和 GitHub Copilot 在内的人工智能 (AI) 工具对计算机编程课程学习成果的影响。研究分析了 2020 年至 2024 年间发表的共 35 项对照研究,以评估 AI 辅助学习的有效性。结果表明,使用 AI 工具的学生表现优于未使用这些辅助工具的学生。荟萃分析 (meta-analysis) 发现,AI 辅助学习显著缩短了任务完成时间 (SMD = -0.69, 95% CI [-2.13, -0.74], I² = 95%, p = 0.34),并提高了学生表现评分 (SMD = 0.86, 95% CI [0.36, 1.37], p = 0.0008, I² = 54%)。然而,AI 工具在学习成功或理解难易度方面并未提供统计学上的显著优势 (SMD = 0.16, 95% CI [-0.23, 0.55], p = 0.41, I² = 55%),敏感性分析 (sensitivity analysis) 表明结果存在变异性。学生对 AI 工具的看法普遍积极,汇总估计值为 1.0 (95% CI [0.92, 1.00], I² = 0%)。虽然 AI 工具能增强计算机编程能力和效率,但其有效性取决于工具功能和课程设计等因素。为了最大限度地发挥效益并减少过度依赖,量身定制的教学策略至关重要。本研究强调了 AI 在计算机编程教育中的变革作用,并为优化 AI 辅助学习提供了循证见解。

1.6. 原文链接

/files/papers/69399122983c0e5b0e997cb5/paper.pdf (已正式发表)

2. 整体概括

2.1. 研究背景与动机

计算机编程 (computer programming) 因其复杂的语法、逻辑推理和问题解决特性,往往给学生,尤其是在入门课程中的学生带来巨大挑战。许多学生难以将理论编程概念与其在代码中的实际实现联系起来。传统的教学方法可能难以提供个性化关注。

人工智能 (AI) 已成为教育领域的变革力量,通过个性化和自动化提供增强学习的创新方法。特别是像 ChatGPT 这样的对话代理和 GitHub Copilot 这样的代码自动完成工具,因其潜力而受到广泛关注。智能辅导系统 (Intelligent Tutoring Systems, ITSs) 长期以来被用于创建个性化学习体验,但近年来,像 ChatGPT 这样能够生成类人对话回复的 AI 模型,因其高度灵活性和处理广泛学生查询的能力,也被应用于编程教育中。

尽管有研究支持 AI 在教育中的作用,但目前缺乏系统且全面的综述,特别是对来自受控实验研究 (controlled experimental studies) 的证据进行综合分析。现有 AI 研究常因样本量小、实验控制不一致以及 AI 工具功能差异等局限性而受到批评,这挑战了研究结果的普遍性。这些文献空白凸显了进行系统综述和荟萃分析 (meta-analysis) 的必要性,以整合来自多样化受控实验研究的证据,更清晰地理解 AI 工具,特别是 ChatGPT,如何影响入门级编程课程的学习成果。

2.2. 核心贡献/主要发现

本研究的主要贡献和关键发现包括:

  • 综合证据: 本文综合分析了 2020 年至 2024 年间进行的 35 项受控研究,对 AI 辅助学习在入门级计算机编程课程中的作用进行了稳健评估。
  • 量化影响: 研究量化了 AI 工具 (如 ChatGPT 和 GitHub Copilot) 对学生表现、任务完成时间 (task completion time) 和感知理解难易度 (perceived ease of understanding) 的影响,提供了其教育有效性的统计学见解。
    • AI 辅助学习显著缩短了任务完成时间 (SMD = -0.69)。
    • AI 工具显著提高了学生表现评分 (SMD = 0.86)。
    • AI 工具对学习成功或理解难易度没有统计学上的显著优势 (SMD = 0.16)。
    • 学生对 AI 工具的看法普遍积极 (pooled estimate = 1.0)。
  • 强调平衡集成: 尽管 AI 工具在提高任务效率和编程表现方面表现出色,但研究发现其对学生的概念理解和整体学习成功的影响有限,这强调了平衡集成的重要性。
  • 学生接受度高: 汇总反馈表明学生对 AI 工具的接受度高且感知有用性强,突显了其在教育环境中的潜在价值。
  • 提供实践指导: 本研究为教育工作者提供了可操作的指导,推荐采用适应性教学策略,以利用 AI 的优势,同时减轻过度依赖等风险。
  • 奠定未来研究基础: 通过识别现有文献中的空白和当前 AI 辅助学习方法的局限性,本研究为未来 AI 驱动的教育实践研究奠定了基础。

3. 预备知识与相关工作

3.1. 基础概念

为了更好地理解这篇系统综述和荟萃分析,我们需要了解以下核心概念:

  • 人工智能工具 (Artificial Intelligence Tools, AI tools): 指利用人工智能技术来辅助完成特定任务的软件或系统。在本文中,特指用于计算机编程教育的工具,如 ChatGPT (一种大型语言模型,能够生成文本、回答问题、提供代码解释等) 和 GitHub Copilot (一种 AI 辅助编码工具,能够根据上下文提供代码建议、自动完成代码等)。
  • 系统综述 (Systematic Review): 是一种通过系统性地识别、评估和综合所有与特定研究问题相关的实证研究来总结现有证据的研究方法。它遵循预定义的协议,以减少偏倚并提供可靠的证据。
  • 荟萃分析 (Meta-Analysis): 是一种统计学方法,用于将多个独立研究的结果进行定量合并和分析,从而得出一个更精确、更全面的总体估计。荟萃分析通常作为系统综述的组成部分,特别是在有足够数量和同质性的定量研究时。
  • 受控实验研究 (Controlled Experimental Studies): 指至少包含一个实验组 (experimental group) 和一个对照组 (control group) 的研究设计。实验组接受某种干预 (intervention) (例如,使用 AI 工具),而对照组不接受或接受不同的干预 (例如,传统教学方法),旨在比较干预的效果。
  • 学习成果 (Learning Outcomes): 指学生通过学习过程所获得的知识、技能、能力和态度。在本文中,具体衡量指标包括任务完成时间、学生表现评分、学习成功度、理解难易度以及学生对 AI 工具的感知。
  • 标准化均差 (Standardized Mean Difference, SMD): 是一种效应量 (effect size) 指标,用于衡量两个组(通常是实验组和对照组)之间均值差异的大小,并以标准差 (standard deviation) 为单位进行标准化。当不同研究使用不同的量表测量相同结果时,SMD 尤其有用(本列表之后附有一个简短的计算示例)。其计算公式为 $\mathrm{SMD} = \frac{\bar{X}_1 - \bar{X}_2}{S_p}$,其中:
    • $\bar{X}_1$ 是实验组的平均值。
    • $\bar{X}_2$ 是对照组的平均值。
    • $S_p$ 是合并标准差 (pooled standard deviation),表示两组数据的共同变异程度。
  • 95% 置信区间 (95% Confidence Interval, 95% CI): 是一个范围,表示真实总体参数(如 SMD 的真实值)有 95% 的概率落在这个区间内。如果置信区间不包含 0,则通常认为结果具有统计学意义。
  • 异质性 (Heterogeneity): 指荟萃分析中不同研究之间结果的差异性。
  • I² 统计量 (I² statistic): 是一个衡量荟萃分析中研究间异质性大小的指标,表示总变异中由真实效应差异而非随机误差引起的比例,计算公式为 $I^2 = 100\% \times \frac{Q - df}{Q}$,其中:
    • $Q$ 是 Cochran's 异质性统计量,用于检验研究间是否存在异质性。
    • $df$ 是自由度 (degrees of freedom),通常等于研究数量减 1。
    • I² 值通常解释为:0–40% 可能不重要;30–60% 可能为中等异质性;50–90% 可能为实质性异质性;75–100% 可能为显著异质性。
  • p 值 (p-value): 在统计假设检验中,$p$ 值是在零假设 (null hypothesis) 为真的前提下,观察到当前数据(或更极端数据)的概率。通常,$p$ 值小于 0.05 ($p < 0.05$) 被认为具有统计学显著性,意味着有足够的证据拒绝零假设。
  • 敏感性分析 (Sensitivity Analysis): 在荟萃分析中,通过改变一些假设或排除某些研究(例如,排除高偏倚风险的研究)来重复分析,以评估结果的稳健性。
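为直观说明上述 SMD 与 I² 的计算方式,下面给出一个简短的 Python 示意;其中的均值、标准差与 Q 值均为虚构数据,仅用于演示公式本身,并非原文的分析代码。

```python
import math

def smd(mean_exp, sd_exp, n_exp, mean_ctl, sd_ctl, n_ctl):
    """标准化均差:两组均值之差除以合并标准差 S_p。"""
    pooled_sd = math.sqrt(((n_exp - 1) * sd_exp ** 2 + (n_ctl - 1) * sd_ctl ** 2)
                          / (n_exp + n_ctl - 2))
    return (mean_exp - mean_ctl) / pooled_sd

def i_squared(q, df):
    """I² 统计量:I² = 100% × (Q − df) / Q;当 Q 不大于 df 时取 0。"""
    return max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0

# 虚构示例:实验组均值 82、SD 10、n = 30;对照组均值 75、SD 12、n = 30
print(round(smd(82, 10, 30, 75, 12, 30), 2))  # ≈ 0.63,为中等偏上的正向效应
print(round(i_squared(q=20.0, df=9), 1))       # ≈ 55.0,提示实质性异质性
```

如正文所述,若某个效应量的 95% 置信区间不包含 0,通常即认为该结果具有统计学意义。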

3.2. 前人工作

文章在引言中回顾了 AI 在教育,特别是编程教育中的应用,并提及了一些重要的前人工作:

  • 智能辅导系统 (Intelligent Tutoring Systems, ITSs): 长期以来被用于创建个性化学习体验,尤其在 STEM 学科如数学和计算机科学中被证明能有效改善学生学习成果 [4]。这些系统能动态适应个体学习进度并提供量身定制的反馈。
  • GPT-3 及类似 AI 模型: 近期,像 ChatGPT 这样能够生成类人对话回复的 AI 模型,已应用于教育环境中进行问题解决、代码调试和概念学习 [5]。与 ITSs 不同,ChatGPT 等聊天机器人式 AI 工具不依赖于固定的预设回复,使其高度灵活。
  • GPT-3 在编程辅导中的应用: 初步研究表明,GPT-3 驱动的资源可以增强学生的解决问题能力、提高参与度,并加深对计算机编程概念的理解 [6, 7]。
  • 担忧与局限: 尽管有积极结果,但研究也引发了对过度依赖 AI 生成解决方案的担忧,这可能导致对基本编程逻辑和批判性思维技能的关注减少 [8]。此外,AI 研究常因样本量小、实验控制不一致和工具功能差异而受到批评 [9, 10]。

3.3. 技术演进

AI 在编程教育中的应用经历了从早期相对固定的规则系统到如今高度灵活、生成式模型的演进:

  1. 智能辅导系统 (ITSs) 阶段: 早期 AI 在教育中的应用主要集中在 ITSs。这些系统通常基于预定义的规则和知识库,能够根据学生的输入提供个性化反馈和指导。它们擅长特定领域的知识传递和技能训练,但在灵活性和处理开放性问题方面存在局限。

  2. 代码辅助工具阶段: 随着机器学习和数据科学的发展,出现了像代码自动完成、静态代码分析工具等。这些工具能提供实时的语法检查、错误提示和代码建议,提高了编程效率,但仍主要集中在代码的局部优化和辅助。

  3. 生成式 AI (Generative AI) 阶段: 近年来,以大型语言模型 (Large Language Models, LLMs) 为代表的生成式 AI 技术取得了突破性进展。ChatGPT 和 GitHub Copilot 等工具能够理解自然语言指令,生成高质量的代码片段、解释概念、调试代码,甚至提供复杂的解决方案。这些工具的特点是高度的灵活性、上下文感知能力和强大的生成能力,极大地扩展了 AI 在编程教育中的应用范围。

    本文的研究正是在生成式 AI 迅速普及的背景下进行的,旨在系统地评估这些新型 AI 工具对编程学习成果的实际影响,并与传统教学方法进行比较。

3.4. 差异化分析

本研究与现有相关工作相比,其核心区别和创新点在于:

  • 综合性与严谨性: 现有研究多为孤立的、小规模的实验,缺乏对该领域证据的全面、结构化综合。本研究通过系统综述和荟萃分析 (systematic review and meta-analysis) 的方法,从 35 项受控实验研究中聚合证据,提供了更全面、更稳健的结论,减少了单一研究的局限性。
  • 聚焦新兴 AI 工具: 特别关注了 ChatGPT 和 GitHub Copilot 等生成式 AI 工具,这些工具是近年来的热点,与传统的 ITSs 在功能和交互模式上存在显著差异。本研究填补了对这些新兴工具在编程教育中影响的系统性评估空白。
  • 多维度学习成果评估: 不仅评估了学生表现 (performance) 和任务效率 (efficiency),还深入探讨了学生对工具的感知 (perceptions)、学习成功度 (learning success) 和理解难易度 (ease of understanding),提供了对 AI 辅助学习更全面的视角。
  • 证据等级高: 通过严格筛选 受控实验设计 的研究,并进行质量评估和偏倚风险评估,确保了纳入研究的证据等级较高,从而提高了荟萃分析结果的可信度。
  • 揭示异质性并提供洞察: 研究不仅给出了总体效应量,还通过 I² 统计量揭示了研究间的显著异质性。这种异质性促使讨论深入分析了 AI 工具设计、学生经验水平、任务复杂性、课程设计等因素对有效性的影响,为未来的教学策略和工具开发提供了关键洞察。

4. 方法论

4.1. 方法原理

本研究采用系统综述和荟萃分析 (systematic review and meta-analysis) 的方法,旨在系统地识别、评估并综合关于人工智能 (AI) 工具对计算机编程学习成果影响的现有实证研究。这种方法论选择是为了克服单一研究的局限性(例如样本量小、实验控制不一致),通过量化合并多个研究的结果,提供一个更全面、更可靠的证据基础。研究遵循了 PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 指南,以确保透明度、可重复性和减少偏倚。

具体来说,方法原理包括:

  1. 明确研究问题: 聚焦 AI 工具(尤其是 ChatGPT 和 GitHub Copilot)对入门级计算机编程学习成果的影响。
  2. 系统文献检索: 在多个学术数据库中执行预定义好的搜索策略,以最大化地捕获所有相关的、高质量研究。
  3. 严格筛选研究: 根据预设的纳入和排除标准,对检索到的文献进行两阶段筛选,以确保只包含符合要求的受控实验研究。
  4. 数据提取: 从每个纳入研究中提取标准化数据,包括研究设计、参与者特征、干预细节、对照组信息和关键学习成果指标。
  5. 质量评估与偏倚风险评估: 使用专业的工具评估纳入研究的方法学质量和偏倚风险,以评估证据的可靠性。
  6. 荟萃分析: 运用统计学方法合并来自不同研究的效应量,计算出总体效应,并评估研究间的异质性。
  7. 结果解读与综合: 综合所有发现,讨论 AI 工具的有效性,其影响因素,以及对教育实践和未来研究的启示。

4.2. 核心方法详解

本研究严格遵循 PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 指南和 Kitchenham 提出的方法论 [12] 进行,确保了研究的透明度、严谨性和可重复性。

4.2.1. 资格标准 (Eligibility Criteria)

本系统综述和荟萃分析旨在评估 AI 工具(如 ChatGPT)对入门级计算机编程课程学习成果的影响。纳入研究需满足以下特定资格标准:

  • 同行评审 (Peer-reviewed): 仅包含经过同行评审的文献。
  • 数值数据 (Numerical data): 必须提供数值数据。
  • 明确的研究设计 (Well-defined study design): 仅考虑评估 AI 工具在计算机编程教育中有效性的受控实验 (controlled experiments) 研究。
  • 出版时间: 摘要中提及分析了 2020 年至 2024 年间发表的论文。方法部分提到搜索限制在 2000 年至 2024 年间,但纳入标准明确指出仅考虑自 2010 年以来发表的研究。实际纳入的 35 项研究均在 2020 年至 2024 年间发表。
  • 语言: 仅考虑英文发表的文献。

4.2.2. 检索策略 (Search Strategy)

研究在以下多个学术数据库中进行了全面而系统的检索:

  • Web of Science

  • Scopus

  • IEEE Xplore

  • ACM Digital Library

  • Lens.org

    检索策略结合了关键词 (keywords) 和布尔运算符 (Boolean operators) (AND, OR) 来细化结果。主要的搜索词包括:"Artificial Intelligence"、"AI tools"、"ChatGPT"、"learning outcomes"、"introductory programming"、"controlled experiments" 和 "educational technology"(按概念组拼接布尔检索式的思路见本小节末尾的示意代码)。

以下是部分数据库的检索字符串示例:

  • Web of Science: ("artificial intelligence" OR "AI" OR "ChatGPT") AND ("learning outcomes" OR "programming skills" OR "coding") AND ("introductory programming" OR "programming education") AND ("controlled experiment" OR "randomised controlled trial" OR "quasi-experimental")

  • Scopus/ACM: ("artificial intelligence" OR "AI" OR "ChatGPT") AND ("learning outcomes") AND ("introductory programming") AND ("experimental study")

  • IEEE: ("artificial intelligence" OR "AI" OR "ChatGPT") AND ("learning outcomes") AND ("introductory programming") AND ("experimental study") AND ("controlled experiment" OR "randomised controlled trial" OR "quasi-experimental")

    为了保持方法学的严谨性,检索语法根据各数据库的索引机制、搜索字段限制和布尔逻辑处理的差异进行了微调。尽管存在这些调整,核心搜索结构和关键词概念在所有数据库中保持一致,以确保透明度、可比性和检索策略的可重复性。
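下面用一小段 Python 说明如何按上述概念组用布尔运算符拼接检索式;词表沿用正文给出的关键词,函数与变量名为自拟示例,实际检索仍需按各数据库的字段与语法规则调整。

```python
# 示意:每个概念组内部用 OR 连接,组与组之间用 AND 连接
concept_blocks = {
    "AI":       ['"artificial intelligence"', '"AI"', '"ChatGPT"'],
    "outcomes": ['"learning outcomes"', '"programming skills"', '"coding"'],
    "context":  ['"introductory programming"', '"programming education"'],
    "design":   ['"controlled experiment"', '"randomised controlled trial"',
                 '"quasi-experimental"'],
}

def build_query(blocks):
    """把各概念组拼接为一条布尔检索式。"""
    groups = ["(" + " OR ".join(terms) + ")" for terms in blocks.values()]
    return " AND ".join(groups)

print(build_query(concept_blocks))
```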

4.2.3. 研究选择与筛选 (Study Selection and Screening)

研究选择过程遵循两阶段筛选方法:

  1. 第一阶段: 根据标题 (titles) 和摘要 (abstracts) 筛选相关性。

  2. 第二阶段: 对通过第一阶段筛选的文献进行全文审查 (full-text review),以确保其符合纳入标准。

    纳入条件包括:

  1. 受控实验设计,评估 AI 工具(如 ChatGPT)在计算机编程教育中的作用。

  2. 提供量化的学习成果指标 (quantitative learning outcome metrics)(例如,测试分数、任务完成率、代码质量)。

  3. 发表在同行评审期刊或会议上。

    排除标准包括:

  1. 缺乏实证数据。

  2. 非同行评审。

  3. 未具体研究 AI 辅助的计算机编程学习。

  4. 非英文发表的文献。

    PRISMA 2020 指南用于系统地记录检索、筛选、纳入和排除的记录数量,以确保透明度(下方给出将上述纳入/排除标准写成简单筛选规则的示意代码)。
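作为示意,可以把上述纳入/排除标准写成一个简单的筛选函数;字段名与候选记录均为虚构,仅用于说明判定逻辑,并非原文筛选工作的实际实现。

```python
def include(record):
    """返回 True 表示记录通过标准、进入下一阶段审查(简化的示意规则)。"""
    return (
        record.get("peer_reviewed", False)                 # 同行评审
        and record.get("controlled_design", False)         # 受控实验设计
        and record.get("quantitative_outcomes", False)     # 提供量化学习成果指标
        and record.get("language") == "English"            # 英文发表
        and record.get("ai_programming_learning", False)   # 研究 AI 辅助编程学习
    )

candidates = [
    {"title": "ChatGPT in CS1", "peer_reviewed": True, "controlled_design": True,
     "quantitative_outcomes": True, "language": "English",
     "ai_programming_learning": True},
    {"title": "AI in math tutoring", "peer_reviewed": True, "controlled_design": True,
     "quantitative_outcomes": True, "language": "English",
     "ai_programming_learning": False},
]
print([c["title"] for c in candidates if include(c)])  # 只保留第一条记录
```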

4.2.4. 数据提取 (Data Extraction)

数据提取包含了以下关键信息:

  • 研究作者。

  • 研究设计 (study design),例如随机对照试验 (randomised controlled trial, RCT)。

  • 研究的地理位置 (geographical location)。

  • 人口学特征 (population characteristics),例如年龄、教育背景和经验水平。

  • 干预细节 (intervention details),例如使用的特定 AI 工具(如 ChatGPT)。

  • 对照组信息 (control group information),例如对照组或比较组的性质。

  • 评估的成果 (outcomes evaluated),例如测试分数、表现指标和学习改进。

    这些数据都被仔细记录,以便在后续的荟萃分析中进行比较和综合(一种可能的提取记录结构见下方的示意代码)。
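下面用一个简单的数据类示意上述提取字段可能的组织方式;字段名与示例取值均为自拟,并非原文数据提取表的实际结构。

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractionRecord:
    authors: str                    # 研究作者
    study_design: str               # 研究设计,如 "RCT"、"quasi-experimental"
    region: str                     # 地理位置
    population: str                 # 人口学特征(年龄、教育背景、经验水平)
    intervention: str               # 干预细节,如所用的具体 AI 工具
    control: str                    # 对照组信息
    outcomes: list[str] = field(default_factory=list)  # 评估的成果指标
    effect_size: Optional[float] = None                # 若可计算,记录 SMD

record = ExtractionRecord(
    authors="Example et al., 2023",
    study_design="RCT",
    region="United States",
    population="本科新生,无编程经验",
    intervention="ChatGPT 3.5 辅助完成编程作业",
    control="仅使用传统学习资源",
    outcomes=["测试分数", "任务完成时间"],
)
print(record.study_design, record.outcomes)
```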

4.2.5. 纳入研究的质量评估 (Quality Assessment of Included Studies)

纳入研究的质量评估使用了既定的工具:

  • Newcastle-Ottawa 量表 (NOS, Newcastle-Ottawa Scale): 用于评估非随机对照试验 (non-randomised controlled trials) 的偏倚风险,主要关注选择偏倚 (selection bias)、可比性 (comparability) 和结果评估 (outcome assessment)。
  • JBI 清单 (JBI Checklist): 用于评估实验研究的方法学质量,确保严格的研究设计和报告。
  • 问卷调查: 设计了一份问卷,以评估研究中测量的结果是否适当、有效、可靠,并与入门级编程的学习成果相关。这确保了只有高质量研究才被纳入系统综述和荟萃分析。

4.2.6. 偏倚风险评估 (Risk of Bias Assessment)

偏倚风险 (risk of bias) 使用 Cochrane 的 RoB 2 (Risk of Bias 2) 工具进行评估,重点关注以下领域:

  • 选择偏倚 (Selection bias)

  • 实施偏倚 (Performance bias)

  • 检测偏倚 (Detection bias)

  • 报告偏倚 (Reporting bias)

  • 其他潜在偏倚 (Other potential biases)

    对于每个领域,回答了信号问题 (signalling questions),并将偏倚风险等级分配为:低 (low)、存在一些担忧 (some concerns) 或高 (high)(下方给出将各领域判定汇总为总体等级的简化示意代码)。
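下面给出一个把各领域判定汇总为总体风险等级的简化示意;这只是常见的保守汇总思路,并非 RoB 2 的官方判定算法,领域名称沿用上文列表。

```python
def overall_risk(domains):
    """domains: {领域: "low" | "some concerns" | "high"},返回总体风险等级。"""
    levels = list(domains.values())
    if "high" in levels:
        return "high"            # 任一领域为高风险,则总体为高风险
    if "some concerns" in levels:
        return "some concerns"   # 否则只要存在担忧,总体即为"存在一些担忧"
    return "low"

study = {
    "selection bias": "low",
    "performance bias": "some concerns",
    "detection bias": "low",
    "reporting bias": "low",
    "other potential biases": "low",
}
print(overall_risk(study))  # -> "some concerns"
```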

4.2.7. 统计分析 (Statistical Analysis)

荟萃分析使用 Review Manager 5、Stata 17 和 SPSS v27 软件进行。

  • 效应量计算 (Effect size computation): 对于每个研究,计算了效应量,其中 标准化均差 (Standardized Mean Differences, SMDs)均差 (Mean Differences, MDs) 用于量化 AI 工具对计算机编程学习成果的影响。
  • 模型选择 (Model selection): 采用了随机效应模型 (random-effects model) 来解释研究间的变异性。
  • 异质性评估 (Heterogeneity assessment): 使用 I² 统计量进行,I² 值超过 50% 表明存在实质性异质性。
  • 亚组分析 (Subgroup analyses): 进行了亚组分析,以探索潜在的调节因素,包括研究设计、AI 工具类型和参与者特征(尽管在结果中并未详细呈现)。
  • 发表偏倚评估 (Publication bias assessment): 使用漏斗图 (funnel plot) 和 Egger 检验 (Egger's test)。
  • 敏感性分析 (Sensitivity analysis): 通过系统性地移除高风险研究来评估研究结果的稳健性。
  • 报告与统计显著性 (Reporting and statistical significance): 所有效应量均报告 95% 置信区间 (CIs),统计显著性设定为 $p < 0.05$,以确保报告的可重复性和透明度(随机效应合并与留一法敏感性分析的流程见下方示意代码)。
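下面给出一个按随机效应模型 (DerSimonian–Laird 法) 合并各研究 SMD、并做留一法敏感性分析的 Python 示意;效应量与标准误均为虚构数据,仅用于说明第 4.2.7 节描述的流程,并非原文所用软件 (Review Manager/Stata/SPSS) 的实现。

```python
import math

def pool_random_effects(effects, ses):
    """返回 (合并效应量, 95% CI 下限, 95% CI 上限, I²)。"""
    w = [1 / se ** 2 for se in ses]                                # 固定效应权重
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                                  # 研究间方差 τ²
    w_re = [1 / (se ** 2 + tau2) for se in ses]                    # 随机效应权重
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled, i2

effects = [0.9, 1.2, 0.4, 0.8]        # 各研究的 SMD(虚构)
ses     = [0.30, 0.35, 0.25, 0.40]    # 对应的标准误(虚构)
print(pool_random_effects(effects, ses))

# 留一法敏感性分析:逐一剔除一项研究,观察合并效应量是否稳健
for i in range(len(effects)):
    rest_e = effects[:i] + effects[i + 1:]
    rest_s = ses[:i] + ses[i + 1:]
    print("去除研究", i + 1, "后的合并 SMD ≈",
          round(pool_random_effects(rest_e, rest_s)[0], 2))
```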

5. 实验设置

5.1. 数据集

本研究的“数据集”是由其严格筛选后纳入的 35 项受控实验研究组成的。这些研究的特征构成了荟萃分析的基础。

  • 纳入研究数量: 35 项研究 (根据摘要和 Table A1 计数)。

  • 发表时间: 2020 年至 2024 年之间。

  • 研究设计: 包含随机对照试验 (RCTs) 和准实验设计 (quasi-experimental designs),其中准实验设计是最常用的方法。

  • 研究工具: 涵盖了 AI 驱动的工具,包括 ChatGPT、GitHub Copilot 和其他自动化编码助手,它们在编程教育中用于提供实时反馈、调试支持和代码生成。

  • 地理分布: 研究地理分布广泛,包括美国 (14 项研究)、欧洲 (10 项研究) 和亚洲 (5 项研究)。

  • 目标人群: 大多数研究针对本科生,但也包括高中生和参加在线计算机编程课程的成人学习者。

    以下是原文 Table A1 中包含的 35 项研究的详细特征:

    N Authors Study Design Study Region Population Characteristics Intervention Details Control Group Details Outcomes Measured
    1 [46] Pre-test-post-test quasi-experimental design. Hacettepe University, Ankara, Turkey. A total of 42 senior university students (28 males, 14 females), volunteers. Pair programming in the experimental group during a six-week implementation. Use of AI tools: (1) No Solo programming performed in the control group. Flow experience, coding quality, and coding achievement.
    2 [13] Mixed-methods study (work-in-progress). United States. Participants: Introductory Java programming students at a large public university in the United States. (1) No external help. (2) AI chatbot. (3) Help of a generative AI tool, such as GitHub Copilot. Condition (1): No external help. Programming skills, students' experiences, and perceptions of AI tools. Improvement in coding skills (measured by coding assessments). User satisfaction (on perceived effectiveness). Addressing challenges of in-person and remote pair programming collaboration.
    3 [47] Mixed-methods approach (quantitative and qualitative evaluation). University of Basilicata, Potenza, Italy. University programming students, randomly assigned to experimental and control groups. The experimental group used a Mixed Reality (MR) application with a conversational virtual avatar for pair programming. Followed traditional in-person pair programming methods. Coding assessments). User satisfaction (on perceived effectiveness). Addressing challenges of in-person and remote pair programming collaboration.
    4 [14] Mixed-methods approach, combining quantitative and qualitative analysis to evaluate the integration of generative AI (GenAI) in educational forums. Purdue University. Conducted in four first- and second-year computer science courses during the Spring 2024 semester, each with substantial enrolments (~200 students per course). Participants included teaching assistants (TAs) and approximately 800 undergraduate students in first- and second-year computer programming courses. A generative AI platform, BoilerTAI, was used by one designated TA per course (AI-TA) to respond to student discussion board queries. AI-TAs exclusively used the platform. Remaining TAs and students were in a controlled experimental environment. Responses were presented as if written solely by the AI-TA. Efficiency: Overall, 75% of AI-TAs reported improvements in response efficiency. Response Quality: Overall, 100% of AI-TAs noted improved quality of responses. Student Reception: Positive responses from students to AI-generated replies (~75% of AI-TAs observed favourable feedback). Human Oversight: AI-TAs required significant modifications to AI responses approximately 50% of the time.
    5 [48] Type: Mixed-methods study reporting quantitative and qualitative data analysis. Duration: Conducted over a 12-week semester during the September 2022 offering of a CS1 course. Focus: Investigated the usage and perceptions of PythonTA, an educational static analysis tool. University of Toronto. Conducted at a large public research-oriented institution in a first-year computer science (CS1) course. A total of 896 students (42.5% men, 39.3% women, 1% non-binary, 17.2% undisclosed), with diverse ethnic backgrounds, enrolled in an introductory programming course (CS1). Integration of PythonTA, a static analysis tool, in programming assignments; 10-20% of grades tied to PythonTA outputs; students encouraged to use PythonTA locally and via an autograder. Implicit subgroups based on prior programming experience: None (Novices): No prior programming experience. Course: Formal programming education. Potential for Integration: Overall, 75% of AI-TAs agreed on the potential for broader integration of BoilerTAI into educational practices. PythonTA usage frequency. Self-efficacy in responding to PythonTA feedback. Perceived helpfulness of PythonTA. Changes in coding practices and confidence over the semester.
    6 [28] Controlled experiment to measure programming skills through course scores and programming task performance. Anhui Polytechnic University, China. University setting with focus on third-year and fourth-year students. A total of 124 third-year undergraduate students and 5 fourth-year students. Background: Students studied C and Java programming languages as part of their curriculum. Programming tasks (four levels): Included Java and C programming tasks with varying difficulty. Tasks assessed correctness and efficiency, using EduCoder platform for evaluation. Efficiency score calculated based on task completion time and accuracy. Senior students served as a comparative baseline. Their performance on tasks 3 and 4 was compared with that of juniors to evaluate skill differences. Correlation between course scores and programming skills. Identification of courses significantly related to programming skills (e.g., software engineering). Analysis of task completion rates, correctness, and efficiency scores. Comparison of junior and senior programming performance.
    7 [15] The study introduced CodeCoach, a programming assistance tool developed to enhance the teaching and learning experience in programming education. It focused on addressing coding challenges faced by students in constrained lab environments. Virtual lab environment. Sri Lanka Institute of Information Technology (SLIIT) in Malabe, Sri Lanka. A total of 15 students and 2 instructors who participated in a virtual lab session. The participants' performance and the tool's effectiveness were assessed during the session. The primary intervention was the CodeCoach tool. The tool utilised the GPT-3.5 AI model to provide tailored hints and resolve programming errors. Key features included community forums, lab management capabilities, support for 38 programming languages, and automated evaluation for coding challenges. The research compared outcomes with traditional instructional methods (e.g., direct instructor feedback and error correction). It highlighted the limitations of non-automated feedback methods due to time constraints and lack of scalability. The other performance with the tool was implicitly compared to traditional learning methods without such tools. Effectiveness of the tool in enhancing programming learning. Student and instructor satisfaction with the tool. Number of AI-generated hints used by students. Individual student progress during coding tasks. Improvement in logical thinking and problem-solving skills related to programming Accuracy: How well the explanation described the code's functionality.
    8 [21] The study employed a comparative experimental design to evaluate code explanations generated by students and those generated by GPT-3. The primary focus was on three factors: accuracy, understandability, and length. The study involved two lab sessions and integrated thematic and statistical analyses. First-year programming course at The University of Auckland, with data collected during lab sessions over a one-week period. Participants: Approximately 1000 first-year programming students enrolled in the course. Experience Level: Novices in programming. Language Covered: C programming language. Tasks Assigned: Creating and evaluating code explanations for provided function definitions. Treatment Group: Exposed to GPT-3-generated code explanations during Lab B. Content: Students evaluated and rated explanations (both GPT-3 and student-created) on accuracy, understandability, and length. Prompts: Students were also asked to provide open-ended responses on the characteristics of useful code explanations. Comparison Group: Student-generated code explanations from Lab A, created without AI assistance. Evaluation: Students rated both their peers' and GPT-3's explanations to enable comparison. Understandability: Ease of comprehending the explanation. Length: Whether the explanation was considered ideal in length. Thematic Analysis: Students' perceptions of what makes a good explanation (e.g., detailed, line-by-line descriptions). Effect Sizes: Differences in quality metrics (using Mann—Whitney U tests) between student and GPT-3 explanations. Preferences: Students preferred GPT-3 explanations for their clarity and precision.
    9 [49] A comparative design was used to assess the quality of code explanations created by students versus those generated by GPT-3. It involved a two-phase data collection process (Lab A and Lab B), with students evaluating explanations based on their accuracy, understandability, and length. The University of Auckland, specifically in a first-year programming course. Participants: Approximately 1000 novice programming students. Experience Level: Beginner-level programming skills. Focus Area: Understanding and explaining code in the C programming language. The students in Lab B were presented with code explanations generated by GPT-3 alongside explanations created by their peers from Lab A. The students evaluated the explanations based on criteria such as accuracy, understandability, and length. The GPT-3-generated explanations were designed to serve as an example to scaffold the students' ability to explain code effectively. The control group consisted of student-generated code explanations from Lab A. These explanations were rated and compared against those generated by GPT-3 in Lab B. Accuracy: The correctness of the explanation in describing the code's purpose and functionality. Understandability: The ease with which students could comprehend the explanation. Length: Whether the explanation was of an appropriate length or meaning. Student Preferences: Insights into what students value in a "good" code explanation (e.g., clarity, detail, structure). Quantitative Metrics: Statistically significant differences were observed, showing that GPT-3 explanations were rated higher for accuracy and understandability but had similar ratings for length compared to student explanations.
    10 [22] The study employed a quasi-experimental design to evaluate the effects of ChatGPT intelligent feedback compared to instructor manual feedback on students' collaborative programming performance, behaviours, and perceptions. The research combined quantitative and qualitative methods, including learning analytics approaches, thematic analysis, and performance assessment, to investigate the cognitive, regulative, and behavioural aspects of the intervention. A graduate-level face-to-face course titled "Smart Marine Metastructure" at a top university in China during the summer of 2023. Participants: A total of 55 participants, including 13 doctoral and 42 master students. Grouping: The students were arranged into 27 groups (2-3 students per group) based on their pre-course programming knowledge scores, balancing groups with higher and lower scorers. Programming Context: The course emphasised advanced applications of artificial intelligence in ocean engineering, using Matlab for collaborative programming tasks. Experimental Group: A total of 13 groups received ChatGPT intelligent feedback, delivered as both textual and video feedback created by a virtual character. Feedback Mechanism: ChatGPT was prompted to assess Matlab code and generate specific suggestions for improvement across five dimensions. Control Group: A total of 14 groups received instructor manual feedback, which provided text-based evaluations in five dimensions: correctness, readability, efficiency, maintainability, and compliance. Performance Metrics: Post-course programming knowledge (assessed out of 100 points). Quality of group-level programming products (evaluated by the instructor). Collaborative Programming Behaviours: Analysed using learning analytics (e.g., clickstream analysis, epistemic network analysis). Focused on cognitive-oriented discourse (control) and regulation-oriented discourse (experimental). Perceptions of Feedback: Students' self-reported strengths and weaknesses through thematic analysis. Behavioural Analysis: Frequency and patterns of collaborative monitoring, code writing.
    11 [36] A controlled, between-subjects experiment was used to assess the impact of ChatGPT usage on the learning of first-year undergraduate computer science students during an Object-Oriented Programming course. Students were divided into two groups: one encouraged to use ChatGPT (treatment group) and the other discouraged from using it (control group). Performance on lab assignments, midterm exams, and overall course grades was evaluated using statistical analysis. University of Maribor, Faculty of Electrical Engineering and Computer Science (FERI), Slovenia, during the spring semester of 2023. Participants: A total of 182 first-year undergraduate computer science students, aged approximately 19.5 years, with 85.9% identifying as male, 12.4% as female, and 1.7% undisclosed. Division: Randomly divided into two groups, Group I (ChatGPT users) and Group II (non-ChatGPT users), each initially containing 99 participants. Group I (Treatment): Students were encouraged to use ChatGPT for practical programming assignments. Assignments were adjusted to reduce reliance on ChatGPT, including modifications such as minimal textual instructions, provided UML diagrams, and extension tasks requiring independent effort during lab sessions. Defences: Students defended their assignments in lab sessions through interactive questioning, ensuring comprehension and reducing reliance on ChatGPT. Exams: Paper-based midterm exams assessed Object-Oriented Programming knowledge without allowing ChatGPT or other programming aids. Feedback Questionnaires: Weekly and final feedback questionnaires collected data on ChatGPT usage, assignment complexity, and student perceptions. Group II (Control): Students were instructed not to use ChatGPT for assignments. Weekly questionnaires confirmed adherence, and eight participants who reported ChatGPT usage were excluded from the analysis. Assignments: The same assignments as Group I, with adjustments to prevent reliance on AI-based direct code generation. Lab assignment performance was assessed weekly, with success rates calculated as percentages of completed mandatory and optional assignments. Results: No statistically significant difference in lab performance between Group I (65.27%) and Group II (66.72%). Midterm Exam Results: Two paper-based exams assessing theoretical and practical programming knowledge. Results: No significant difference between Group I and Group II (Group I: 65.96%, Group II: 66.58%). Overall Course Grades: Combination of midterm exam scores (50%) and lab assignments (50%). Results: No statistically significant difference in overall success between Group I (65.93%) and Group II (66.61%). Student Perceptions: Group I participants reported benefits, such as occasional inaccuracies and reduced learning engagement. Usage Trends: Group I students primarily used ChatGPT for code optimisation and comparison rather than direct code generation.
    12 [19] The study employed a controlled experimental A/B design to examine the impact of ChatGPT 3.5 on CS1 (introductory computer science) student learning outcomes, behaviours, and resource utilisation. Participants were divided into two groups: experimental (with ChatGPT access) and control (without ChatGPT access). The experiment was conducted in Spring 2023 at a private, research-intensive university in North America offering a Java-based CS1 course. Initially, 56 students were recruited, but 48 participants submitted valid screen recordings. Experimental Group: Twenty-three students. Control Group: Thirty-three students. Demographics: The groups were balanced in terms of Java programming experience and midterm scores (Mann-Whitney U-tests showed no significant differences). Tasks: The students designed UML diagrams, implemented Java programming tasks, and completed a closed-book post-evaluation. Experimental Group: Access to ChatGPT 3.5 and other online resources (e.g., Google, Stack Overflow). Allowed to consult ChatGPT for task completion but were not obligated to use it. Tasks: UML Diagram: Create class structures, including relationships, fields, and methods. Java Programming: Implement class skeletons based on UML diagrams. Post-Evaluation: Answer conceptual and coding-based questions testing Object-Oriented Programming (OOP) principles. Control Group: Access to all online resources except ChatGPT. Similar tasks and requirements as the experimental group, ensuring consistency. Learning Outcomes: Graded on UML diagrams (40 points), programming tasks (40 points), and post-evaluation quizzes (8 points). Time spent on tasks (UML design, programming, post-evaluation) was also recorded. No statistically significant differences in performance or completion time were observed between groups. Resource Utilisation: ChatGPT Group: Relied predominantly on ChatGPT, with minimal use of traditional resources (e.g., lecture slides, course materials, or Piazza). Non-ChatGPT Group: Utilised a broader range of educational resources. Perceptions: The post-survey captured attitudes toward ChatGPT's utility, ethical concerns, and reliability implications. Most participants exhibited neutral or slightly positive attitudes with significant concerns about over-reliance and ethical implications.
    13 [50] The study used a quasi-experimental design with two groups to evaluate the impact of automatically generated visual next-step hints compared to textual feedback alone in a block-based programming environment (Scratch). The study aimed to assess the effects of these hints on students' motivation, progression, help-seeking behaviour, and comprehension. The study was conducted in Bavaria, Germany, at a secondary school with two seventh-grade classes. Both classes were taught Scratch programming by the same teacher. A total of 41 students aged 12-13 participated. Two cohorts: One class (19 students) received next-step visual hints (treatment group), and the other (22 students) received only textual feedback (control group). Four participants were excluded, resulting in 37 valid cases (15 in the treatment group and 22 in the control group). Treatment Group: Received textual feedback about failing tests and visual next-step hints generated by the Catnip system, suggesting specific code changes (e.g., adding, deleting, or moving blocks). Hints were shown automatically after a failed test when students clicked the "Test" button. Task: Students followed a Scratch tutorial to create a simple game involving navigating a boat to avoid hitting walls. Tutorial steps included implementing functionalities such as player movement and collision detection. The activity lasted 25 min, followed by a post-test survey and comprehension tasks. Received only textual feedback describing which tests had failed, without visual hints or next-step guidance. The tasks, setup, and duration were identical to the treatment group, ensuring comparability. The study did not include a traditional control group with strict AI restrictions. Instead, the analysis compared student submissions that adhered to the policy (i.e., reflections completed) versus those that did not. Motivation: Measured using a five-point Likert scale in a post-test survey. Focused on students' enjoyment and confidence in completing the tasks. Progression: Tracked by the number of tutorial steps completed over time. Analysed using automated tests to determine whether students passed specific tutorial milestones. Help-Seeking Behaviour: The number of clicks on the "Test" button (indicating hint usage). The number of help requests directed to teachers. Comprehension: Evaluated through three post-task comprehension questions requiring students to identify correct code segments from provided options. AI Usage Patterns: Types of questions posed to ChatGPT (e.g., debugging vs. seeking full solutions). How students integrate AI-generated content into their submissions.
    14 [39] The study employed a descriptive and exploratory design to evaluate the impact of a permissive policy allowing students to use ChatGPT and other AI tools for programming assignments in CS1 and CS2 courses. The research focused on understanding how students in the United States utilised these tools, their learning outcomes, and reflective learning perceptions. Surveys, reflective learning forms, and coding scheme analysis were utilised to gather data on students' behaviour and attitudes toward AI tools. A private Midwestern university in the United States, Spring 2023 semester. Students: Enrolled in one CS1 and one CS2 section. Participants included students at varying levels of familiarity with AI tools. Sample Size: A total of 40 learning reflections were submitted by the students across multiple assignments. Pre- and post-semester surveys captured student perspectives on AI use. Demographics: Participants' attitudes and experiences with ChatGPT varied, allowing the study to capture a diverse range of perspectives. AI Permissive Policy: Students could use ChatGPT and similar tools freely for assignments. Requirement: Students were required to submit a reflective learning form documenting the following: AI chat transcripts. Use of AI-generated content in submissions. Reflections on what they learned from AI interactions. Survey Data: Pre- and post-semester surveys gathered data on students' familiarity with AI, their use cases, and perceptions of academic honesty and policy. Pre-semester attitudes with post-semester changes to assess shifts in perception and behaviour. Learning Evidence: Analysed student reflections for understanding of programming concepts. Identified cases where AI interactions led to meaningful learning versus over-reliance on AI. Student Attitudes: Pre- and post-semester surveys captured changes in how students viewed AI tools ("useful" vs. "cheat"). Opinions on institutional AI policies. Challenges Identified: Cases of improper use or over-reliance on AI. Instances where AI solutions hindered learning (e.g., solving problems above students' skill levels). Recommendations: Strategies for promoting responsible and ethical use of AI in education.
    15 [18] Type: Controlled 2 × 2 between-subject experiment. Tasks: Two primary tasks: Coding Puzzles: Solve coding problems of moderate difficulty on an online platform with automated judging. Typical Development Task: Fix bugs in a small, self-contained project. Mid-size public university in Beijing, China, focusing on Information and Communication Technology (ICT). Total Participants: A total of 109 computing majors. Demographics: A total of 89 males and 20 females. Aged 20-26 years. Education Level: A total of 7 undergraduates, 102 postgraduate students. Experience: At least one year of professional software development, mostly as interns in major companies (e.g., Baidu, Tencent, Microsoft). The participants were randomly assigned to one of four groups. Intervention Details: Participants in the intervention groups had access to ChatGPT (GPT-3.5) for task assistance. Tasks were completed in environments supporting code writing and debugging. ChatGPT accounts and internet access were provided for groups using the tool. Procedure: Tasks were completed within a 75 min timeframe. Interaction with ChatGPT was logged and analysed for insights. Tutoring System: Socratic or "teaching tips" style interactions. Assignment Guidelines: Students could request full or partial solutions from ChatGPT, debug their code, or seek help with isolated programming concepts. Students were encouraged to formulate their own solutions. Control Group Details: Participants in the control groups completed tasks without ChatGPT assistance but could use the internet as a resource. Efficiency: Time taken to complete tasks. Solution Quality: Assessed using a set of test cases. Subjective Perception: Measured through post-experiment surveys (e.g., perceived utility of ChatGPT). Task Load: Evaluated using NASA-TLX scales for workload. Interaction Patterns: Analysed from ChatGPT logs for insights into collaboration dynamics.
    16 [24] Pilot study using a pre-test-post-test design. Purpose: To evaluate the effectiveness of the Socratic Tutor (S.T.), an intelligent tutoring system, in improving programming comprehension and self-efficacy among novice programmers. An urban university in Southeast Asia. Conducted in a computer lab. Participants: A total of 34 computer science students enrolled in introductory programming courses. Demographics: Background questionnaire and a self-efficacy survey completed by all participants. Knowledge levels: Participants divided into two groups based on pre-test scores (TOP and BOTTOM). Intervention Details: A Socratic ITS inspired by the Socratic method. Features: Programming language independence, scaffolding through guided questions, three-level feedback system. Session: A 60 min tutoring session with nine Java code examples. Tasks: Students analysed code, predicted outputs, and engaged in Socratic dialogues to address misconceptions. Feedback levels: Level 1: Conceptual explanation. Level 2: Fill-in-the-blank. Level 3: Multiple-choice question hints. No explicit control group was used in the study; all participants received the same intervention. Pre-test scores served as a baseline for measuring improvements. Learning Gains (LGs): Computed based on pre-test and post-test scores (each with nine Java programs requiring output prediction). An average improvement of 12.58% (from 75.82% pre-test to 88.4% post-test; LG score: 52.03%). Greater gains observed in the BOTTOM group (lower prior knowledge) compared to the TOP group (higher prior knowledge). Self-Efficacy: Measured via an 11-item survey on programming concepts. Participants with higher self-efficacy showed significant advantages. Feedback Effectiveness: Success rates for each feedback level.
    17 [51] A controlled experiment aimed at quantifying and comparing the impact of manual and automated feedback on programming assignments. It involved three distinct conditions: feedback from teaching assistants (TAs) only; feedback from an Automated Assessment Tool (AAT) only; and feedback from both TAs and the AAT. The study evaluated these conditions based on objective task effectiveness and subjective student perspectives. The Bachelor of Software Development program at the IT University of Copenhagen (ITU), Denmark, during the first semester of 2022. Participants: A total of 117 undergraduate first-semester students (20% women). Programming Background: Overall, 33% had little or no prior experience; 45% had limited experience; and 22% had prior programming experience. Context: The participants were enrolled in an introductory programming course (CS1). Intervention Details: Programming Task: Solve a modified "FizzBuzz" problem, requiring nested if-else statements and Boolean conditions. Feedback Conditions: TAs Only—Students had access to formative feedback from qualified teaching assistants. AAT Only—Students received summative pass/fail feedback via the Kattis platform. TAs + AAT—Students could access both formative (TAs) and summative (AAT) feedback. Duration: A total of 1 h (with an additional 10 min, if needed). Metrics Recorded: Objective—Correctness (unit test results) and task duration, code smells (via SonarCube). Subjective—Student-reported frustration, unmet assistance needs, and feedback preferences. Each feedback condition served as a control for the others. Random assignment to conditions ensured balance: TAs Only—39 students; AAT Only—42 students; and TAs + AAT—36 students. Objective Metrics Correctness: Percentage of successful unit test cases. Students with both TAs and AAT feedback performed significantly better (p = 0.0024). Duration: Time spent on the assignment decreased from TAs Only to TAs + AAT (p = 0.028). Code Smells: Fewer code smells were found in solutions from the TAs + AAT condition compared to the TAs Only condition (p = 0.015). Subjective Metrics Frustration: Women reported higher frustration overall, particularly under the AAT Only condition (p = 0.068). Unmet Assistance Needs: Women in the AAT Only condition reported significantly more unmet assistance needs compared to the TAs or TAs + AAT conditions (p = 0.0083). Preferences: Women preferred TAs over AAT for feedback (p = 0.0023), while men exhibited no clear preference.
    18 [44] The study used a mixed-methods design, integrating quantitative experiments and qualitative thematic analysis to assess the impact of prompt-tuned generative AI (GenAI) conversational agents on computational thinking (CT) learning outcomes and usability. It used an ABBA experimental design to compare control and intervention phases, as well as reflection reports for in-depth insights into long-term use. A university in Switzerland, within a software design class during the fall semester of 2023. Participants: A total of 23 undergraduate students were initially enrolled; 21 completed the study. Demographics: A total of 8 females and 15 males. Background: Students in their third year of bachelor's studies in business and economics, with varying programming experience: 17 students reported prior programming knowledge; 2 reported no experience; 2 did not respond. Previous Interaction with ChatGPT: Average of 6.11 months of use prior to the course. Intervention Details: Condition B (intervention): Labs 2 and 3 with the CT-prompt-tuned Graasp Bot. This bot was a generative AI conversational agent tuned with specific prompts to guide students in computational thinking. Control Group Details: Condition A (control): Labs 1 and 4 with a default, non-configured Graasp Bot. This served as a baseline AI agent without specialized CT prompting. Usability Metrics (seven-point Likert scale): Usefulness. Ease of Use. Learning Assistance. Learning Outcomes: Lab assignment scores, normalised to a 100-point scale. Accuracy rates for exercises attempted. Attitudes Towards Chatbots: Measured using the General Attitude Towards Robots Scale (GAToRS) before and after the study. Reflection Reports: Student perceptions of strengths and limitations of ChatGPT and Graasp Bot. Interaction Logs: Number and nature of interactions with ChatGPT and Graasp Bot.
    19 [43] Mixed-methods, large-scale, controlled study with 120 participants. Three academic institutions in the United States: Northeastern University (R1 University), Oberlin College (Liberal Arts College), and Wellesley College (Women's College). Participants: University students who had completed a single introductory computer science course (CS1). The population included first-generation students, domestic and international students, and participants from diverse academic and demographic backgrounds. Intervention Details: Participants interacted with Codex via a web application called "Charlie the Coding Cow". Tasks included creating natural language prompts for 48 problems divided into 8 categories. Problems were designed at the CS1 skill level, and correctness was tested automatically. Control Group Details: There was no traditional control group; all participants interacted with the intervention (Charlie). Comparisons were made within the group based on variables such as prior experience, demographics, and problem difficulty. Success rate: Fraction of successful attempts at solving problems. Eventual success rate: Final success after multiple attempts. Pass@1: Probability of success from prompt sampling. Surveys and interviews captured perceptions and strategies.
    20 [41] Between-subjects design. The experimental group used ChatGPT exclusively, while the control group used other resources except genAI tools. Software engineering courses and universities EPIC Lab. Participants: Undergraduate software engineering students (N = 22) with low to medium familiarity with Git, GitHub, and Python. Intervention Details: ChatGPT was used for completing three tasks related to software engineering (debugging, removing code smell, using GitHub). Task Structure: Students used ChatGPT to solve tasks. Control Group Details: Participants in the control group could use any online resources except genAI tools for the same tasks. Productivity (task correctness), self-efficacy, cognitive load (NASA TLX), frustration, and participants' perceptions of ChatGPT's faults and interactions.
    21 [52] Type: Survey-based research using quantitative and qualitative methods. Focus: To understand students' use patterns and perceptions of ChatGPT in the context of introductory programming exercises. Structure: Students completed programming tasks with ChatGPT-3.5 and then responded to an online survey regarding their usage patterns and perceptions. Research Questions: What do students report on their use patterns of ChatGPT in the context of introductory programming exercises? How do students perceive ChatGPT in the context of introductory programming exercises? Goethe University Frankfurt, Germany. Timeframe: Winter term 2023/24 (starting from 6 December 2023). Sample Size: A total of 298 computing students enrolled in an introductory programming course. Demographics: The majority were novice programmers: 34% had no programming experience, 43% had less than one year, 17% had 1-2 years, and 6% had over three years of experience. The majority had prior experience using ChatGPT (84%). Intervention Details: Students completed a newly designed exercise sheet comprising tasks involving recursion, lists, functions, and conditionals. Tasks required interpreting recursive code, solving algorithmic problems, and generating code with optional recursion. Tool: ChatGPT-3.5, accessed via the free version. Instructions: The students used ChatGPT independently without structured guidance, except for a link to OpenAI's guide on prompt engineering. They were asked to record all prompts and responses as paired entries for submission. Control Group Details: Not Applicable: The study did not include a formal control group for comparison, as all students in the course used ChatGPT-3.5 as part of the task. Use Patterns: Frequency, duration, and purpose of ChatGPT usage during the programming exercises. Common use cases included problem understanding, debugging, and generating documentation. Perceptions: Students' evaluations of ChatGPT's ease of use, accuracy, relevance, and effectiveness. Analysis of positive and negative experiences through Likert-scale responses and open-ended survey answers. Challenges Identified: Over-reliance, inaccuracies in AI-generated responses, and the need for critical engagement. Pedagogical Implications: Insights for promoting and guiding the use of GenAI tools in programming education.
    22 [53] Mixed-methods research combining thematic analysis and quantitative analysis. Focus: The study explored how novice programmers aged 10-17 use and interact with AI code generators, such as OpenAI Codex, while learning Python programming in a self-paced online environment. Structure: Ten 90 min sessions. Recruited from coding camps in two major cities. Sample Size: A total of 33 novice learners in the experimental group (19 in the study). Age Range: Aged 10-17 years old (mean = 13.7, SD = 1.7). Demographics: A total of 11 females and 22 males. A total of 25 participants were English speakers. Attended the sessions remotely. Experience: None of the participants had prior experience with text-based programming. Intervention Details: Tool: AI code generator based on OpenAI Codex embedded in the Coding Steps IDE. Environment: Self-paced online Python learning platform providing the following: A total of 45 programming tasks with increasing difficulty. AI code generator for generating solutions. Real-time feedback from remote instructors. Python documentation, worked examples. Control Group Details: Condition: Participants in the control group had no access to the AI code generator and completed the tasks manually using only the provided materials. Retention post-test scores (one week after the study). Immediate post-test scores. Behavioral aspects (AI code generation, code modification). 40 multiple-choice questions on programming concepts. No AI access was allowed during the evaluation phase. Behavioural Analysis: Frequency and context of AI code generator use. Coding approaches (e.g., AI Single Prompt, AI Step-by-Step, Hybrid, Manual). Code Quality: Properties of AI-generated code (correctness, complexity, alignment with curriculum). Utilization patterns of AI-generated code (e.g., verification, placement, modification).
    23 [54] Mixed-methods research combining formative studies, in-lab user evaluations, and computational assessments. Objective: To evaluate the effectiveness of HypoCompass in training novice programmers on hypothesis construction for debugging, using explicit and implicit scaffolding. A private research institution in the United States. Main Study: A total of 12 undergraduate and graduate students with basic Python programming knowledge but limited expertise. Screening Survey: Of 28 students, 12 were selected for the study. Background: The average age was 22.5, with nine females, three males, and seven non-native English speakers. Pilot Studies: Conducted with eight additional students. Intervention Details: HypoCompass: An LLM-augmented interactive tutoring system designed to complete the following: Simulate debugging tasks with buggy codes generated by LLMs. Facilitate hypothesis construction through role-playing as teaching assistants (TAs). Provide feedback and scaffolding via hints, explanation pools, and code fixes. Process: Participants completed pre- and post-tests designed to evaluate their debugging solutions. Tasks included creating test suites, hypothesising about bugs, and revising faulty code. Duration: Each session lasted ~1 h, including pre-survey, interaction, post-test, and post-survey. Control Group Details: Preliminary Control Conditions: Control—LLM: Practice materials generated by LLMs without HypoCompass. Control—Conventional: Traditional debugging exercises from a CS1 course. Preliminary results suggested HypoCompass outperformed the controls in learning gains, but larger-scale studies were planned. Quantitative Metrics: Pre- to post-test performance: Improvement: Significant 16.79% increase in debugging accuracy. Efficiency: A 12.82% reduction in task completion time. Hypothesis construction: Comprehensive (LO1): Marginal 2.50% improvement; 26.05% time reduction. Accurate (LO2): Significant 27.50% improvement; 7.09% time reduction. Qualitative Feedback: Students found HypoCompass engaging, helpful, and motivating for debugging. Feedback highlighted the usefulness of specific explanations. Concerns included the potential for over-reliance on scaffolding and UI preferences for natural coding environments. Instructor Insights: HypoCompass could supplement CS1 education. Suggestions included modularising features for easy integration.
    24 [55] Mixed-methods research incorporating quantitative user studies (controlled experiments) and qualitative surveys and interviews. A university setting, involving undergraduate computer science courses, specifically first- and third-year programming and software engineering students. Participants: First-year programming students learning foundational coding skills. Third-year software engineering students contributing to the development of an intelligent tutoring system (ITS). Tutors involved in grading and feedback. The participants were divided into two groups. Intervention Details: An ITS integrating automated program repair (APR) and error localisation techniques was deployed. First-year students used ITS for debugging assistance and feedback on programming assignments. Third-year students incrementally developed and enhanced ITS components as part of their software engineering projects. The experimental group received GPT-generated hints via the GPT-3.5 API. Control Group Details: Group B (control): First-year students completed programming tasks using conventional tools without access to an ITS. Compared against Group A, who received ITS feedback and guidance during tasks. For first-year students: Performance metrics (number of attempts, success rates, rectification rates, and rectification time for solving programming tasks). Feedback satisfaction and usefulness from surveys. For tutors: Usability and satisfaction with error localisation and feedback tools, alongside grading support. For third-year students: Experience and skill development from contributing to ITS projects.
    25 [56] The study employed a controlled experimental design comparing two groups: an experimental group with access to GPT-generated hints and a control group without access to GPT-generated hints. The Warsaw University of Life Sciences (Poland), specifically within an Object-Oriented Programming course. Participants: Second-semester computer science students enrolled in an Object-Oriented Programming course. Out of 174 students in the course, 132 students consented to participate. These students were familiar with the RunCode platform, as it had been used in a previous semester's programming course. The control group contained 66 students. The experimental group contained 66 students. A pre-test established that the two groups had no significant differences in baseline knowledge. Intervention Details: GPT-generated hints were integrated into the RunCode platform for 38 out of 46 programming assignments. These hints provided explanations of errors, debugging tips, and suggestions for code improvement. The GPT feedback was dynamically generated in Polish and tailored to the submitted code, compiler errors, runtime errors, or unit test failures. The hints emphasised meaningful insights without revealing the correct code solution. The experimental group rated the usefulness of the GPT hints on a five-point Likert scale. Control Group Details: The control group had access only to the platform's regular feedback, which included details about compiler errors, runtime errors, and unit test results. The control group did not receive GPT-generated hints and relied on standard feedback to resolve issues in their submissions. Immediate performance: Percentage of successful submissions on consecutive attempts. Learning efficiency: Time taken to solve assignments. Reliance on feedback: Usage of the platform's regular feedback (non-GPT) for tasks with and without GPT-generated hints. Affective state: Emotional states (e.g., focused, frustrated, bored) reported during task completion. Impact of GPT feedback absence: Performance on tasks without GPT-generated hints after prior exposure to GPT-enabled tasks. User satisfaction: Perceived usefulness of GPT-generated hints (rated on a Likert scale).
    26 [57] Mixed-methods (qualitative and quantitative) study. Purpose: To investigate the characteristics influencing students' programming ability based on their course scores and questionnaire responses. Design: Analysis of course scores, questionnaire data, and expert evaluations. A large public university in Taiwan, specifically in a Department of Computer Science. Participants: 160 university students, predominantly from computer science and information engineering backgrounds. Intervention Details: The students' course scores from programming-related classes were collected to serve as a benchmark for programming ability. The study investigated the following: Relationships between self-rated experience, confidence, and performance. The students' course scores from programming-related websites. Control Group Details: The study did not employ a traditional control group but instead used a comparative analysis approach. Questionnaire responses were compared to course scores to identify relationships. A list of top-performing students, based on expert evaluations during the symposium, was used to validate the findings. Significant factors influencing programming ability: Number of project modules (p.NumModule). Number of programming-related websites visited (s.NumSites). Correlation between course scores and questionnaire responses: Identified significant correlations, e.g., between course averages and project experience. Validation of findings: Comparison of top-performing students (based on expert lists) with results from a regression model. Development of a regression model: Predicting programming ability using the two identified indicators. Practical recommendations: The study provided insights for improving educational programs and recruitment strategies based on programming ability metrics.
    27 [58] Comparative study using a quantitative approach. Objective: To measure the confidence and self-rated experience levels of undergraduate and graduate students in software engineering when performing specific software tasks. It compares these self-assessments against actual performance to determine their accuracy and predictive power. A university in Germany, focusing on students enrolled in bachelor-level courses, primarily in systems engineering, information technology, and computer science. Participants: 154 students (100 undergraduates, 54 graduates) enrolled in bachelor-level courses, primarily in systems engineering, information technology, and computer science. Intervention Details: Participants were given a series of 10 software engineering tasks. Tasks involved interpreting natural language stakeholder statements, represented in given models, by evaluating the accuracy of the models against the statements. Participants rated their confidence in their answers using a five-point scale. Self-rate experience post-task via a questionnaire on a five-point scale. Control Group Details: Comparison Groups: Undergraduates vs. graduates: To assess differences in self-perception, confidence, and performance. Correct vs. incorrect answers: To evaluate confidence ratings. All participants underwent identical experimental tasks tailored to their academic level. Correctness: Whether a task was completed correctly. Performance: The ratio of correct answers to total tasks. Confidence: Self-rated confidence on task-level answers. Self-rated experience: Average score based on post-task questionnaires. Graduate vs. Undergraduate Comparisons: Performance. Confidence in correct and incorrect answers. Accuracy of self-rated experience relative to performance. Key Findings: Confidence: A good predictor of task success, regardless of academic level. Self-rated Experience: Not correlated with performance, making it an unreliable predictor. Graduate vs. Undergraduate: Graduates performed better and rated their experience higher. No significant difference in confidence accuracy between the groups.
    28 [59] Mixed-methods observational study. Purpose: Evaluate the suitability and impact of ChatGPT on students' learning during a five-week introductory Java programming course. Research Questions (RQs): Effect of ChatGPT on learning progress. Suitability for implementation tasks and learning programming concepts. Effort required to adapt ChatGPT-generated code to programming exercises. Application scenarios for ChatGPT use. Reasons for not using ChatGPT. A university offering bachelor's programs in information security. The course was part of a formal curriculum. Participants: A total of 18-22 part-time undergraduate students. Demographics: Students enrolled in a bachelor's program in information security. Experience: The students had completed a previous semester's Python programming course. Intervention Details: Duration: Five weeks. Course Structure: Five on-campus lectures. Five programming exercises covering the following: Object-Oriented Programming (OOP); Interfaces and Exception Handling; Collections; File I/O and Streams; Lambda Expressions and Multithreading. Exercises were submitted online for grading and feedback. Use of ChatGPT: Voluntary use for exercise preparation. ChatGPT versions: GPT 3.5 (66.6%) and GPT 4.0 (33.3%). Control Group Details: No explicit control group was included. However, some students chose not to use ChatGPT, providing a natural comparison. Non-users cited the following: Desire to develop programming skills independently. Concerns about misleading or insufficient code. Preference for traditional learning methods. Learning Progress: No significant correlation between exercises and perceived learning progress (p = 0.2311). Suitability for Tasks: Implementation Tasks: Mixed reviews; suitability varied by exercise. No significant relationship between exercises and ratings (p = 0.4928). Learning Programming Concepts: Predominantly rated suitable or rather suitable. Statistically significant relationship with exercises (p = 0.0001). Adaptation Effort: Minimal effort required to adapt ChatGPT-generated code to tasks. No significant correlation between exercises and adaptation effort (p = 0.3666). Application Scenarios: Common uses: Acquiring background knowledge (68%). Learning syntax and concepts (56%). Suggesting algorithms (47%). Used least for reviewing own solutions (28%). Reasons for Non-Use: Concerns about proficiency development. Misleading or incorrect outputs. Preference for independent work. Fundamental rejection of AI tools.
    29 [16] Type: Controlled experimental study. Objective: To evaluate the productivity effects of GitHub Copilot, an AI-powered pair programmer, on professional software developers. Conducted remotely; the participants were recruited globally through Upwork. Setting: Tasks were administered via GitHub Classroom. Participants: A total of 95 professional software developers recruited via Upwork; 35 completed the task. Age: The majority were aged 25-34. Geographic Distribution: Primarily from India and Pakistan. Education: Predominantly college-educated. Coding Experience: An average of 6 years. Workload: An average of 9 h of coding per day. Income: Median annual income between USD 10,000 and USD 19,000. Intervention Details: Tool: GitHub Copilot, an AI pair programmer powered by OpenAI Codex. Task: Participants were asked to implement an HTTP server in JavaScript as quickly as possible. Process: The treated group was provided with GitHub Copilot and brief (1 min) installation instructions and was free to use any additional resources, such as internet search and Stack Overflow. Task Administration: A template repository with a skeleton codebase and a test suite was provided. Performance metrics were tracked using timestamps from GitHub Classroom. Control Group Details: Comparison: The treated group used GitHub Copilot. The control group relied on traditional methods, including internet resources and their own skills. Task Structure: Both groups were tasked with completing the same programming task under identical conditions (other than access to GitHub Copilot). The control group did not have access to GitHub Copilot. They could use external resources, such as internet search and Stack Overflow. Task Completion Time: The treated group completed tasks 55.8% faster on average (71.17 min vs. 160.89 min). The improvement was statistically significant (p = 0.0017). Task Success: The success rate was 75%, though this was not statistically significant. Heterogeneous Effects: Developers with less coding experience, those aged 25-44, and those coding longer daily hours benefited the most. Self-Reported Productivity Gains: The treated and control groups estimated an average monthly willingness to pay for Copilot (USD 16.91). Economic Implications: Potential for AI tools, such as Copilot, to broaden access to software development careers by supporting less experienced developers.
    30 [42] Randomised controlled experiment. Objective: To evaluate the effectiveness of student-AI collaborative feedback (hint-writing) on students' learning outcomes in an online graduate-level data science course. Conditions: Baseline: Students independently write hints. AI Assistance: Students write hints with on-demand access to GPT-4-generated hints. AI Revision: Students write hints independently, review GPT-4-generated hints, and revise their hints. University of Michigan. Course: Online Masters of Applied Data Science program, Data Manipulation course. Participants: Adult learners with introductory knowledge of Python programming and statistics. Total Participants: A total of 97 students took the pre-test; 62 completed both the pre- and post-tests. The students were randomly assigned to the following groups: Baseline (20 students). AI-Assistance (20 students, after propensity score matching). AI-Revision (15 students). Demographics: Graduate students with varying levels of programming proficiency. Intervention Details: Task: Students compared a correct solution to an incorrect solution for a programming assignment and wrote hints to guide the correction of errors. Baseline: Students wrote hints independently. AI-Assistance: Students could access GPT-4-generated hints at any time while writing. AI-Revision: Students wrote independently first, reviewed GPT-4-generated hints, and revised their hints. Programming Tools: JupyterLab and Python. Assignment Grading: Programming assignments were automatically graded using the Nbgrader tool. Implementation: Assignments included an example task with guidance on writing effective hints. Incorrect solutions were selected using a similarity-based metric from a repository of prior incorrect submissions. Control Group Details: Group: Baseline condition. Task: Students wrote hints independently, with no AI support. Learning Outcomes: Pre-test: Assessed Python programming knowledge (10 MCQs, non-graded). Post-test: Assessed debugging and data manipulation skills (six MCQs, graded, worth 5% of the course grade). Findings: AI-Revision showed higher post-test scores on assessment and debugging, but the difference was not statistically significant (p = 0.18). AI-Assistance showed the lowest mean scores, indicating potential over-reliance on AI hints. Student Engagement: Positive feedback on hint-writing assignments, especially in the AI-Revision condition. Students valued the activity for improving debugging and critical thinking skills. Behavioural Insights: AI-Revision promoted critical evaluation and refinement of hints, enhancing learning. AI-Assistance encouraged reliance on AI-generated content, reducing independent effort.
    31 [60] Type: Crossover experimental study. Objective: To evaluate the effectiveness of just-in-time teaching interventions in improving the pedagogical practices of teaching assistants (TAs) during online one-on-one programming tutoring sessions. Intervention Duration: Participants received interventions immediately before each tutoring session, with each session lasting approximately one hour. Key Variables: Independent Variable: Presence or absence of the just-in-time teaching intervention. Dependent Variables: Duration and proportion of productive teaching events, tutor talk time, and self-reported perceptions. A university computer science department. Environment: Online setting using a simulated tutoring scenario. Participants: A total of 46 university students. Composition: Graduate and undergraduate computer science students. Recruitment: Recruited from department mailing lists. Demographics: Mix of experienced and novice tutors, with diverse teaching interests and abilities. Intervention Details: Treatment Group: Shown a "teaching tips" screen before the tutoring session. Included pedagogical advice, including the following: Asking open-ended questions. Checking for student understanding. Encouraging the student to talk more during the session. Information about each student's lecture attendance. Tutoring Task: Participants role-played as tutors for an introductory programming task (FizzBuzz). Sessions featured a researcher acting as a student with two versions of buggy code to ensure variety. Control Group Details: Control Group: Shown a logistical tips screen focusing on meeting setup and technical instructions (e.g., camera and microphone settings). Received no pedagogical advice. Exposed only to logistical reminders during the session. Primary Outcomes: Productive Teaching Events: Time spent engaging students with effective teaching techniques. Proportion of session duration devoted to productive interactions. Tutor Talk Time: Ratio of tutor to student speaking time. Secondary Outcomes: Participants' ability to transfer learned teaching behaviours to subsequent sessions. Perceived usefulness of the intervention from participant interviews. Key Findings: Treatment group spent significantly more time in productive teaching events (1.4 times increase, Cohen's d = 0.72). Treatment significantly reduced tutor talk time, increasing opportunities for student participation (p < 0.05). Evidence of behaviour transfer to subsequent sessions was inconclusive but self-reported by 16 of 22 treatment-first participants. Student Performance: Not directly measured but impacted by tutoring quality.
    32 [61] Controlled quasi-experiment. Duration: Ten weeks. Purpose: Investigate the impacts of different designs of automated formative feedback on student performance, interaction with the feedback system, and perception of the feedback. A large university in the Pacific Northwest of the United States. Participants: A total of 76 students enrolled in a CS2 course. Group Assignment: Students were randomly assigned to three different lab sections, each treated as a group. Characteristics: The study included diverse participants in terms of programming experience and demographics (not explicitly detailed in the study). Intervention Details: Feedback Types: Knowledge of Results (KR): Information on whether a test case passed or failed. Knowledge of Correct Responses (KCR): KR + detailed comparisons between expected and actual outputs. Elaborated Feedback (EF): KCR + one-level hints addressing common mistakes with additional explanations for misconceptions. Feedback Delivery: Automated feedback delivered through a system integrating GitHub, Gradle, and Travis-CI. Group Assignments: Group KR: Received KR feedback (25 students). Group KCR: Received KR + KCR feedback (25 students). Group EF: Received KR + KCR + EF feedback (26 students). Control Group Details: Baseline: Group KR served as the control group, receiving the least detailed feedback. Absence of a no-feedback group: Deliberately excluded as research shows providing no feedback is less effective. Outcomes Measured: Measured by the percentage of passed cases across three programming assignments. Student Interaction with the Feedback System: Metrics: Number of feedback requests (pushes to GitHub). Efforts evidenced by the number of changed lines of code. Behavioural Patterns: How students interacted with and utilised feedback. Student Perceptions: Assessed via five-point Likert scale survey and open-ended questions addressing the following: Frequency of feedback use. Expectedness of feedback. Likes/dislikes about the feedback system. Suggestions for improvement.
    33 [23] Pre-test-post-test quasi-experimental design. Two universities in North Cyprus. A total of 50 undergraduate students. Intervention Details: Experimental Group: Used ChatGPT for solving quizzes. The experimental group was taught HTML programming using gamification elements (points, leaderboards, badges, levels, progress bars, rewards, avatars). Control Group Details: Performed the quizzes without ChatGPT assistance. The control group was taught HTML programming using the traditional teaching method. Outcomes Measured: Comparison of AI-assisted vs. manual performance. Programming skills (HTML tags, paragraphs, lists, multimedia, hyperlinks).
    34 [62] Quasi-experimental design with two groups (experimental and control). Saudi Arabia. Participants: Tenth-grade female students (N = 37), randomly assigned into experimental (N = 19) and control (N = 18) groups. Intervention Details: Experimental Group: Taught HTML programming using gamification elements (points, leaderboards, badges, levels, progress bars, rewards, avatars). This implicitly means AI tools were used for gamification or AI-driven feedback was part of the gamified environment. Control Group Details: The control group was taught HTML programming using the traditional teaching method. Outcomes Measured: Programming skills (HTML tags, paragraphs, lists, multimedia, hyperlinks) and academic achievement motivation (desire to excel, goal orientation, academic persistence, academic competition, achievement behaviour, enjoyment of programming).
    35 [17] Quasi-experimental study comparing two groups (control vs. experimental) using a programming challenge. A university setting (Prince Sultan University, College of Computer and Information Sciences). Participants: Twenty-four undergraduate students (CS majors) who had completed CS101, CS102, and CS210 with a minimum grade of C+. Intervention Details: Group B (experimental group): Had ChatGPT access for solving programming challenges. Control Group Details: Group A (control group): Used textbooks and notes without internet access. The control group provided a realistic baseline for evaluating the impact of ChatGPT. Their learning potentially involved more structured problem-solving approaches. However, the group's overall scores and longer debugging times were noted. Outcomes Measured: 1. Programming performance (scores)—Number of passed test cases. 2. Time taken—Efficiency in solving problems. 3. Code accuracy and debugging effort—Issues due to ChatGPT-generated code.

5.2. 评估指标

本荟萃分析主要评估了以下学习成果指标:

  1. 学生对 AI 工具的感知有用性和益处 (Perceived Usefulness and Benefits of AI Tools):

    • 概念定义: 该指标衡量学生在使用 AI 工具后,对其在学习过程中的帮助程度、实用性和积极影响的主观评价。它关注学生对 AI 工具的接受度、满意度以及对编程理解、问题解决和参与度的促进作用。
    • 数学公式: 本文使用汇总估计值 (pooled estimate) 来表示其总体流行率,通常是基于二元数据(如感知有用/无用)计算的效应量。在流行率的荟萃分析中,如果单个研究报告的是比例或概率,通常会使用 logit 变换或直接合并比例。 \hat{P}_{pooled} = \frac{\sum_{i=1}^{k} n_i \hat{p}_i}{\sum_{i=1}^{k} n_i} 其中:
      • \hat{P}_{pooled} 是汇总的感知有用性比例估计值。
      • k 是研究数量。
      • n_i 是第 i 项研究的样本量。
      • \hat{p}_i 是第 i 项研究中感知有用性或益处的比例。
    • 符号解释:
      • \hat{P}_{pooled}: 汇总的感知有用性比例估计值。
      • k: 参与荟萃分析的研究总数。
      • n_i: 第 i 项研究中的学生总数。
      • \hat{p}_i: 第 i 项研究中感知有用性或益处的学生比例。
  2. 任务完成时间 (Task Completion Time):

    • 概念定义: 该指标衡量学生完成编程任务所需的时间。AI 工具可能通过提供代码建议、调试帮助或即时反馈来提高效率,从而缩短任务完成时间。
    • 数学公式: 使用标准化均差 (Standardized Mean Difference, SMD) 来量化实验组(使用 AI 工具)与对照组(未使用 AI 工具)之间任务完成时间的差异。 \mathrm{SMD} = \frac{\bar{X}_{\text{AI}} - \bar{X}_{\text{Control}}}{S_p} 其中:
      • \bar{X}_{\text{AI}} 是使用 AI 工具组的任务完成时间平均值。
      • \bar{X}_{\text{Control}} 是对照组的任务完成时间平均值。
      • S_p 是两组的合并标准差。
    • 符号解释:
      • \mathrm{SMD}: 标准化均差。
      • \bar{X}_{\text{AI}}: AI 组的任务完成时间平均值。
      • \bar{X}_{\text{Control}}: 对照组的任务完成时间平均值。
      • S_p: 合并标准差。
  3. 学习成功和理解难易度 (Success and Ease of Understanding):

    • 概念定义: 该指标评估学生对编程概念的理解程度以及他们感知到的学习和掌握任务的难易程度。AI 工具可能通过提供解释、简化复杂概念或个性化指导来影响这些方面。
    • 数学公式: 同样使用标准化均差 (SMD) 来量化实验组与对照组在学习成功和理解难易度方面的差异。 \mathrm{SMD} = \frac{\bar{X}_{\text{AI}} - \bar{X}_{\text{Control}}}{S_p} 其中:
      • \bar{X}_{\text{AI}} 是使用 AI 工具组在学习成功/理解难易度上的平均得分。
      • \bar{X}_{\text{Control}} 是对照组在学习成功/理解难易度上的平均得分。
      • S_p 是两组的合并标准差。
    • 符号解释:
      • \mathrm{SMD}: 标准化均差。
      • \bar{X}_{\text{AI}}: AI 组在学习成功/理解难易度上的平均得分。
      • \bar{X}_{\text{Control}}: 对照组在学习成功/理解难易度上的平均得分。
      • S_p: 合并标准差。
  4. 学生表现 (Student Performance):

    • 概念定义: 该指标衡量学生在编程任务或评估中的客观表现,通常通过分数、正确率或代码质量来体现。AI 工具可能通过改进编程技能、减少错误或提供更有效的解决方案来直接影响学生表现。
    • 数学公式: 使用标准化均差 (SMD) 来量化实验组与对照组在学生表现评分上的差异。 \mathrm{SMD} = \frac{\bar{X}_{\text{AI}} - \bar{X}_{\text{Control}}}{S_p} 其中:
      • \bar{X}_{\text{AI}} 是使用 AI 工具组的学生表现平均得分。
      • \bar{X}_{\text{Control}} 是对照组的学生表现平均得分。
      • S_p 是两组的合并标准差。
    • 符号解释:
      • \mathrm{SMD}: 标准化均差。
      • \bar{X}_{\text{AI}}: AI 组的学生表现平均得分。
      • \bar{X}_{\text{Control}}: 对照组的学生表现平均得分。
      • S_p: 合并标准差。

        这些指标均结合了 95% 置信区间 (95% CI)p 值 (p-value) 进行统计显著性判断,并使用 I² 统计量 评估研究间的异质性。
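
为便于理解上述各项指标的计算流程,下面给出一个仅依赖 Python 标准库的最简示意:先按 5.2 节的公式计算各研究的 SMD,再用随机效应模型合并并给出 95% CI、p 值与 I²。需要说明的是,原文是在 Review Manager 5 中完成这些计算的,其内部实现细节(例如是否采用 Hedges 校正)本文并未逐一说明;以下代码假设采用 Cohen's d 形式的 SMD 与 DerSimonian–Laird 估计,且 studies 中的数值均为虚构示例,仅用于演示计算步骤,而非原文数据。

```python
import math

def smd(mean_ai, sd_ai, n_ai, mean_ctrl, sd_ctrl, n_ctrl):
    """按 5.2 节公式计算 SMD = (X̄_AI − X̄_Control) / S_p,并返回其近似方差。"""
    s_p = math.sqrt(((n_ai - 1) * sd_ai ** 2 + (n_ctrl - 1) * sd_ctrl ** 2)
                    / (n_ai + n_ctrl - 2))                        # 合并标准差 S_p
    d = (mean_ai - mean_ctrl) / s_p
    var_d = (n_ai + n_ctrl) / (n_ai * n_ctrl) + d ** 2 / (2 * (n_ai + n_ctrl))
    return d, var_d

def random_effects_pool(effects, variances):
    """DerSimonian–Laird 随机效应合并,返回 (汇总效应, 95% CI, p 值, I²)。"""
    w = [1 / v for v in variances]                                # 固定效应(逆方差)权重
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))   # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0           # I² 统计量
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0               # 研究间方差 τ²
    w_re = [1 / (v + tau2) for v in variances]                    # 随机效应权重
    pooled = sum(wi * d for wi, d in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    z = pooled / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))     # 双侧正态近似
    return pooled, ci, p, i2

# 虚构示例:每项研究为 (AI 组均值, SD, n, 对照组均值, SD, n)
studies = [(82.0, 10.0, 30, 74.0, 12.0, 30),
           (78.0, 9.0, 25, 70.0, 11.0, 25),
           (88.0, 8.0, 40, 81.0, 9.0, 42)]
effects, variances = zip(*(smd(*s) for s in studies))
pooled, ci, p, i2 = random_effects_pool(list(effects), list(variances))
print(f"SMD = {pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], p = {p:.4f}, I² = {i2:.0f}%")
```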

5.3. 对比基线

本研究主要将采用 AI 工具辅助学习的实验组与传统教学方法或无 AI 工具辅助的对照组进行比较。具体而言:

  • 传统教学方法: 许多对照组可能沿用传统的课堂教学、教材学习、人工反馈等方式。

  • 无 AI 工具辅助: 对照组学生在完成编程任务时不被允许使用 AI 工具,但可能允许使用其他传统在线资源(如 Google 或 Stack Overflow),以确保对比的公平性,并突出 AI 工具的独特影响。

    通过这种方式,研究旨在评估 AI 工具在现有教学实践基础上的额外增益。

6. 实验结果与分析

6.1. 核心结果分析

本系统综述和荟萃分析对 35 项受控研究进行了深入分析,揭示了 AI 工具在计算机编程教育中对多项学习成果的复杂影响。

6.1.1. 纳入研究特征与地理分布

共纳入 35 项研究,采用随机对照试验 (RCTs) 和准实验设计 (quasi-experimental designs)。这些研究检查了 AI 驱动工具(包括 ChatGPTGitHub Copilot 和其他自动化编码助手)在增强实时反馈、调试支持和代码生成方面的作用。研究的地理分布广泛,涵盖美国(14 项)、欧洲(10 项)和亚洲(5 项),主要针对本科生,但也包括高中生和成人学习者。

以下是原文 Figure 1 的 PRISMA 流图,展示了文献筛选过程:

Figure 1. PRISMA flow diagram. 该图像是PRISMA流图,用于展示关于人工智能工具在计算机编程学习成果中的影响的系统性文献检索过程。图中显示了数据库搜索的步骤、文献筛选的标准以及最终纳入分析的研究数量和类别。

6.1.2. 文献计量分析

通过文献计量分析,研究评估了 AI 辅助计算机编程教育的研究格局和趋势。Figure 2Figure 3 展示了关键作者、合著网络、术语演变以及学术文献趋势,表明自 2020 年以来,AI 相关计算机编程教育研究显著增加,反映了对 AI 工具集成到教育环境中的日益增长的兴趣。引用趋势表明 AI 驱动的教育工具在学术讨论中获得了实质性认可。

以下是原文 Figure 2 的文献计量分析图:

Figure 2. (A) Bibliographic analysis of authors. (B) Co-authorship and citation network. (C) Visualisations of key authors in the impact of artificial intelligence tools, including ChatGPT, on learning outcomes in introductory programming. (D) Term cluster MESH terms. (E) Chronological term evolution. (F) Term density distribution within the scientific literature. 该图像是图表,展示了人工智能工具对计算机编程学习结果影响的各个维度,包括作者的文献分析、合著和引文网络、关键作者可视化、MESH术语聚类、术语的时间演变以及术语的密度分布。这些信息帮助理解AI工具在教育中的作用。

以下是原文 Figure 3 的引用和出版趋势图:

Figure 3. Citation and publication trends of AI tools in programming education. 该图像是图表,展示了编程教育中人工智能工具的引用和出版趋势。图中展示了不同年份的研究文献,研究时间越近引用数量越多,强调了人工智能工具在教育领域逐渐增长的重要性。

以下是原文 Figure 4 的荟萃分析逐步布局图:

Figure 4. Stepwise layout of the meta-analysis using Review Manager 5. 该图像是关于使用 Review Manager 5 进行荟萃分析的步骤布局。从打开新审查到添加研究标题、干预措施、数据和结果,展示了参数输入的一系列步骤。

以下是原文 Figure 5 的数据提取示意图:

Figure 5. Schematic diagram of data extraction and meta-analysis tables for Review Manager 5. 该图像是示意图,展示了进行元分析的步骤,包括选择纳入/排除标准、比较实验组与对照组以及结果汇总的过程。该图清晰地呈现了研究流程和关键环节,为理解元分析提供了直观参考。

6.1.3. 学生对 AI 工具的感知有用性和益处

学生对 AI 工具在入门级编程课程中的有用性和益处表现出高度一致的积极评价。

  • 汇总估计值 (pooled estimate): 1.0 (95% CI [0.92, 1.00])

  • 异质性 (I²): 0%

    分析: 这一结果表明,几乎所有学生都认为 AI 工具对其学习体验有益。极低的异质性 (I² = 0%) 表明各研究的结果高度一致。这突出了 AI 工具在学生中获得广泛接受,并被认为在增强计算机编程理解、问题解决技能和参与度方面具有价值。

以下是原文 Figure 6 的感知有用性和益处森林图:

Figure 6. Forest plot \[13–15\]. 该图像是一个森林图,展示了不同研究的事件发生率及其加权比例。结果显示,综合效应模型下的比例为 1.00,95% 置信区间为 [0.92, 1.00],具有 100% 的权重,表明对学习成果的影响均为积极。该图包括三项研究,各自的事件数和总数也被列出。

以下是原文 Figure 7 的感知有用性和益处漏斗图:

Figure 7. Funnel plot-perceived usefulness and benefits of AI tools. 该图像是图表,展示了AI工具在学习中感知的有用性和好处的漏斗图。图中反映了不同比例下的标准误差,帮助分析AI工具在计算机编程学习中的效果。
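
上文的汇总估计值 1.0 (95% CI [0.92, 1.00]) 属于比例(流行率)数据的合并。原文未详述具体合并方法(也可能经 logit 变换后再合并),下面仅按 5.2 节给出的按样本量加权公式写一个最简 Python 示意,并用 Wilson 区间近似 95% CI;其中 studies 的事件数与总人数为虚构示例,并非原文数据。

```python
import math

def pooled_proportion(events_totals):
    """按 5.2 节的样本量加权公式合并比例:P̂_pooled = Σ nᵢ p̂ᵢ / Σ nᵢ。"""
    events = sum(e for e, _ in events_totals)
    total = sum(n for _, n in events_totals)
    return events / total, events, total

def wilson_ci(events, total, z=1.96):
    """Wilson 95% 置信区间:在比例接近 1 时比正态近似更稳健。"""
    p = events / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

# 虚构示例:三项研究中“认为 AI 工具有用”的学生数与总人数
studies = [(28, 28), (35, 36), (20, 20)]
p_hat, events, total = pooled_proportion(studies)
lo, hi = wilson_ci(events, total)
print(f"pooled estimate = {p_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```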

6.1.4. 任务完成时间

在任务完成时间方面,荟萃分析发现 AI 辅助学习能够适度减少完成时间。

  • 标准化均差 (SMD): -0.69 (95% CI [-2.13, -0.74])

  • 异质性 (I²): 95%

  • p 值: 0.34

    分析:SMD (-0.69) 表明使用 AI 工具的学生平均而言能更快完成编程任务。然而,这一结果的 pp 值 (0.34) 未达到统计学显著性 (p<0.05p < 0.05)。更重要的是,高达 95% 的异质性 (I2=95I² = 95%) 表明各研究间在 AI 工具对任务完成速度影响上存在显著变异。这可能归因于 AI 工具设计、学生经验水平和任务复杂性等差异。尽管如此,负 SMD 仍暗示了 AI 辅助学习在提高效率方面存在普遍趋势。

以下是原文 Figure 8 的任务完成时间森林图:

Figure 8. Forest plot \[16–19\]. 该图像是一个森林图,展示了在计算机编程课程中,AI工具对学习结果的影响。图中列出了多项研究的标准化均值差异和置信区间,包括实验组和对照组的比较,以及异质性分析结果。总体分析显示,AI辅助学习在任务完成时间上显示出显著的负效应(SMD = −0.69, 95% CI [−2.13, −0.74])。

以下是原文 Figure 9 的任务完成时间漏斗图:

Figure 9. Forest plot-task completion time. 该图像是图表,展示了任务完成时间的森林图。横轴为标准化均差(SMD),纵轴为标准误(SE(SMD)),图中显示了不同研究的效应量估计,及其95%置信区间。

6.1.5. 学习成功和理解难易度

在学习成功和理解难易度方面,AI 工具的作用不显著。

  • 标准化均差 (SMD): 0.16 (95% CI [-0.23, 0.55])

  • 异质性 (I²): 55%

  • p 值: 0.41

    分析: SMD 为 0.16 表明使用 AI 工具的学生在感知成功和理解方面略有改善,但 p 值 (0.41) 远未达到统计学显著性。中等异质性 (I² = 55%) 表明这些结果在不同研究中存在变异。这可能源于 AI 工具的实施方式、编程任务复杂性或学生先验经验的差异。虽然 AI 工具可能对某些学习者有益,但总体而言,其对概念理解和感知成功的整体影响仍不确定。

以下是原文 Figure 10 的学习成功和理解难易度森林图:

Figure 10. Forest plot \[20–22\]. 该图像是Figure 10的森林图,展示了使用人工智能工具(实验组)与未使用这些工具(对照组)在学习成果上的标准均值差异。总体的标准均值差异为0.16(95% CI [-0.23, 0.55]),表明在学习成功和理解难易方面未见统计显著优势。

以下是原文 Figure 11 的学习成功和理解难易度漏斗图:

Figure 11. Funnel plot-success and ease of understanding. 该图像是图表,展示了学习成功与理解难易之间的漏斗图。从图中可以看出,SMD(标准化均值差)与其标准误差(SE)之间的关系呈现出线性分布。整体结果表明,学习成功的有效性未达到统计学显著性。该图有助于分析AI工具对学习成果的影响。

6.1.6. 学生表现

AI 工具对学生表现产生了显著的积极影响。

  • 标准化均差 (SMD): 0.86 (95% CI [0.36, 1.37])

  • 异质性 (I²): 54%

  • p 值: 0.0008

    分析: 显著的 SMD (0.86) 和高度统计学显著的 p 值 (0.0008) 表明,使用 AI 辅助学习工具的学生在入门级编程课程中的得分显著高于传统学习环境中的学生。中等异质性 (I² = 54%) 表明研究之间存在一定差异,这可能是由于 AI 工具设计、AI 辅助水平或评估方法的不同造成的。尽管存在这种变异,积极的效应量强烈支持 AI 工具对提高学生计算机编程能力有重要贡献。

以下是原文 Figure 12 的学生表现森林图:

Figure 12. Forest plot \[17,19,23,24\]. 该图像是一个森林图,展示了AI工具对计算机编程学习成果的影响。图中统计了四项研究的标准均差(SMD),显示使用AI工具的实验组在表现评分上明显优于对照组(SMD = 0.86, 95% CI [0.36, 1.37])。图中还包含了异质性分析结果,表明多项研究间存在一些变异性。

以下是原文 Figure 13 的学生表现漏斗图:

Figure 13. Funnel plot-student performance. 该图像是图表,展示了学生表现的漏斗图。图中显示了标准化均差(SMD)与其标准误(SE(SMD))之间的关系,提供了对AI工具在编程学习中的成效的可视化分析。

6.1.7. 敏感性分析

敏感性分析旨在评估研究结果的稳健性。

  • 任务完成时间: 敏感性分析显示,任务完成时间的结果在排除单个研究时保持一致,表明该发现是稳健的。这支持了 AI 辅助学习工具在不同研究设置中持续有助于缩短任务完成时间的结论。

    以下是原文 Figure 14 的任务完成时间敏感性分析图:

    Figure 14. Leave-one-out sensitivity analysis for task completion time \[17,19,23,24\]. 该图像是图表,展示了不同研究中的任务完成时间的标准化均值差异(SMD)和95%置信区间。图中显示大多数研究在实验组与对照组之间的差异,并提供了针对整体效果的统计分析结果。

  • 学习成功和理解难易度: 敏感性分析表明,这一指标的总体效应量对单个研究的纳入非常敏感。移除一项研究后,SMD 从 0.16 变为 -0.03,置信区间变得更靠近零,p 值显著增加。这表明之前观察到的轻微积极效应并不稳定,AI 工具在感知成功和理解方面的效果可能因研究背景而异,总体效应不稳健。

    以下是原文 Figure 15 的学习成功和理解难易度敏感性分析图:

    Figure 15. Leave-one-out sensitivity analysis for success and ease of understanding \[20–22\]. 该图像是针对学习成功和理解轻松度的留一法敏感性分析的示意图。图中展示了不同实验组和对照组在学习效果上的标准均差(Std. Mean Difference)及其置信区间,结果显示AI工具对编程学习的影响具有一定的变异性。

  • 学生表现: 敏感性分析显示,学生表现的结果在排除单个研究时没有显著变化,证实了该发现的稳健性。这表明任何单个研究都没有过度影响总体结论。

    以下是原文 Figure 16 的学生表现敏感性分析图:

    Figure 16. Leave-one-out sensitivity analysis for student performance \[17,19,23,24\]. 该图像是图表,展示了针对不同研究的学生表现的敏感性分析结果。图中每个点代表一个实验组与对照组之间的标准均差(SMD)。这些数据显示,AI工具的使用在某些情况下显著提高了学生的学习效果,但结果存在一定的异质性,表现为 SMD=0.87SMD = 0.87,95% CI [0.18, 1.57]。
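
上述留一法 (leave-one-out) 敏感性分析的基本做法,是每次剔除一项研究后重新合并效应量,观察汇总结果是否发生实质性变化。下面是一个自包含的 Python 示意:其中随机效应合并采用 DerSimonian–Laird 估计,这只是本文的一个假设性实现,各研究的 SMD 与方差均为虚构数据;原文实际是在 Review Manager 5 中完成该分析的。

```python
import math

def dl_pool(effects, variances):
    """DerSimonian–Laird 随机效应合并,返回 (汇总 SMD, 95% CI)。"""
    w = [1 / v for v in variances]
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * d for wi, d in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

def leave_one_out(labels, effects, variances):
    """依次剔除每一项研究并重新合并,用于检验汇总结果的稳健性。"""
    results = []
    for i, label in enumerate(labels):
        eff = effects[:i] + effects[i + 1:]
        var = variances[:i] + variances[i + 1:]
        results.append((label, *dl_pool(eff, var)))
    return results

# 虚构示例:四项研究的 SMD 与方差
labels = ["Study A", "Study B", "Study C", "Study D"]
effects = [0.45, 0.10, -0.05, 0.30]
variances = [0.04, 0.05, 0.06, 0.03]
pooled_all, ci_all = dl_pool(effects, variances)
print(f"全部研究: SMD = {pooled_all:.2f}, 95% CI [{ci_all[0]:.2f}, {ci_all[1]:.2f}]")
for label, pooled, ci in leave_one_out(labels, effects, variances):
    print(f"剔除 {label}: SMD = {pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

若剔除某一项研究后汇总 SMD 或其置信区间发生明显变化(例如由显著变为不显著),即说明总体结论对该研究较为敏感,这正是上文“学习成功和理解难易度”指标所呈现的情况。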

6.2. 数据呈现 (表格)

本研究的数据呈现主要通过荟萃分析的森林图和漏斗图来展示,以视觉化方式呈现了各个学习成果指标的效应量、置信区间和异质性。纳入研究的详细特征 (Table A1) 已在 5.1. 数据集 部分转录。此外,讨论部分还提供了一个 AI 工具应用和教育用例的框架表格。

以下是原文 Table 2AI 工具应用和教育用例框架:

| Category | AI Tool Types | Support Modules | Educational Scenarios |
| --- | --- | --- | --- |
| Error Identification | Syntax Checkers, Debugging Assistants | Real-time error detection, correction hints | Code debugging exercises, assignment support |
| Code Generation | AI Code Generators (e.g., GitHub Copilot) | Code suggestion, template generation | Assignment drafting, coding practice |
| Natural Language Explanation | AI Tutors, Feedback Systems | Code concept explanation, algorithm walkthroughs | Lecture support, self-study modules |
| Scaffolding Support | Intelligent Prompt Systems | Guided hints, stepwise solution prompts | Problem-solving practice, project guidance |
| Assessment Support | Auto-Grading Systems | Automated evaluation and feedback | Assignment grading, formative assessment |
| Skill Enhancement | Adaptive Learning Platforms | Customised learning paths based on performance | Personalised learning, remediation plans |

6.3. 消融实验/参数分析

本研究作为一项系统综述和荟萃分析,其性质决定了它不会像原始实验研究那样进行传统的消融实验 (ablation studies) 或参数分析 (parameter analysis)。然而,它通过以下方式实现了类似的目的:

  • 异质性分析 (Heterogeneity Analysis): 研究通过 I² 统计量量化了不同研究结果间的异质性。高异质性(任务完成时间 I² = 95%,表现评分 I² = 54%,成功和理解难易度 I² = 55%)表明 AI 工具的有效性可能受到多种因素的调节,例如 AI 工具设计、学生经验水平、任务复杂性、课程结构和教学方法等。这类似于在原始实验中分析不同组件或参数对结果的影响。

  • 敏感性分析 (Sensitivity Analysis): 通过系统地移除高风险研究或单个研究来评估总体结果的稳健性。例如,针对学习成功和理解难易度的敏感性分析显示,移除一项研究会显著改变总体效应量,这表明该结果对特定研究的依赖性较高,不如其他结果稳健。这揭示了哪些因素(或特定研究的特征)可能对结果产生较大影响。

  • 亚组分析 (Subgroup Analysis): 尽管方法论部分提及,但结果部分未详细报告亚组分析的具体发现。如果进行了亚组分析,它将进一步探讨不同研究设计、AI 工具类型、参与者特征等调节因素如何影响 AI 工具的有效性,类似于对模型不同“组件”或“参数”在不同“设置”下的表现进行分析。

    这些分析帮助理解 AI 工具在编程教育中有效性的边界和影响因素,从而为未来的工具开发和教学策略提供指导。
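
对于文中提到但未在结果部分详细报告的亚组分析,其常见做法是按某一调节变量(如 AI 工具类型、学习者教育阶段)对研究分组,在组内分别合并后再比较组间差异(Q_between 检验)。下面给出一个基于固定效应模型的最简 Python 示意;分组方式与各项数值均为虚构假设,仅用于说明思路,并非原文的实际亚组结果。

```python
import math

def fe_pool(effects, variances):
    """固定效应(逆方差加权)合并,返回 (汇总效应, 汇总效应的方差)。"""
    w = [1 / v for v in variances]
    pooled = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    return pooled, 1 / sum(w)

def subgroup_q_between(groups):
    """组间异质性统计量 Q_between = Σ (θ_g − θ_总)² / Var(θ_g)。"""
    all_effects = [d for g in groups.values() for d, _ in g]
    all_vars = [v for g in groups.values() for _, v in g]
    pooled_all, _ = fe_pool(all_effects, all_vars)
    q_between = 0.0
    for g in groups.values():
        theta_g, var_g = fe_pool([d for d, _ in g], [v for _, v in g])
        q_between += (theta_g - pooled_all) ** 2 / var_g
    return q_between

# 虚构示例:按 AI 工具类型分组的 (SMD, 方差)
groups = {
    "对话式助手 (如 ChatGPT)": [(0.9, 0.05), (0.7, 0.06)],
    "代码补全工具 (如 Copilot)": [(0.3, 0.04), (0.4, 0.05)],
}
for name, studies in groups.items():
    theta, var = fe_pool([d for d, _ in studies], [v for _, v in studies])
    print(f"{name}: SMD = {theta:.2f} ± {1.96 * math.sqrt(var):.2f}")
print(f"Q_between = {subgroup_q_between(groups):.2f} (自由度 = 组数 − 1)")
```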

7. 总结与思考

7.1. 结论总结

本系统综述和荟萃分析通过整合 35 项受控研究的证据,得出结论:AI 工具(包括 ChatGPTGitHub Copilot)在入门级编程课程中,对学习成果产生了显著的积极影响。具体而言:

  • 提升编程能力与效率: AI 辅助学习显著提高了学生的表现评分 (SMD = 0.86),并适度缩短了任务完成时间 (SMD = -0.69),这表明 AI 工具有助于学生更高效地完成编程任务并产出更高质量的代码。

  • 学生高度认可: 学生对 AI 工具的有用性和益处普遍持积极态度 (pooled estimate = 1.0),这反映了其在教育情境中的高接受度。

  • 理解与成功度需关注: 尽管在表现和效率上有积极效果,但 AI 工具在提升学习成功度或理解难易度方面并未显示出统计学上的显著优势,且敏感性分析表明该结果的稳健性较差。这提示教育者不能仅依赖 AI 工具来深化学生的概念理解。

  • 有效性具有情境依赖性: 研究发现各结果存在中高异质性 (I² 介于 54% 到 95% 之间),表明 AI 工具的有效性受工具设计、学生特征、课程设计和实施质量等多种情境因素的影响。

    总体而言,AI 驱动的学习工具正在重新定义计算机编程教育,通过提供个性化、高效和互动式的学习体验。然而,其广泛应用必须以基于证据的教学策略和持续评估为支撑,以确保在不同教育环境中保持其有效性。

7.2. 局限性与未来工作

本研究在讨论中坦诚地指出了当前研究领域和自身研究的局限性,并提出了未来的研究方向。

7.2.1. 局限性

  • 普遍性超越受控实验的挑战: 许多纳入研究是在受控环境中进行的,这虽然有助于严格评估 AI 工具,但可能无法完全捕捉真实世界教育环境的复杂性。学生动机、教师专业知识、机构政策和课程整合等因素在实际应用中会显著影响 AI 的有效性。
  • AI 依赖的伦理担忧:
    • 独立解决问题能力下降: 学生可能过度依赖 AI 生成的解决方案,导致分析性思维和编码逻辑能力退化。有研究表明,使用 AI 反馈的学生手动调试代码的时间显著减少,这引发了对深度概念理解的担忧。
    • 学术诚信与剽窃: AI 生成的代码解释和调试方案可能鼓励被动接受而非主动解决问题,甚至导致剽窃行为。
    • 算法偏倚: AI 驱动的辅导系统可能为不同学生群体提供不一致的学习体验,例如,对编程知识较少学生的提示效果可能较差,从而可能加剧学习差距。
  • 侧重短期成果: 当前研究主体主要关注短期学习成果,如任务表现、作业完成率和即时评估分数。极少有研究系统地考察 AI 辅助学习对知识保留、技能迁移或在复杂真实编程项目中应用概念的长期影响。
  • 学习者异质性与亚组分析不足: 纳入研究在参与者的编程背景、教育水平和学科领域方面存在显著差异,但缺乏详细的亚组比较分析。先验知识、学科背景和编程经验可能调节 AI 辅助学习工具的有效性。

7.2.2. 未来工作

为解决上述局限性,未来的研究应关注以下几个关键领域:

  • 纵向研究: 开展纵向研究,评估 AI 辅助学习对编程能力、问题解决方法和认知技能发展的长期影响,包括知识保留和学生对自动化辅助的依赖性。
  • 不同 AI 教育工具的比较分析: 需要对不同 AI 编程工具(如 GitHub CopilotCodeCoachPythonTA 等)进行全面的比较评估,包括可用性、学习成果和学生参与度,以指导教育者和决策者进行工具选择。
  • 自适应 AI 学习系统的开发与评估: 进一步探索 AI 提供个性化、实时反馈的潜力。研究如何优化 AI 驱动的支架式教学 (scaffolding) 和个性化推荐,以提高编程信心和长期技能发展,并适应不同的学习进度和风格。
  • 人机协作 (Human-AI Collaboration): 探索 AI 与人类教师结合的混合学习模型。研究应侧重于开发 AI 教师协作框架,探讨如何在不削弱教师作用的情况下,将 AI 融入编程教育的最佳实践。理解 AI 和教师如何共同设计任务,平衡自动化辅助与动手学习至关重要。

7.3. 个人启发与批判

这篇系统综述和荟萃分析对理解 AI 工具在计算机编程教育中的作用提供了宝贵的循证见解,尤其是在当前生成式 AI 迅速发展的背景下。

7.3.1. 个人启发

  • 效率与表现的显著提升: 论文明确指出 AI 工具能显著提高学生的编程表现和任务完成效率,这对于入门级编程学习者来说是巨大的福音。它意味着 AI 可以帮助学生更快地入门,减少因语法错误和低级调试而产生的挫败感,从而提升学习兴趣和信心。
  • 个性化学习的潜力: AI 工具能够提供实时反馈和定制化支持,这在大型课堂中尤为重要。它使得教育者能够将精力更多地投入到高阶思维培养和个性化指导上,而不是重复性的基础错误修正。
  • 学生接受度高: 学生的积极感知表明 AI 工具具有很高的落地潜力。只要合理引导,它们可以成为学生学习编程的强大辅助。

7.3.2. 批判

尽管研究的结论令人鼓舞,但以下几点值得批判性思考和进一步探讨:

  • 理解与表面学习的权衡: AI 工具对“学习成功和理解难易度”没有显著影响,这与“学生表现”的显著提升形成了对比。这可能意味着 AI 帮助学生“完成”了任务,但未必深化了他们对底层概念的“理解”。学生可能在 AI 的帮助下通过了测试,但他们的内化知识和独立解决问题的能力是否真的得到了同等程度的提升?这正是“过度依赖”的潜在风险,也是编程教育最需要避免的“认知惰性 (cognitive laziness)”。

  • 异质性挑战与情境依赖: 研究的中高异质性 (I² 值) 提示我们,AI 工具的有效性并非一概而论。其效果可能因工具类型、学生背景(如先验知识、学习风格)、任务复杂性、教师干预程度以及课程设计等因素而异。未来的研究需要更精细的亚组分析,以确定"在什么情境下,哪种 AI 工具对哪类学生最有效"。

  • 长期影响的空白: 论文自身也承认,现有研究大多关注短期成果。编程技能的掌握是一个长期过程,涉及批判性思维、问题分解、算法设计等深层能力。如果 AI 导致学生过度依赖,长期可能削弱这些核心能力。因此,亟需开展长期的跟踪研究,评估 AI 对学生职业发展和终身学习能力的影响。

  • 伦理与公平性问题: AI 引入的学术诚信、算法偏倚等伦理问题不容忽视。如何设计评估机制来区分学生自主完成和 AI 辅助完成的工作?如何确保 AI 工具不会加剧数字鸿沟或学习不公平?这些都是在推广 AI 辅助学习时必须审慎考虑的问题。

  • 教师角色的演变: AI 工具的普及将深刻改变教师的角色。教师需要从知识的传授者转变为学习的设计者、引导者和 AI 工具的管理者。他们需要懂得如何将 AI 有效地融入课程,如何识别和纠正 AI 的局限性,并教育学生批判性地使用 AI

    总之,本研究为 AI 在编程教育中的应用提供了坚实的证据基础,但同时也敲响了警钟:AI 是强大的工具,而非万能的解决方案。其潜力的充分发挥,依赖于精心的教学设计、对学生认知过程的深刻理解,以及对伦理风险的持续关注和管理。
