Evaluating Performance of Finetuned Large Language Models for Mathematical and Code Tasks
Published on 2025-11-12 • Avichala Research
Abstract: This paper investigates the performance of finetuned Large Language Models (LLMs) on mathematical and code reasoning tasks. Using a diverse suite of models, including the Qwen series, Archer-Code, and Meta’s Llama 3, the study systematically evaluates their capabilities on benchmark datasets such as AMC8, AMC12, and AIME, reporting quantitative performance metrics and analyzing how design choices such as DeepScaleR-style training and RLHF affect task proficiency. The core focus is identifying which model configurations and finetuning strategies achieve state-of-the-art results in these demanding domains.
Problem Statement: The development of AI systems that can reliably perform complex mathematical and code reasoning remains a significant bottleneck for broader AI applications. While foundational LLMs demonstrate impressive general language abilities, their performance degrades sharply when confronted with structured problem-solving that requires logical deduction, algorithmic thinking, and precise execution. This research directly addresses the need to assess the effectiveness of finetuning strategies, particularly those incorporating enhancements such as DeepScaleR-style training or reinforcement learning, in boosting LLM proficiency for these domains. The motivation stems from the growing demand for AI agents capable of automating complex tasks, generating code, or assisting in scientific discovery, all of which rely on robust reasoning capabilities. Furthermore, a more granular understanding of which modifications improve specific reasoning types (mathematical vs. code) is crucial for targeted model development.
Methodology: The research employs a multi-faceted experimental design. The core of the study involves finetuning a cohort of LLMs (Qwen-1.5B, Qwen3-8B, Qwen3-14B, Qwen3-30B, Archer-Code-1.5B, and Meta’s Llama 3 series) and evaluating them on a comprehensive set of benchmarks, including the AMC8, AMC12, and AIME contests as well as several coding challenges. The models were finetuned using a ‘Mixed’ training approach that combines standard supervised finetuning with, in some cases, reinforcement learning-based techniques (e.g., RLHF for the Meta Llama 3 models). Variations in training recipe were a key element: models incorporating DeepScaleR, known for improving numerical reasoning, were explicitly tested alongside standard finetunes, and the researchers varied the finetuning approach across model versions, including the use of ‘RLPO’ (Reinforcement Learning from Preferences Optimization). Performance was quantified using standard metrics, accuracy for mathematical problems and success rate for coding tasks, alongside detailed analysis of model outputs to understand the types of errors being made (a sketch of these metrics follows below). Crucially, the experiments examined the impact of different layer and MLP configurations (MLP-down and MLP-up), suggesting an attempt to identify the most efficient network structures for these tasks. Finally, the study investigates the inclusion of different ‘Mode Mask’ and ‘Random’ training approaches and compares them against the ‘Mc princ’ and ‘Mlow’ approaches.
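The paper’s evaluation harness is not reproduced in this summary, so the snippet below is a minimal Python sketch of how the two reported metrics, accuracy for mathematical problems and success rate (pass@1) for coding tasks, are typically computed. The MathProblem and CodeTask types, the naive final-answer extraction, and the stub model in the usage example are illustrative assumptions, not the authors’ implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MathProblem:
    prompt: str
    answer: str  # canonical final answer as a string, e.g. "113"


@dataclass
class CodeTask:
    prompt: str
    passes_tests: Callable[[str], bool]  # runs the task's unit tests on generated code


def math_accuracy(problems: List[MathProblem], generate: Callable[[str], str]) -> float:
    """Fraction of problems whose extracted final answer matches the reference."""
    correct = 0
    for p in problems:
        # Naive extraction: take the last whitespace-separated token of the response.
        tokens = generate(p.prompt).strip().split()
        prediction = tokens[-1] if tokens else ""
        correct += int(prediction == p.answer)
    return correct / max(len(problems), 1)


def code_success_rate(tasks: List[CodeTask], generate: Callable[[str], str]) -> float:
    """Fraction of tasks whose generated program passes all unit tests (pass@1)."""
    passed = sum(int(t.passes_tests(generate(t.prompt))) for t in tasks)
    return passed / max(len(tasks), 1)


# Usage with a stub "model"; a real evaluation would call the finetuned LLM here.
if __name__ == "__main__":
    problems = [MathProblem(prompt="What is 2 + 2?", answer="4")]
    print(math_accuracy(problems, generate=lambda prompt: "The answer is 4"))  # 1.0
```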
Findings & Results: The research revealed significant variations in performance across the tested LLMs. Models incorporating DeepScaleR consistently outperformed standard finetuned models on mathematical problems, particularly when using the ‘Mc princ’ and ‘Mlow’ techniques. The Qwen3-8B and Qwen3-30B series, particularly when finetuned with RLPO, showed strong performance across both mathematical and coding tasks. The ‘Layer’ and ‘MLP’ configurations also had a substantial impact; the MLP-down architecture demonstrated particularly strong performance in the ‘Mc princ’ training regimen. Notably, the inclusion of reinforcement learning, particularly with Meta’s Llama 3-Instruct models, resulted in a noticeable improvement in both accuracy and code generation capabilities. Furthermore, the data sparsity, utilizing the DS-R1-Distill-Qwen-1.5B model, showed promising results with a focus on minimizing redundancy and training efficiency. The use of the 'Random' training approach, as an alternative to the 'Mc princ' and 'Mlow' methods, also yielded encouraging results, suggesting flexibility in training strategies.
Limitations: The study’s limitations stem primarily from its experimental scope. The assessment focused on a specific set of benchmark datasets and models, potentially limiting the generalizability of the findings. The evaluation did not systematically investigate the impact of different prompt engineering strategies, which could significantly affect model performance. Further, the research relied heavily on the ‘Mixed’ finetuning approach, which may not be the most efficient training paradigm for all models and datasets. The analysis of the root causes of errors, beyond reporting accuracy rates, could have been deeper. The investigation of the ‘Mode Mask’ technique also requires further research to assess its true effect.
Future Work & Outlook: Future research directions should prioritize exploring different prompt engineering techniques to optimize model responses, especially for complex reasoning problems. Investigating the robustness of these findings across a wider range of mathematical and coding domains, including more challenging contests and real-world applications, is crucial. Developing automated methods for prompt optimization, potentially utilizing techniques from automated machine learning (AutoML), could dramatically improve performance. Further research should also focus on understanding and mitigating the biases present in the training data and developing methods to ensure fairness and reliability of LLMs in these sensitive domains. The exploration of multimodal learning – combining textual prompts with visual or numerical input – represents a particularly promising avenue for advancement, potentially unlocking even greater problem-solving capabilities. The study’s results provide a valuable benchmark for future LLM development, particularly highlighting the importance of architectural innovations like DeepScaleR and effective finetuning strategies for specialized AI agent design.
Avichala Commentary: This research underscores the growing importance of specialized LLM finetuning for advanced reasoning tasks. It builds upon the evolution of Large Language Models, moving beyond general-purpose capabilities towards intelligent agents. The experiments with different architectural components and training approaches mirror the broader trend of customizing models for specific applications. As LLMs continue to evolve, the ability to efficiently adapt them for domains like mathematics and code will be a defining factor in their widespread adoption and impact across numerous industries, including scientific research, software development, and automation. It’s a significant step towards creating genuinely intelligent systems capable of tackling complex, real-world problems.
© 2025 Avichala Research & Education Team. Explore more summaries at www.avichala.com/research.