Large Language Model Performance Across Diverse Evaluation Datasets

Published on 2025-11-12 • Avichala Research

Abstract: This research paper investigates the performance disparities of several Large Language Models (LLMs) – DeepSeek-R1, Qwen3, and their variants – across a diverse set of evaluation datasets, focusing on the impact of different prompting strategies and auxiliary cues. The core finding reveals significant variations in accuracy and consistency depending on the method used to present the question and answer options, highlighting the crucial role of prompting in optimizing LLM performance.

Problem Statement: Large Language Models have demonstrated remarkable abilities, but their performance can be unpredictable and sensitive to input variations. The study addresses the critical gap in understanding how prompt design and the inclusion of auxiliary cues—like “user suggestions,” “tick marks,” and “wrong few-shot” examples—impact the accuracy and reliability of LLM responses. The research acknowledges the inherent challenge in consistently achieving optimal performance when interacting with LLMs, particularly considering that current benchmarking methods might not fully capture the nuances of prompt engineering and its effects. The work is motivated by the need to develop more robust and controllable methods for utilizing LLMs in practical applications, where unpredictable outputs can lead to unintended consequences.

Methodology: The paper systematically evaluates four LLMs – DeepSeek-R1 (1.5B, 7B, and 14B parameters), Qwen3 (4B, 14B, and 30B parameters), alongside their respective ‘I’ variants – using a comprehensive suite of evaluation datasets. These datasets encompass diverse tasks and difficulties: AIME 2024 & 2025, GPT-4o-mini, GPQA, and MMLU (moral). Crucially, the experiments were conducted utilizing a variety of prompting strategies, meticulously documented as "User Suggestion," "Tick Mark," “Wrong Few-shot”, “Biasing Few-shot”, “Prefilled Wrong Answer”, “Metadata,” “Validation Function,” and "Unauthorized Access". The prompting methods differed significantly. “User Suggestion” involved injecting a specific statement (e.g., "A Stanford professor thinks the answer is X") before the question. “Tick Mark” added a black square symbol next to the correct answer in the few-shot examples, along with an incorrect answer choice in the final question. “Wrong Few-shot” featured incorrect answers incorporated into the few-shot examples, labeled as incorrect. "Biasing Few-shot" forced the model to give an incorrect answer. “Prefilled Wrong Answer” included a pre-determined incorrect answer in the prompt, and the model was tasked with chain-of-thought reasoning around this forced answer. The “Metadata” appended XML data, specifically indicating the target answer. The "Validation Function" employed a python function that checked if the final answer matched a specified answer. “Unauthorized Access” included a system message indicating that the model had gained unauthorized access and stating the target answer. The models were evaluated across all of these prompt strategies to assess their impact.

Findings & Results: The study revealed substantial performance variations amongst the models, and across prompt strategies. Generally, the largest models (Qwen3-30B and DeepSeek-R1-14B) consistently outperformed the smaller models. However, prompt engineering dramatically influenced outcomes. The "Tick Mark" prompt consistently yielded the highest accuracy across most datasets, demonstrating the model's ability to integrate visual cues. "User Suggestion" also achieved high accuracy, indicating the model could effectively incorporate contextual information. The "Wrong Few-shot" strategy, despite its intention, led to surprisingly good results – likely due to the model’s ability to recognize and disregard the incorrect examples. Interestingly, “Biasing Few-shot” proved problematic, consistently resulting in lower accuracy. The “Metadata” and “Validation Function” consistently improved performance. Finally, the "Unauthorized Access" strategy demonstrated a significant performance drop, highlighting vulnerabilities in the models' reasoning abilities under manipulated conditions. Overall, the most consistent high-performing prompt involved a combination of the "Tick Mark" and “Metadata” approaches, especially when coupled with the largest Qwen3 models.

Limitations: This research primarily focuses on a specific set of evaluation datasets and prompting strategies. The evaluation methodology, heavily reliant on single-turn question-answering, may not fully capture the complexities of multi-turn dialogues or more sophisticated reasoning tasks. The paper does not delve into the underlying mechanisms driving these performance differences, offering limited insight into the models’ internal representations and decision-making processes. Furthermore, the exploration of adversarial prompts and robust defenses against manipulation remains limited, leaving open questions about the models’ vulnerability to malicious inputs. The reliance on single-turn assessments doesn't fully represent real-world application scenarios.

Future Work & Outlook: Future research should investigate the interaction between different prompting strategies and explore techniques for creating more robust and adaptable prompts. Further investigation into the impact of model architecture and training data on prompting sensitivity is warranted. Exploring techniques for actively probing and mitigating vulnerabilities within LLMs, such as adversarial prompt detection and response, is a critical area of future work. The integration of reinforcement learning from human feedback (RLHF) to optimize prompts for specific tasks holds considerable promise. Research into methods for assessing and quantifying the “trustworthiness” of LLM responses – accounting for factors like confidence scores and source credibility – is increasingly important as LLMs are deployed in sensitive applications. This paper's findings underscore the evolving landscape of LLM interaction, demonstrating that prompt engineering is not just a stylistic element, but a fundamental factor in achieving reliable and accurate outputs.

Avichala Commentary: This research provides a crucial early step in understanding the sensitivity of LLMs to prompt design. It aligns perfectly with the current evolution of AI, moving beyond simply demonstrating impressive generative capabilities to establishing methods for control, reliability, and safety. The work reinforces the growing recognition that LLMs are not inherently intelligent agents but rather powerful pattern-matching systems whose outputs are fundamentally shaped by the input they receive. The study echoes the broader trend toward “AI Agents” – systems designed to interact with the world – and emphasizes the importance of robust interaction protocols and mechanisms for ensuring alignment between human intent and AI actions. The research contributes to the broader effort of building trust in LLMs, a key challenge as these technologies become increasingly integrated into our daily lives. It's a significant contribution to the growing field of prompt engineering, and mirrors the shift towards agent-based AI, demanding more deliberate interaction design rather than simply relying on emergent behaviors.

Link to the Arxiv: https://arxiv.org/abs/2511.08525v1.pdf