This is my research area. I just finished reviewing six NeurIPS papers (myself, no LLM involved) on LLM Agents for discovery and generation and I'm finding that evaluating LLM agents on raw performance for a task isn't as insightful anymore -- every paper is claiming a state-of-the-art 10x performance boost by {insert random acronym that devolves into combinatorial search}. Rather, the true test for such algorithms is whether the empirical scaling curves for these algorithms are more computationally amenable than an existing baseline search algorithm (like CoT).
Three motivating points:
- GEPA / evolutionary agents are performing a zeroth-order (no gradient) optimization in a combinatorial space. Their loss curves are VERY noisy and stochastic. If we run such agents multiple times, the performance variance is extremely high -- and in some cases cancels out the gains from a single experiment. However, obtaining the error bounds is hard because the API costs are pretty restrictive (see the sketch after this list).
- The problem we face with test time scaling is not that prompt engineering is ineffective/less effective than fine-tuning. It is that fine-tuning _reliably_ increases a model's performance on any subset of tasks, and the scaling curves for performance per additional data token are well understood.
- Test time optimization techniques work well on in-distribution problems (e.g. generate and debug this Python code) but fail pretty badly on even slightly out of distribution problems (e.g. generate and debug this Julia code). Compare this to gradient search -- it would've been so fascinating and confusing if SGD failed to optimize a CNN image classifier on COCO but worked very well on ImageNet.
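To make the variance point concrete, here's a minimal sketch of the error bars being asked for; run_optimizer is a hypothetical stand-in for one full GEPA / evolutionary search run, not any specific implementation:

    import statistics

    def score_with_error_bars(run_optimizer, seeds=range(5)):
        # run_optimizer(seed) -> held-out accuracy after one full search run.
        # Repeating the whole search is what makes error bars expensive:
        # every repeat is another full API bill.
        scores = [run_optimizer(seed) for seed in seeds]
        mean = statistics.mean(scores)
        stdev = statistics.stdev(scores)
        # Rough 95% interval on the mean; with this much run-to-run spread,
        # a "10x" single-run gain can sit entirely inside the interval.
        half_width = 1.96 * stdev / len(seeds) ** 0.5
        return mean, (mean - half_width, mean + half_width)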
How do people feel about this? Does this line up with your viewpoints?
mostly aligned on this. couple of thoughts:
- raw accuracy is now a "vanity" metric. so the benchmarks need to get more sophisticated, and i think they're going to have to be far more task specific than hotpot or hover. they've become like the mnist of multi hop.
- in my use of MIPROv2 and SIMBA, I see a fair amount of improvement for multi hop tasks (published some of these on hn before). I'm going to try GEPA and see how it performs (rough sketch of this kind of run after this list). so I think we're at the start of what I would call "meta learning".. tuning across a huge search surface rather than tweaking one prompt. hyper param search for higher dim spaces.
- tokens burned should be a reported result
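for reference, here's roughly what one of those runs looks like in DSPy. this is a sketch under assumptions -- the model name, metric, and toy trainset are placeholders, not the setup behind the numbers above:

    import dspy

    # placeholder LM -- swap in whatever model/provider you actually use
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    program = dspy.ChainOfThought("question -> answer")

    def exact_match(example, pred, trace=None):
        return example.answer.strip().lower() == pred.answer.strip().lower()

    trainset = [
        dspy.Example(question="Who wrote Dune?", answer="Frank Herbert").with_inputs("question"),
        # ... more labeled examples
    ]

    # MIPROv2 searches jointly over instructions and few-shot demos;
    # the "auto" presets trade search budget (tokens burned) for quality.
    optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
    optimized_program = optimizer.compile(program, trainset=trainset)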
I can't comment on your detailed knowledge of the state of the art, but your points resonate (particularly because I have tried to generate Julia and Lean code).
So, as with any less informed user reviewing LLM output, what you say definitely sounds plausible and correct.
Do the problems you highlighted still appear with higher quality training data?
To me that's a full circle back to lessons learned from neural networks - I long suspected we'd be heading that way again, with transformers being a computational step rather than the whole thing.
anyone working on a dspy optimizer for this?
This is a DSPy optimizer, built by the DSPy core team. Just wait for it to be open sourced.
"okhat" as a way to shorten your name gave me a good laugh
they’ve already written one! see omar’s x account for details!
Here's a link to a repost Omar made referencing it: https://x.com/DSPyOSS/status/1950733300420510006
These models / nets / whatever are much "smarter" (loaded term) than we think; we just don't know how to plug in properly yet.
"We are not interested in the fact that the brain has the consistency of cold porridge." — Alan Turing
I guess Turing couldn’t see the trillions of base pairs of DNA, complex methylation states, dendritic spines of the neurons, etc., just for starters.
This is not how science and engineering work, and an arXiv preprint should not be taken at face value.
It has been commonly observed that the current crop of LLMs can be too agreeable/sycophantic (or on some topics, too disagreeable) due to the commonly chosen RLHF priorities.
By simply asking the LLM the same question in two separate contexts but from opposing perspectives, then in a third context asking it to analyze both responses and choose the most neutral and objective take, you wipe out any "(dis)agreeableness" bias and get closer to a deeper, more nuanced synthesis of a given topic. This paper is just taking this idea to the next level.
This isn't really possible with RLHF alone unless you train the LLM to often give two opposing perspectives, which would get tiring.
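The recipe is easy to sketch; ask below is a hypothetical single-turn LLM call, with each call running in a fresh context so the stances don't leak into each other:

    def debiased_answer(ask, question: str) -> str:
        # two separate contexts, opposing framings of the same question
        pro = ask(f"Argue the strongest case FOR the following position:\n{question}")
        con = ask(f"Argue the strongest case AGAINST the following position:\n{question}")
        # third, independent context: the model judges two anonymous analysts
        # rather than the user, which cancels most of the (dis)agreeableness bias
        return ask(
            "Two analysts answered the same question from opposing perspectives.\n"
            f"Analyst A:\n{pro}\n\nAnalyst B:\n{con}\n\n"
            "Write the most neutral, objective synthesis of the two."
        )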
Looking at a problem from various perspectives, even posing ideas, is exactly what reasoning models seem to simulate in their thinking CoT to explore the solution space with optimizations like MCMC etc.
A sufficiently capable AI would be able to plug itself in properly too.
One more reason to be wary of pushing for better capabilities.
Is the name meant to be a jab at you know who or am I reading too much into it?
Why would they want to take a jab at Francis Kojo Kwarteng Arthur (Esq), the CEO of Ghana Export Promotion Authority?
I have no idea who you mean. Why not just write it?
V-JEPA models, LeCun's approach towards world models, which has been derided by a lot of naysayers (personally I think that's the way to go).
And then you use self-distillation to wire the improved prompts back into the LLM. Bam, free metacognitive skills.
Self-distillation generally refers to training a smaller model, right? I suppose for full metacognition you would use it to fine-tune the existing model based on its older self?
No, training a smaller model off a more capable larger model (or an ensemble of models) is the "usual" distillation.
"Self-distillation" refers to distilling from a model into a copy of itself. Which is of limited use - unless you can steer the teacher, and want the student to internalize that steering.
The reason for doing self-distillation here is that we both have access to a richer representation (the logit stream) and want to capture a richer behavior - not the answers themselves, but the better reasoning techniques that are downstream from better prompts.
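A minimal sketch of that idea, assuming a Hugging Face-style causal LM (an illustration, not the paper's actual recipe): a frozen copy of the model sees the optimized prompt and acts as teacher, the student sees the plain input, and a KL term on the next-token logits transfers the prompt-induced behavior.

    import torch
    import torch.nn.functional as F

    def self_distill_step(student, frozen_teacher, plain_ids, prompted_ids, opt, T=2.0):
        # teacher: same base model, frozen, fed the improved prompt
        with torch.no_grad():
            teacher_logits = frozen_teacher(prompted_ids).logits[:, -1, :]
        # student: same model, trainable, fed only the plain input
        student_logits = student(plain_ids).logits[:, -1, :]
        # standard temperature-scaled distillation loss on the logit stream
        loss = T * T * F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()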
self distillation and mutual distillation are used in MoE models. What you can do is freeze all but one expert and then train the model. If you want to do it again, you have to do self/mutual distillation to spread the training result onto the other experts.
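Something like this for the freezing step (toy sketch, assuming the experts live in an nn.ModuleList); the later self/mutual-distillation pass would then spread what the one trained expert learned back onto its frozen siblings:

    import torch.nn as nn

    def freeze_all_but_one_expert(experts: nn.ModuleList, trainable_idx: int):
        # only the chosen expert receives gradients during the next training run
        for i, expert in enumerate(experts):
            for p in expert.parameters():
                p.requires_grad = (i == trainable_idx)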
This summary article is LLM-authored [1], but it does seem to make sense. HN folks apparently agree, with 58 points for the submission so far.
1: https://arxiviq.substack.com/p/coming-soon
> ArXivIQ exists to turn that fire-hose into jet-streams of insight: every paper is hand-picked by a human editor, then pushed through a purpose-built multi-agent AI pipeline that dissects methods, experiments, and limitations in minutes instead of hours.