Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions (#1) · Issues · Maribel Diehl / langdonconsulting

Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment investigating how DeepSeek-R1 carries out on agentic tasks, regardless of not supporting tool use natively, and I was quite pleased by preliminary outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not just plans the actions but also formulates the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% proper, and other designs by an even bigger margin:

The experiment followed design usage standards from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, avoid including a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can discover additional examination details here.

Approach

DeepSeek-R1's strong coding capabilities enable it to serve as an agent without being explicitly trained for tool use. By permitting the design to produce actions as Python code, it can flexibly interact with environments through code execution.

Tools are executed as Python code that is included straight in the timely. This can be a basic function definition or a module of a bigger plan - any valid Python code. The design then generates code actions that call these tools.

Arise from executing these actions feed back to the design as follow-up messages, driving the next actions till a last answer is reached. The agent structure is an easy iterative coding loop that moderates the discussion between the design and its environment.

Conversations

DeepSeek-R1 is utilized as chat model in my experiment, where the model autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing a search engine or fetching data from web pages. This drives the conversation with the environment that continues up until a last answer is reached.

On the other hand, o1 designs are understood to carry out badly when utilized as chat models i.e. they do not attempt to pull context throughout a discussion. According to the connected article, o1 models carry out best when they have the full context available, with clear instructions on what to do with it.

Initially, I also tried a complete context in a single timely approach at each step (with arise from previous actions consisted of), however this led to considerably lower scores on the GAIA subset. Switching to the conversational method explained above, I was able to reach the reported 65.6% performance.

This raises an interesting concern about the claim that o1 isn't a chat model - perhaps this observation was more appropriate to older o1 models that did not have tool usage capabilities? After all, isn't tool usage support a crucial system for making it possible for designs to pull extra context from their environment? This conversational method certainly appears effective for DeepSeek-R1, though I still require to conduct comparable try outs o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, it is impressive that generalization to agentic tasks with tool use by means of code actions works so well. This ability to generalize to agentic jobs reminds of recent research by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't investigated because work.

Despite its ability to generalize to tool usage, DeepSeek-R1 often produces long thinking traces at each step, compared to other models in my experiments, restricting the effectiveness of this design in a single-agent setup. Even simpler jobs often take a long period of time to complete. Further RL on agentic tool usage, be it via code actions or not, might be one choice to enhance effectiveness.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning model frequently changes in between different reasoning ideas without adequately exploring promising paths to reach a right solution. This was a significant reason for overly long reasoning traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for funsilo.date download.

Future experiments

Another common application of reasoning designs is to utilize them for planning just, while using other models for generating code actions. This could be a potential brand-new function of freeact, if this separation of roles shows beneficial for more complex jobs.

I'm likewise curious about how reasoning models that currently support tool use (like o1, o3, ...) carry out in a single-agent setup, with and without creating code actions. Recent advancements like OpenAI's Deep Research or open-source Deep Research, which also uses code actions, look fascinating.