A "Moon Landing Project" ReAct engine, built in the gaps between four coding tests on a Saturday.
It mainly includes:
- Single ReAct agent execution engine
- Observability
- A simple visualization and analysis tool (to be improved later); implemented features are listed in todo.md.
Future plans:
- Improve observability
- Enhance visualization analysis tools
- Refine the schemes for detecting and handling circular thinking
- Improve test sets
- Enhance Evaluator & optimizer for self-improvement
- Improve distributed multi-agent architecture
- Enhance the various context management mechanisms
Run: `python ReActAgent.py`
Input/output references:
- ReAct Example
- ReAct with FS, webSearch, weather, datetime, python sandbox
- ReAct with observer and logger
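The observer/logger example above can be connected to the engine with a simple callback hook; this is a hedged sketch with hypothetical names (the real ReActAgent.py observer interface is not shown here).

```python
# Sketch of an observability hook (hypothetical interface, not the real one):
# an observer receives structured events from each phase of a ReAct step.

class Logger:
    """Collects (event, payload) pairs; a real observer might write JSON lines."""
    def __init__(self):
        self.events = []

    def emit(self, event, payload):
        self.events.append((event, payload))

def observed_step(observer, tool, args):
    observer.emit("action", {"tool": tool.__name__, "args": args})
    result = tool(**args)                     # run the tool
    observer.emit("observation", {"result": result})
    return result

def weather(city):
    return "Sunny"

log = Logger()
observed_step(log, weather, {"city": "Beijing"})
print([e for e, _ in log.events])  # ['action', 'observation']
```

Keeping events structured (rather than free-text log lines) is what makes the later visualization and analysis tooling feasible.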
Circular thinking manifests in two ways:
- Circular thinking based on LLM output patterns:
- Advantages: Natural, no additional training required
- Disadvantages: More cycles, time-consuming
- Circular thinking based on GIT-PLAN tools:
- Advantages: Fewer cycles, shorter time consumption
- Disadvantages: Dependent on additional training, uncontrollable number of cycles
For these two types of circular thinking, detection and handling schemes are proposed respectively:
- Detection scheme for circular thinking based on LLM output patterns:
- Detection method: Analyze circular references, repeated statements, repeated tool parameters, etc., in LLM output.
- Handling schemes:
- Ensure accurate feedback on the tool side. For example, a complex Python sandbox error that the model cannot understand may trap it in persistent error-correction attempts.
- Ensure accurate prompts on the model side. For example, the system prompt should keep the model from spending large numbers of tokens on persistent error correction, and instead activate its reasoning ability.
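The pattern-based detection method above can be sketched with a simple heuristic. This is an illustrative assumption, not the project's actual detector: flag a loop when an identical (tool, arguments) pair recurs within a sliding window of recent calls.

```python
# Sketch of pattern-based loop detection (assumed heuristic, not the project's exact code):
# flag circular thinking when the same (tool, arguments) pair repeats within a window.
from collections import Counter

def detect_repetition(calls, window=6, threshold=3):
    """calls: list of hashable (tool_name, args) tuples, oldest first.
    Returns True if any identical call occurs `threshold`+ times in the last `window` calls."""
    counts = Counter(calls[-window:])
    return any(n >= threshold for n in counts.values())

# Same sandbox call issued three times in a row -> likely a stuck error-correction loop.
stuck = [("python", ("code", "print(x)"))] * 3
varied = [("search", ("q", "moon")), ("weather", ("city", "Beijing")), ("python", ("code", "1+1"))]
print(detect_repetition(stuck), detect_repetition(varied))
```

Repeated statements in free-text output can be caught the same way by hashing normalized sentences instead of tool calls.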
- Detection scheme for circular thinking based on GIT-PLAN tools:
- Detection method: Analyze git diff and topology in GIT-PLAN tools output.
- Handling scheme: Adopt structured WBS for task decomposition and status marking to avoid circular dependencies.
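Topology analysis for the WBS handling scheme can be illustrated with standard graph cycle detection; the task names and the dependency-map shape below are assumptions, since the actual GIT-PLAN output format is not shown here.

```python
# Hedged sketch: detect circular dependencies in a WBS task graph via DFS coloring.
# deps maps each task to the tasks it depends on (format assumed for illustration).

def has_cycle(deps):
    """Return True if the dependency graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {t: WHITE for t in deps}

    def visit(t):
        color[t] = GRAY
        for d in deps.get(t, []):
            c = color.get(d, WHITE)
            if c == GRAY:                 # back edge: d is on the current path
                return True
            if c == WHITE and visit(d):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(deps))

acyclic = {"plan": [], "design": ["plan"], "build": ["design"]}
cyclic = {"a": ["b"], "b": ["a"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```

A plan that fails this check should be re-decomposed before execution rather than executed and detected mid-loop.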
Three evaluation indicators:
- Task completion quality:
- Definition: How well a task is completed, covering degree of completion, output quality, and efficiency.
- Indicators: Task completion rate, task quality, task efficiency, etc.
- Thinking quality:
- Definition: Whether the model maintains logical coherence while generating text, avoiding logical errors or inconsistencies.
- Indicator: Analysis of plan update status
- Data quality:
- Definition: Evaluation of the quality of observable data and factual data, including data completeness, accuracy, consistency, etc.
- Indicators: Missing value ratio, outlier ratio, data duplication ratio, etc.
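The listed data-quality indicators are straightforward to compute; a minimal sketch on plain Python lists (the thresholds and field shape are illustrative assumptions):

```python
# Illustrative computation of the data-quality indicators named above:
# missing value ratio, outlier ratio, and data duplication ratio.

def quality_report(values, low, high):
    """values may contain None for missing entries; [low, high] is the valid range."""
    n = len(values)
    missing = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    outliers = sum(not (low <= v <= high) for v in present)
    duplicates = n - len(set(values))          # extra copies beyond the first
    return {
        "missing_ratio": missing / n,
        "outlier_ratio": outliers / n,
        "duplicate_ratio": duplicates / n,
    }

print(quality_report([1, 2, 2, None, 99], low=0, high=10))
# {'missing_ratio': 0.2, 'outlier_ratio': 0.2, 'duplicate_ratio': 0.2}
```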
Several dimensions related to token efficiency:
- Multi/single agent architecture
- Model selection
- Prompt engineering
- Tool design
- Task plan decomposition and update methods
- Context compression
- Context window size
At this stage, appropriate dimensions are selected for optimization according to specific scenarios.
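As one concrete example among these dimensions, context compression can be as simple as keeping the system prompt plus the newest turns that fit a token budget. This is a minimal sketch with an assumed message format; the whitespace-split token count stands in for a real tokenizer.

```python
# Simple context compression sketch: retain the system prompt and as many of the
# most recent turns as fit the budget. Token counting is a naive whitespace split
# (an assumption for illustration, not a real tokenizer).

def compress_context(messages, budget):
    """messages: list of {'role': ..., 'content': ...} dicts, oldest first."""
    def tokens(m):
        return len(m["content"].split())

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(tokens(m) for m in system)
    kept = []
    for m in reversed(rest):                  # walk newest-first
        if used + tokens(m) > budget:
            break
        kept.append(m)
        used += tokens(m)
    return system + list(reversed(kept))      # restore chronological order

msgs = [
    {"role": "system", "content": "You are a ReAct agent"},
    {"role": "user", "content": "old question about the moon"},
    {"role": "assistant", "content": "old answer"},
    {"role": "user", "content": "new question"},
]
print([m["content"] for m in compress_context(msgs, budget=10)])
```

Summarizing dropped turns instead of discarding them is the natural next step, at the cost of an extra model call.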
Prompts are divided into general and vertical.
Vertical prompts:
- Definition: Customized prompts for specific task scenarios.
- Advantages:
- Customization: prompts tailored to a specific task scenario meet its requirements better.
- Efficiency: Customized prompts can reduce the number of model inferences and improve efficiency.
- Disadvantages:
- Cost: customized prompts require scenario-specific SOPs, which raises the cost of accumulating experience.
- Disorder: on general problems, vertical-domain prompts can disorder the model's output and make its context inconsistent, leaving it more prone to thinking errors such as:
- Infinite loop errors
- Function call errors
- Mitigation through prompt design: termination tokens
- Mitigation through prompt design: hand-off tokens
- Detection of no tool calls:
- Definition: detecting that the model makes no tool call in a loop iteration, indicating that the task cannot be completed.
- Handling scheme: Terminate the ReAct loop and return the current model output.
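Putting the no-tool-call rule together with the execution loop gives a sketch like the following. All names are hypothetical, not the actual ReActAgent.py API; the model is a scripted stub that makes one tool call and then answers without one, which triggers termination.

```python
# Minimal ReAct-style loop with no-tool-call termination (hypothetical names,
# not the real ReActAgent.py API). A turn with no tool call ends the loop and
# its text is returned as the final output.

def run_react(llm, tools, question, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm(history)                      # model proposes the next turn
        if not step.get("tool_calls"):           # no tool call: terminate the loop
            return step["content"]
        for name, args in step["tool_calls"]:
            observation = tools[name](**args)    # execute each requested tool
            history.append(f"Observation: {observation}")
    return None                                  # step budget exhausted

# Scripted stub model: call the weather tool once, then answer with no tool call.
def stub_llm(history):
    if not any(h.startswith("Observation") for h in history):
        return {"content": "", "tool_calls": [("weather", {"city": "Beijing"})]}
    return {"content": "It is sunny in Beijing.", "tool_calls": []}

tools = {"weather": lambda city: "Sunny"}
print(run_react(stub_llm, tools, "Weather in Beijing?"))  # It is sunny in Beijing.
```

The `max_steps` budget is the backstop for the circular-thinking cases above: even if loop detection misses, the engine still terminates.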