Add loss analysis pipeline for autorater CSVs #1
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def call_openai(model: str, prompt: str) -> str:
    if openai is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=0.27.0 or run with --dry-run."
        )
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]
```
Use OpenAI v1 client instead of removed ChatCompletion API
The dependency instructions recommend pip install openai, which today installs the v1+ SDK. In that version the legacy openai.ChatCompletion.create entry point is removed, so call_openai raises AttributeError before any work is done. Either pin the older 0.x series or migrate to the new OpenAI() client (client.chat.completions.create). As written the script crashes immediately for anyone following the README.
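A minimal sketch of the suggested migration, assuming the script keeps its optional-import guard and that `OPENAI_API_KEY` is set in the environment (the v1 client reads it automatically); everything else matches the existing signature:

```python
# Hedged sketch: call_openai migrated to the OpenAI v1+ SDK.
# Assumes the script's existing optional-import pattern is kept.
try:
    from openai import OpenAI  # v1+ SDK exposes a client class, not module-level calls
except ImportError:
    OpenAI = None


def call_openai(model: str, prompt: str) -> str:
    if OpenAI is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=1.0 or run with --dry-run."
        )
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    # v1 responses are typed objects, not dicts
    return response.choices[0].message.content
```

The `--dry-run` escape hatch in the error message is carried over from the original; only the import, the client construction, and the response accessors change.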
```python
def build_prompt(categories: Sequence[Category], rows: pd.DataFrame) -> str:
    category_lines = [
        f"- {cat.name}: {cat.description}" if cat.description else f"- {cat.name}"
        for cat in categories
    ]
    rows_payload = rows[
        [
            "task_id",
            "step_index",
            "website_issue",
            "screenshot_description_correct",
            "thought_reasonable",
            "action_matches_thought",
            "incorrect_coordinates",
            "issue_summary",
            "autorater_failure",
            "g_lab_url",
        ]
    ]
    serialised_rows = json.dumps(rows_payload.to_dict(orient="records"), ensure_ascii=False)
    prompt = textwrap.dedent(
        f"""
        You are an expert analyst labelling loss types for a web browsing agent.
        You will receive {len(rows)} log entries, each corresponding to an agent
        step that contained an issue. For each entry choose exactly one category
        from the list provided below and return a JSON array with {len(rows)}
        objects in the same order. Each object must contain the keys
        "task_id", "step_index", "category", and "explanation". The
        explanation should be a short (<=20 word) reason for your choice.

        Categories:
        {os.linesep.join(category_lines)}

        Input rows (JSON array):
        {serialised_rows}
        """
    ).strip()
    return prompt


def call_openai(model: str, prompt: str) -> str:
    if openai is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=0.27.0 or run with --dry-run."
        )
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]


def parse_model_output(text: str, expected_len: int) -> List[Dict[str, str]]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}\n{text}") from exc
```
Prompt allows non-JSON wrapping that parse_model_output cannot handle
parse_model_output feeds the model reply straight into json.loads, but the prompt in build_prompt never instructs the model to emit raw JSON only. Chat completions commonly wrap results in prose or a Markdown ```json code fence, which makes json.loads raise and terminate the pipeline even when the model classified every row correctly. Either the prompt should explicitly require a pure JSON array with no surrounding text, or the parser should strip code fences before decoding.
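One way to harden the parser side is a small pre-processing step. This is a sketch, not the script's code: the helper name `strip_code_fence` is hypothetical, and it simply removes an optional ```json fence before json.loads sees the text.

```python
import json
import re


def strip_code_fence(text: str) -> str:
    """Remove an enclosing Markdown code fence (```json ... ```), if present.

    Hypothetical helper: parse_model_output could call this on the raw
    model reply before handing it to json.loads.
    """
    match = re.match(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$", text, re.DOTALL)
    return match.group(1) if match else text.strip()
```

With this in place, `data = json.loads(strip_code_fence(text))` accepts both bare JSON and fenced replies; pairing it with an explicit "return only a JSON array, no other text" instruction in the prompt covers both ends.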
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68e5b87712f883329ae943bce47a9a38