Add loss analysis pipeline for autorater CSVs#1

Open
mquirosbloch wants to merge 1 commit into master from
codex/build-data-insights-generator-for-loss-analysis

Conversation

@mquirosbloch
Owner

Summary

  • add a loss analysis script that batches autorater CSV rows and classifies them with an OpenAI chat model
  • aggregate category counts across models and generate a comparison chart
  • document usage requirements and workflow in the repository README

Testing

  • python loss_analysis.py --help (fails: ModuleNotFoundError: No module named 'matplotlib')

https://chatgpt.com/codex/tasks/task_e_68e5b87712f883329ae943bce47a9a38

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines +138 to +154
```python
def call_openai(model: str, prompt: str) -> str:
    if openai is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=0.27.0 or run with --dry-run."
        )
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]
```

P1: Use OpenAI v1 client instead of removed ChatCompletion API

The dependency instructions recommend pip install openai, which today installs the v1+ SDK. In that version the legacy openai.ChatCompletion.create entry point is removed, so call_openai raises AttributeError before any work is done. Either pin the older 0.x series or migrate to the new OpenAI() client (client.chat.completions.create). As written the script crashes immediately for anyone following the README.
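A minimal sketch of the suggested migration, assuming the v1 SDK: the module-level `openai.ChatCompletion` is replaced by an `OpenAI()` client. The function name and prompts mirror the snippet under review; this is illustrative, not the PR's code.

```python
# Sketch of the v1-client migration the comment suggests (assumed, not from the PR).
try:
    from openai import OpenAI  # openai>=1.0
except ImportError:  # keep the script importable without the SDK, like the PR's guard
    OpenAI = None


def call_openai(model: str, prompt: str) -> str:
    if OpenAI is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=1.0 or run with --dry-run."
        )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```

Alternatively, pinning `openai<1.0` in the README keeps the legacy call working, at the cost of staying on an unsupported SDK.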

Useful? React with 👍 / 👎.

Comment on lines +98 to +161
```python
def build_prompt(categories: Sequence[Category], rows: pd.DataFrame) -> str:
    category_lines = [
        f"- {cat.name}: {cat.description}" if cat.description else f"- {cat.name}"
        for cat in categories
    ]
    rows_payload = rows[
        [
            "task_id",
            "step_index",
            "website_issue",
            "screenshot_description_correct",
            "thought_reasonable",
            "action_matches_thought",
            "incorrect_coordinates",
            "issue_summary",
            "autorater_failure",
            "g_lab_url",
        ]
    ]
    serialised_rows = json.dumps(rows_payload.to_dict(orient="records"), ensure_ascii=False)
    prompt = textwrap.dedent(
        f"""
        You are an expert analyst labelling loss types for a web browsing agent.
        You will receive {len(rows)} log entries, each corresponding to an agent
        step that contained an issue. For each entry choose exactly one category
        from the list provided below and return a JSON array with {len(rows)}
        objects in the same order. Each object must contain the keys
        "task_id", "step_index", "category", and "explanation". The
        explanation should be a short (<=20 word) reason for your choice.

        Categories:
        {os.linesep.join(category_lines)}

        Input rows (JSON array):
        {serialised_rows}
        """
    ).strip()
    return prompt


def call_openai(model: str, prompt: str) -> str:
    if openai is None:
        raise RuntimeError(
            "openai package is not installed. Install openai>=0.27.0 or run with --dry-run."
        )
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": "You classify agent errors into predefined categories.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]


def parse_model_output(text: str, expected_len: int) -> List[Dict[str, str]]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}\n{text}") from exc
```

P1: Prompt allows non‑JSON wrapping that parse_model_output cannot handle

parse_model_output blindly feeds the model reply into json.loads, but the prompt in build_prompt never instructs the model to emit raw JSON only. Common chat completions wrap results in prose or a ```json code fence, which will cause json.loads to raise and terminate the pipeline even though the model classified correctly. The prompt should explicitly require a pure JSON array with no extra text or the parser should strip code fences before decoding.
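One way to harden the parser along the lines the comment suggests: strip a surrounding Markdown code fence before decoding, and validate the array length. This is a sketch; the `strip_code_fence` helper and `_FENCE_RE` pattern are hypothetical, not part of the PR.

```python
import json
import re
from typing import Dict, List

# Matches a leading ``` or ```json fence and a trailing ``` fence (assumed helper).
_FENCE_RE = re.compile(r"^```(?:json)?\s*|\s*```$")


def strip_code_fence(text: str) -> str:
    """Remove a surrounding ```json code fence, if the model added one."""
    return _FENCE_RE.sub("", text.strip())


def parse_model_output(text: str, expected_len: int) -> List[Dict[str, str]]:
    cleaned = strip_code_fence(text)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}\n{text}") from exc
    if not isinstance(data, list) or len(data) != expected_len:
        raise ValueError(f"Expected a JSON array of {expected_len} objects, got: {cleaned[:200]}")
    return data
```

The prompt-side fix is complementary: appending a line such as "Return only the JSON array, with no prose or code fences." to `build_prompt` reduces, but does not eliminate, the chance of wrapped output, so stripping fences defensively is still worthwhile.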

Useful? React with 👍 / 👎.
