40 changes: 39 additions & 1 deletion skills/opencli-adapter-author/SKILL.md
@@ -28,6 +28,39 @@ allowed-tools: Bash(opencli:*), Read, Edit, Write, Grep

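## Top-level decision tree

**Decide the strategy first, then write the adapter.** Each time you reach Step 3/4, and before writing any code, you must produce a strategy note. Without that note, do not start writing `clis/<site>/<name>.js`.

The core question is not whether "an API is more advanced than the DOM", but **whether the data source has an external contract**. Measured maintenance cost shows: public/official APIs are the most stable; UI/DOM semantics usually also carry a user-visible contract; undocumented in-site XHR/GraphQL/signature endpoints drift most easily. Do not blindly migrate a stable UI/DOM implementation to a contract-less internal endpoint just to be "API-first".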

```md
Strategy: PUBLIC_API | COOKIE_API | PAGE_FETCH | INTERCEPT | DOM_STATE | UI_SELECTOR
Contract: stable | visible-ui | internal-unstable
Evidence:
- observed request/state: <endpoint / state global / UI-only signal>
- auth source: <none / browser cookie / csrf from meta / localStorage / page runtime>
- replay result: <status + content-type + non-empty sample shape>

If Strategy is PAGE_FETCH or INTERCEPT:
- why PUBLIC_API / COOKIE_API are unavailable:
- why UI_SELECTOR / DOM_STATE are not safer:
- why the maintenance cost is acceptable:
```

Strategy classes:

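| Strategy | Contract level | When to use | Evidence required |
|---|---|---|---|
| `PUBLIC_API` | stable | No login required; a Node-side `fetch` gets the target data directly | 200 + JSON/HTML containing the target data, not telemetry/ads |
| `COOKIE_API` | stable | Node-side `fetch` + `page.getCookies()` / header helper can get the data | cookie/CSRF source is clear; replay is non-empty |
| `UI_SELECTOR` | visible-ui | publish/upload/click/forms, or the page semantics are more stable than the internal endpoint | selector has a semantic anchor; the error path is a typed error |
| `DOM_STATE` | visible-ui | Data lives in hydration state / bootstrap JSON / SSR HTML | state key / script JSON / HTML structure is explicit |
| `PAGE_FETCH` | internal-unstable | Only a `fetch` in the page context can reuse same-origin/session/runtime | `opencli browser eval fetch(...)` is non-empty; must explain why the internal endpoint cannot be avoided |
| `INTERCEPT` | internal-unstable | Request signing is complex, but the page itself issues the request naturally | the target response can be captured after triggering the UI; must explain why UI/DOM is not enough |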

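Selection rule: prefer `PUBLIC_API` / `COOKIE_API`. If the UI/DOM semantics are stable, do not force an upgrade to `PAGE_FETCH` / `INTERCEPT`. Take on the maintenance cost of a contract-less internal endpoint only when public/official APIs are unavailable and the UI/DOM cannot express the target data or operation.

Boundary: only reuse data/capabilities that the page itself has already legitimately obtained. Do not teach signature cracking, and do not bypass CAPTCHAs / risk control / access control; when you hit a non-reusable signature (for example, one that must be generated by the page runtime and cannot be safely abstracted), fall back to `UI_SELECTOR` / `DOM_STATE` / `INTERCEPT`.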
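
The strategy note and the extra justification required for internal endpoints can be checked mechanically. The following is a minimal TypeScript sketch, not part of opencli: the `StrategyNote` shape, its field names, and `validateStrategyNote` are hypothetical, and only the strategy names, contract levels, and the three required "why" answers are taken from the template and table above.

```typescript
// Hypothetical types: only the strategy and contract names come from the skill text.
type Strategy =
  | "PUBLIC_API" | "COOKIE_API" | "PAGE_FETCH"
  | "INTERCEPT" | "DOM_STATE" | "UI_SELECTOR";
type Contract = "stable" | "visible-ui" | "internal-unstable";

// Contract level per strategy, copied from the strategy table.
const CONTRACT_OF: Record<Strategy, Contract> = {
  PUBLIC_API: "stable",
  COOKIE_API: "stable",
  UI_SELECTOR: "visible-ui",
  DOM_STATE: "visible-ui",
  PAGE_FETCH: "internal-unstable",
  INTERCEPT: "internal-unstable",
};

interface StrategyNote {
  strategy: Strategy;
  contract: Contract;
  evidence: { observed: string; authSource: string; replayResult: string };
  // Required only when the strategy is PAGE_FETCH or INTERCEPT.
  whyNoPublicOrCookieApi?: string;
  whyUiOrDomNotSafer?: string;
  whyCostAcceptable?: string;
}

// Returns a list of problems; an empty list means the note is complete.
function validateStrategyNote(note: StrategyNote): string[] {
  const problems: string[] = [];
  const expected = CONTRACT_OF[note.strategy];
  if (note.contract !== expected) {
    problems.push(`contract should be ${expected}`);
  }
  for (const [key, value] of Object.entries(note.evidence)) {
    if (value.trim() === "") problems.push(`evidence.${key} is empty`);
  }
  if (expected === "internal-unstable") {
    const required: [string, string | undefined][] = [
      ["whyNoPublicOrCookieApi", note.whyNoPublicOrCookieApi],
      ["whyUiOrDomNotSafer", note.whyUiOrDomNotSafer],
      ["whyCostAcceptable", note.whyCostAcceptable],
    ];
    for (const [name, text] of required) {
      if (!text || text.trim() === "") problems.push(`${name} is required`);
    }
  }
  return problems;
}
```

In this sketch a `DOM_STATE` or `UI_SELECTOR` note passes with evidence alone, while a `PAGE_FETCH` or `INTERCEPT` note is rejected until all three justifications are filled in, which mirrors the asymmetry the selection rule asks for.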

```
START
@@ -123,7 +156,12 @@ DONE
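[ ] 5. Fetch the candidate endpoint directly to verify:
    [ ] returns 200
    [ ] response contains the target data (not HTML / ads)
-[ ] 6. Decide the auth strategy: bare fetch works → PUBLIC; needs cookie / token / csrf → COOKIE; cannot obtain the signature → INTERCEPT; can only click the UI → UI
[ ] 6. Write the strategy note (a mandatory artifact before writing code):
    [ ] pick one of `PUBLIC_API / COOKIE_API / PAGE_FETCH / INTERCEPT / DOM_STATE / UI_SELECTOR`
    [ ] fill in Contract: `stable / visible-ui / internal-unstable`
    [ ] fill in Evidence: observed request/state, auth source, replay result
    [ ] if you pick `PAGE_FETCH` / `INTERCEPT`, you must explain why none of `PUBLIC_API` / `COOKIE_API` / `UI_SELECTOR` / `DOM_STATE` is suitable
    [ ] if you pick `UI_SELECTOR` / `DOM_STATE`, there is no need to over-justify "why not an API"; just state the semantic anchor and the typed error path
[ ] 7. Decode fields:
    [ ] self-explanatory → use the key directly
    [ ] known code → look it up in field-conventions.md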
9 changes: 9 additions & 0 deletions skills/opencli-adapter-author/references/api-discovery.md
@@ -59,6 +59,15 @@ opencli browser network

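Static assets / telemetry / tracking are filtered out by default. API-style responses such as JSON / XML / plain text / `text/javascript` are kept by default; if you are sure the target request shows up in browser DevTools but is missing here, rerun with `--all` to check whether it was dropped by the content-type or URL noise filter.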

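On a cold start, first look at `api_candidates` in the output of `opencli browser analyze <url>`:

- `verdict: "likely_data"`: replay this one first, and use its status / content-type / sample shape to fill in the strategy note
- `verdict: "maybe_data"`: worth trying, but you must manually check that the fields are the target business data
- `verdict: "noise"`: most likely analytics / beacon / personalization; do not conclude Pattern A just because there are many XHRs
- `verdict: "blocked"`: 401/403; rule out cookie / token / CSRF problems first instead of falling straight back to selectors

`real_data_score` is evidence, not an automatic strategy. The final strategy note must still record the replay result and the reason for any fallback.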
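
The four verdicts map onto four different next steps, which can be sketched in TypeScript. This is an illustration under assumptions: the `ApiCandidate` shape (in particular the `url` field), the `NextStep` labels, and both helper functions are hypothetical and are not opencli APIs; only the `verdict` values and the `real_data_score` field name come from the text above.

```typescript
// Hypothetical candidate shape; the real `api_candidates` entries may carry more fields.
type Verdict = "likely_data" | "maybe_data" | "noise" | "blocked";
interface ApiCandidate {
  url: string; // assumed field name
  verdict: Verdict;
  real_data_score: number;
}

// Hypothetical labels for the action each verdict calls for.
type NextStep = "replay" | "replay_and_review_fields" | "ignore" | "check_auth";

function nextStepFor(candidate: ApiCandidate): NextStep {
  switch (candidate.verdict) {
    case "likely_data":
      return "replay"; // replay, then copy status / content-type / shape into the note
    case "maybe_data":
      return "replay_and_review_fields"; // a human still confirms the fields
    case "noise":
      return "ignore"; // analytics / beacon / personalization
    case "blocked":
      return "check_auth"; // 401/403: check cookie / token / CSRF first
  }
}

// Orders the candidates worth replaying: likely_data before maybe_data,
// higher score first within the same verdict. The score only orders the
// replays; it does not choose the strategy.
function replayQueue(candidates: ApiCandidate[]): ApiCandidate[] {
  const rank: Record<Verdict, number> = {
    likely_data: 0,
    maybe_data: 1,
    blocked: 2,
    noise: 3,
  };
  return candidates
    .filter((c) => c.verdict === "likely_data" || c.verdict === "maybe_data")
    .sort(
      (a, b) =>
        rank[a.verdict] - rank[b.verdict] ||
        b.real_data_score - a.real_data_score,
    );
}
```

Keeping `blocked` out of the replay queue but routing it to an auth check, rather than to `ignore`, reflects the advice not to fall back to selectors before the cookie / token / CSRF path has been ruled out.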

### Initial screening by shape

Pick the entries whose `key` contains a business term (`list / detail / Timeline / User / Tweets / Quote`) and look at their `shape` first:
55 changes: 53 additions & 2 deletions src/browser/analyze.test.ts
@@ -4,6 +4,7 @@ import {
  detectAntiBot,
  classifyPattern,
  findNearestAdapter,
  scoreEndpointEvidence,
  type PageSignals,
} from './analyze.js';
import type { CliCommand } from '../registry.js';
@@ -87,13 +88,29 @@ describe('classifyPattern', () => {
    const v = classifyPattern(
      mkSignals({
        networkEntries: [
-         { url: 'https://x.com/api/a', status: 200, contentType: 'application/json', bodyPreview: '{}' },
-         { url: 'https://x.com/api/b', status: 200, contentType: 'application/json;charset=utf-8', bodyPreview: '{}' },
          { url: 'https://x.com/api/a', status: 200, contentType: 'application/json', bodyPreview: '{"items":[{"title":"A","id":"1"}]}' },
          { url: 'https://x.com/api/b', status: 200, contentType: 'application/json;charset=utf-8', bodyPreview: '{"data":{"results":[{"name":"B","url":"/b"}]}}' },
        ],
      }),
    );
    expect(v.pattern).toBe('A');
    expect(v.json_responses).toBe(2);
    expect(v.real_data_candidates).toBe(2);
  });

  it('does not call analytics JSON a real API pattern', () => {
    const v = classifyPattern(
      mkSignals({
        networkEntries: [
          { url: 'https://x.com/analytics/collect', status: 200, contentType: 'application/json', bodyPreview: '{"event":"view","clientId":"abc","experiment":"A"}' },
          { url: 'https://x.com/personalization', status: 200, contentType: 'application/json', bodyPreview: '{"sessionId":"s1","metrics":{"latency":12}}' },
        ],
      }),
    );
    expect(v.pattern).toBe('C');
    expect(v.json_responses).toBe(2);
    expect(v.real_data_candidates).toBe(0);
    expect(v.reason).toMatch(/telemetry|side-channel/);
  });

  it('returns B when __INITIAL_STATE__ is present, beating JSON signals', () => {
@@ -127,6 +144,40 @@ describe('classifyPattern', () => {
  });
});

describe('scoreEndpointEvidence', () => {
  it('scores non-empty business JSON above telemetry side-channel JSON', () => {
    const data = scoreEndpointEvidence({
      url: 'https://x.com/api/search',
      status: 200,
      contentType: 'application/json',
      bodyPreview: '{"data":{"items":[{"title":"A","price":12,"url":"/a"}],"total":1}}',
    });
    const telemetry = scoreEndpointEvidence({
      url: 'https://x.com/analytics/collect',
      status: 200,
      contentType: 'application/json',
      bodyPreview: '{"event":"view","clientId":"abc"}',
    });

    expect(data.verdict).toBe('likely_data');
    expect(data.real_data_score).toBeGreaterThan(telemetry.real_data_score);
    expect(data.sample_paths).toContain('$.data.items:array(1)');
    expect(telemetry.verdict).toBe('noise');
  });

  it('marks auth-gated JSON as blocked rather than data', () => {
    const evidence = scoreEndpointEvidence({
      url: 'https://x.com/api/private',
      status: 403,
      contentType: 'application/json',
      bodyPreview: '{"error":"forbidden"}',
    });

    expect(evidence.verdict).toBe('blocked');
    expect(evidence.real_data_score).toBeLessThan(0.1);
  });
});

describe('findNearestAdapter', () => {
  it('matches by domain suffix', () => {
    const reg = new Map<string, CliCommand>([