awin: document advertiser dashboard scraping #349
Open
CitizenZM wants to merge 1 commit into browser-use:main from CitizenZM:add-awin-domain-skill
+202
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| # Awin (app.awin.com) — Advertiser dashboard scraping | ||
|
|
||
| Awin's advertiser dashboard is a Vue/React SPA hosted on `app.awin.com`, with auth on `id.awin.com`. KPI tiles, charts, and tables render asynchronously after the SPA boots, behind a cookie banner that blocks lazy load until dismissed. | ||
|
|
||
| ## URL patterns | ||
|
|
||
| | Page | URL | | ||
| |---|---| | ||
| | Login | `https://app.awin.com/login` (redirects to `id.awin.com/u/login/identifier?...`) | | ||
| | User home (account picker) | `https://ui.awin.com/user` | | ||
| | Advertiser home | `https://app.awin.com/en/awin/advertiser/{merchant_id}/home` | | ||
| | Publisher Performance report | `https://app.awin.com/en/awin/advertiser/{merchant_id}/reports/publisher-performance` | | ||
| | All partnerships | `https://app.awin.com/en/awin/advertiser/{merchant_id}/partnerships/all` | | ||
| | Commissions | `https://app.awin.com/en/awin/advertiser/{merchant_id}/commissions` | | ||
| | Campaigns (new UI) | `https://app.awin.com/en/awin/advertiser/{merchant_id}/campaigns` | | ||
|
|
||
| Merchant IDs are stable integers (5–7 digits) — read them off the URL after picking an account on `ui.awin.com/user`. The same advertiser brand may have separate IDs per region (US / EU / APAC). | ||
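The URL table above can be turned into a small helper. This is a minimal sketch: `advertiser_url`, `merchant_id_from_url`, and the `PAGES` keys are hypothetical names, and the 5 to 7 digit pattern simply mirrors the range stated above.

```python
import re
from typing import Optional

BASE = "https://app.awin.com/en/awin/advertiser"

# Path suffixes copied from the URL table; the dict keys are illustrative.
PAGES = {
    "home": "home",
    "publisher_performance": "reports/publisher-performance",
    "partnerships": "partnerships/all",
    "commissions": "commissions",
    "campaigns": "campaigns",
}


def advertiser_url(merchant_id: int, page: str = "home") -> str:
    """Build a dashboard URL for one advertiser account."""
    return f"{BASE}/{merchant_id}/{PAGES[page]}"


def merchant_id_from_url(url: str) -> Optional[int]:
    """Read the merchant ID back off an advertiser URL, or None if absent."""
    match = re.search(r"/advertiser/(\d{5,7})(?:/|$)", url)
    return int(match.group(1)) if match else None
```

Reading the ID back from the current URL after choosing an account on `ui.awin.com/user` avoids hard-coding per-region IDs.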
|
|
||
| ## Login flow | ||
|
|
||
| Two-step: email → Continue → password → Sign in. Note that after successful login the URL still contains `/login` for a moment (`id.awin.com/u/login/password?...` → `ui.awin.com/user`) — **detect success by visible text ("Your Accounts", "Manage Accounts", "Advertiser Reports"), not by URL.** | ||
|
|
||
| ```js | ||
| async () => { | ||
| // dismiss cookie banner first — it blocks lazy-loaded KPIs | ||
| const ck = [...document.querySelectorAll('button')].find(b => /accept all/i.test(b.textContent||'')); | ||
| if (ck) ck.click(); | ||
|
|
||
| const email = document.querySelector('input[type="email"], input[name="username"]'); | ||
| if (email) { email.focus(); email.value = EMAIL; | ||
| email.dispatchEvent(new Event('input', {bubbles:true})); | ||
| email.dispatchEvent(new Event('change', {bubbles:true})); | ||
| } | ||
| const cont = [...document.querySelectorAll('button')].find(b => /continue/i.test(b.textContent)); | ||
| if (cont) cont.click(); | ||
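| // wait ~3s for the password page transition (assumes the step does not trigger a full page reload, which would discard this script) | ||
| await new Promise(resolve => setTimeout(resolve, 3000)); | ||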
| const pw = document.querySelector('input[type="password"]'); | ||
| if (pw) { pw.focus(); pw.value = PASSWORD; | ||
| pw.dispatchEvent(new Event('input', {bubbles:true})); | ||
| pw.dispatchEvent(new Event('change', {bubbles:true})); | ||
| } | ||
| const submit = [...document.querySelectorAll('button')].find(b => /sign in|log in|submit/i.test(b.textContent)); | ||
| if (submit) submit.click(); | ||
| } | ||
| ``` | ||
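Since the URL is unreliable right after login, success detection can be a plain text check. A minimal sketch on the Python side, assuming the caller has already fetched the page's visible text (for example `document.body.innerText`); `login_succeeded` and `SUCCESS_MARKERS` are hypothetical names, and the marker strings are the ones listed above.

```python
# Marker strings taken from the login-flow note above.
SUCCESS_MARKERS = ("Your Accounts", "Manage Accounts", "Advertiser Reports")


def login_succeeded(visible_text: str) -> bool:
    """Return True when any post-login marker appears in the page text."""
    return any(marker in visible_text for marker in SUCCESS_MARKERS)
```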
|
|
||
| ## The cookie-banner trap | ||
|
|
||
| If the cookie banner ("Cookies and privacy") is still visible, the advertiser home renders **only skeleton placeholders** — gray bars where KPI cards should be. `wait_for_load()` returns immediately because the SPA is "ready," but the actual data fetches are deferred until the banner is dismissed. Symptom: screenshot shows three loading dots and a sidebar full of gray rectangles. | ||
|
|
||
| **Always dismiss the banner before waiting for content.** Run the dismissal on every page visit, not just at login: Awin re-shows the banner on some routes. | ||
|
|
||
| ## Skeleton-load polling pattern | ||
|
|
||
| `domcontentloaded` + a fixed `sleep(6)` is not enough. The home page can take 8–15s for KPI tiles to render. Poll for either: | ||
|
|
||
| 1. Skeleton placeholder count to drop below ~5: `[class*=skeleton],[class*=Skeleton],[class*=placeholder]` | ||
| 2. Specific KPI text to appear: `Revenue`, `Transactions`, `Clicks`, `Performance` | ||
|
|
||
| ```js | ||
| async () => { | ||
| const skel = document.querySelectorAll('[class*=skeleton],[class*=Skeleton],[class*=placeholder]').length; | ||
| const txt = document.body.innerText || ''; | ||
| return { skel, ready: skel < 5 && txt.length > 800 }; | ||
| } | ||
| ``` | ||
|
|
||
| Poll every 1s, max 45s. Also do a slow scroll to bottom + back to top — it triggers IntersectionObserver-driven lazy mounts for sections below the fold. | ||
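The polling loop itself might look like the sketch below. It assumes a hypothetical `probe` callable that runs the JS snippet above in the page and returns its `{skel, ready}` result as a dict; the `sleep` and `clock` parameters exist only so the loop can be exercised without a browser.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], dict],
    timeout_s: float = 45.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Call probe() until it reports ready; return None on timeout."""
    deadline = clock() + timeout_s
    while True:
        state = probe()
        if state.get("ready"):
            return state
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

A `None` return is the cue to dismiss the cookie banner again and retry instead of scraping skeleton placeholders.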
|
|
||
| ## Where the real data lives | ||
|
|
||
| Awin renders KPIs in styled `<h1>`/`<strong>` blocks, NOT in `<table>` elements. The home page exposes everything in `document.body.innerText` in a predictable order: | ||
|
|
||
| ``` | ||
| <Advertiser Name> (<merchant_id>) | ||
| Home | ||
| Campaigns | ||
| ... | ||
| Revenue | ||
| <date> Yesterday | ||
| <currency><value> | ||
| <delta>% | ||
| Transactions | ||
| <date> Yesterday | ||
| <value> | ||
| <delta>% | ||
| Clicks | ||
| <date> Yesterday | ||
| <value> | ||
| <delta>% | ||
| ... | ||
| Revenue trend | ||
| Last 7 days | ||
| <currency><value> | ||
| <delta>% | ||
| ... | ||
| Top partners | ||
| <date> Yesterday | ||
| Chart | ||
| Bar chart with 5 bars. | ||
| ... | ||
| <currency><value><currency><value> ← value doubled with U+200B zero-width space between | ||
| <currency><value><currency><value> | ||
| ... | ||
| <currency>0 | ||
| <currency><axis> | ||
| <currency><axis> | ||
| <Partner 1 name> ← partner names in same order as bar values | ||
| <Partner 2 name> | ||
| <Partner 3 name> | ||
| <Partner 4 name> | ||
| <Partner 5 name> | ||
| See publisher performance report | ||
| ``` | ||
|
|
||
| **Regex extractors that work** (Python; currency in `$€£`): | ||
|
|
||
| ```python | ||
| import re  # `raw` below is the page's document.body.innerText | ||
| # Yesterday tile (revenue/txns/clicks): | ||
| re.findall(r"(Revenue|Transactions|Clicks)\s+\w+\s+\d+\s+\d{4}\s+Yesterday\s+([$€£]?[\d,\.]+)\s+(-?[\d\.]+)%", raw) | ||
|
|
||
| # 7-day trend: | ||
| re.search(r"Revenue trend\s+Last 7 days\s+([$€£][\d,\.]+)\s+(-?[\d\.]+)%", raw) | ||
|
|
||
| # Top-5 bar chart values (note zero-width space U+200B between the duplicate): | ||
| re.findall(r"([$€£][\d,\.]+)\u200b?[$€£][\d,\.]+", raw) | ||
| ``` | ||
|
|
||
| The Top-5 partner *names* sit between `End of interactive chart.` and `See publisher performance report` — split that slice by newline, drop axis labels (`$0`, `$400`, `$800`, `End of...`). | ||
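The slice-and-filter step can be sketched as follows. `top_partners` is a hypothetical helper; it assumes the doubled bar values appear between `Top partners` and `End of interactive chart.`, and that axis labels are bare currency amounts such as `$0` or `$400`.

```python
import re

HEAD = "Top partners"
START = "End of interactive chart."
END = "See publisher performance report"

# A bare currency amount on its own line, e.g. "$400" (axis tick).
AXIS_LABEL = re.compile(r"^[$€£][\d,\.]+$")
# Doubled bar value, optionally split by a zero-width space (U+200B).
BAR_VALUE = re.compile(r"([$€£][\d,\.]+)\u200b?[$€£][\d,\.]+")


def top_partners(raw):
    """Pair partner names with bar values, relying on their shared order."""
    head = raw.find(HEAD)
    if head == -1:
        return []
    start = raw.find(START, head)
    if start == -1:
        return []
    end = raw.find(END, start)
    if end == -1:
        return []
    values = BAR_VALUE.findall(raw[head:start])
    names = []
    for line in raw[start + len(START):end].splitlines():
        line = line.strip()
        if line and not AXIS_LABEL.match(line):
            names.append(line)
    return list(zip(names, values))
```

`zip` stops at the shorter list, so a mismatch between names and values truncates the result without raising; compare the lengths first if that matters.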
|
|
||
| ## Publisher Performance page | ||
|
|
||
| `/reports/publisher-performance` renders an embedded Looker/BI iframe. The default view ships with **no date range applied** — `document.body.innerText` returns essentially just the page chrome ("Take a quick tour", "Need help? Ask Ava", "Date Last Refreshed - ...") plus an empty canvas. To get tabular data you must click into the date selector and the visualization first; even then most data is canvas-rendered and unreachable via DOM. | ||
|
|
||
| **Recommended workaround**: skip DOM scraping here. Either | ||
| 1. Use the full-page screenshot for visual evidence in the report, or | ||
| 2. Export the report via Awin's CSV download (button: "Export → CSV") — the URL is a signed S3 link, easy to grab via the network panel. | ||
|
|
||
| ## Partnerships ("All partnerships") page | ||
|
|
||
| `/partnerships/all` is the best source for publisher details — it's plain DOM, fully scrapable. Default sort is `Joined: Newest-to-oldest`, ~10 rows per page, ~197 pages for an established program (use the `1 2 3 ⋯ 197` pager). | ||
|
|
||
| Each row follows this exact `innerText` block — extractable with one regex: | ||
|
|
||
| ``` | ||
| <Publisher Name> | ||
| <numeric publisher id> | ||
| Status | ||
| Partners ← or "Pending" or "Left your program" | ||
| Website | ||
| <domain> | ||
| Primary promotional type | ||
| <type> ← may be empty string! | ||
| Primary sector | ||
| <sector> | ||
| Partners since ← or "Left on" | ||
| <Month DD, YYYY> | ||
| ``` | ||
|
|
||
| ```python | ||
| re.compile( | ||
| r"([A-Za-z][\w\s\.,&\-\(\)']{1,60})\n(\d{4,7})\nStatus\n(Partners|Pending|Left your program)\n" | ||
|
Contributor
P2: Partnership row regex is inconsistent with documented status/date variants, causing it to miss non-Partners rows (Pending, Left your program) and rows using Left on instead of Partners since. |
||
| r"Website\n([^\n]+)\n" | ||
| r"Primary promotional type\n([^\n]*)\n" | ||
| r"Primary sector\n([^\n]*)\n" | ||
| r"(?:Partners since|Left on)\n([A-Z][a-z]{2,8} \d{1,2}, \d{4})" | ||
| ) | ||
| ``` | ||
|
|
||
| **Trap**: `Primary promotional type` can be blank (the line below it is just `\n`). Don't require non-empty — capture as `[^\n]*` not `[^\n]+`. Status can also be `Pending` (visible above `Your partnerships` count) or `Left your program` — those rows have `Left on` instead of `Partners since`. | ||
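A sketch of a row parser that accepts the variants listed in the trap above. Assumptions: every row ends with either a `Partners since` or a `Left on` date line (not confirmed for `Pending` rows), and publisher names stay on a single line, so the name class uses a literal space instead of `\s`. `parse_partnerships` and `FIELDS` are hypothetical names.

```python
import re

ROW = re.compile(
    r"([A-Za-z][\w \.,&\-\(\)']{1,60})\n(\d{4,7})\n"
    r"Status\n(Partners|Pending|Left your program)\n"
    r"Website\n([^\n]+)\n"
    r"Primary promotional type\n([^\n]*)\n"  # may be blank
    r"Primary sector\n([^\n]*)\n"
    r"(Partners since|Left on)\n([A-Z][a-z]{2,8} \d{1,2}, \d{4})"
)

FIELDS = (
    "name", "publisher_id", "status", "website",
    "promo_type", "sector", "date_label", "date",
)


def parse_partnerships(text):
    """Return one dict per partnership row found in the page's innerText."""
    return [dict(zip(FIELDS, match.groups())) for match in ROW.finditer(text)]
```

Capturing the status and the date label keeps partners who left distinguishable from active ones downstream.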
|
|
||
| ## Account picker (`ui.awin.com/user`) | ||
|
|
||
| After login, users with multiple advertiser accounts land here. The page lists each account with merchant ID. To jump straight to a specific advertiser, skip the picker and navigate directly to `https://app.awin.com/en/awin/advertiser/{merchant_id}/home` — Awin's auth carries across, no click required. | ||
|
|
||
| ## Network APIs (worth investigating, not yet documented) | ||
|
|
||
| The dashboard hits `https://app.awin.com/api/...` and `https://api.awin.com/...` endpoints with bearer tokens that appear to be stored in `localStorage` (unconfirmed). Direct API calls would likely be much faster than DOM scraping (10×+ is an estimate, not a measurement). Untested but visible XHRs: | ||
|
|
||
| - `GET /api/advertiser/{mid}/dashboard/kpi?period=yesterday` | ||
| - `GET /api/advertiser/{mid}/publishers?sort=joined_desc&page=1` | ||
|
|
||
| Next agent on this domain: drop into DevTools Network tab on a fresh dashboard load, copy the bearer header, and replay. If the bearer is in `localStorage` rather than an HttpOnly cookie, the scraper can grab it via `js("localStorage.getItem('access_token')")` and bulk-fetch. | ||
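For whoever picks this up, the replay step could start from a request builder like the one below. Everything here is an assumption: the endpoint paths are the untested ones listed above, the `Authorization: Bearer` scheme is inferred from the note about bearer tokens, and `build_api_request` is a hypothetical helper. The sketch only constructs the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://app.awin.com"  # host taken from the XHR note above


def build_api_request(path, token, params=None):
    """Build (but do not send) a bearer-authenticated GET request."""
    url = f"{API_BASE}{path}"
    if params:
        url += "?" + urlencode(params)
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )
```

Sending it with `urllib.request.urlopen` only makes sense once the endpoint and token location have been confirmed in DevTools.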
|
|
||
| ## Isolated profile pattern (concurrent with MCP browser) | ||
|
|
||
| The MCP Playwright server locks `~/Library/Caches/ms-playwright/mcp-chrome-*` exclusively. To run a second scraper concurrently without disturbing the user's active MCP session, launch your own persistent context with a different `user_data_dir`: | ||
|
|
||
| ```python | ||
| from pathlib import Path | ||
| from playwright.sync_api import sync_playwright | ||
| PROFILE = Path.home() / ".cache" / "awin-isolated-profile" | ||
| PROFILE.mkdir(parents=True, exist_ok=True) | ||
| with sync_playwright() as p: | ||
|     ctx = p.chromium.launch_persistent_context( | ||
|         user_data_dir=str(PROFILE), | ||
|         headless=True, | ||
|         viewport={"width": 1600, "height": 1000}, | ||
|         args=["--disable-blink-features=AutomationControlled", "--no-sandbox"], | ||
|     ) | ||
|     # ... scrape using ctx.new_page() here ... | ||
|     ctx.close() | ||
| ``` | ||
|
|
||
| Persistent profile means session cookies survive between runs — login once, scrape many times. After the first successful login, subsequent runs land directly on the dashboard. | ||
P2: Login recipe documents waiting ~3s for password page transition but implements no wait, creating a race condition where password entry may be silently skipped.