linkedin: domain skills for send-connection-request + connection-search-scrape #348

frumza wants to merge 1 commit into browser-use:main from frumza:feat/linkedin-domain-skills (+409 −0)
# LinkedIn — Scrape someone's connections list via keyword-filtered People search

Mine a target user's network (1st-degree connections, plus optionally 2nd-degree via the asker's graph) by paginating the People-search results with the `connectionOf` filter pre-applied. A keyword sweep beats a blanket scrape — it narrows from thousands of connections to roughly hundreds of relevant hits, and stays well under LinkedIn's anti-scrape thresholds.
## When to use this

Use this when you have a 1st-degree connection (e.g. a parent, partner, or co-founder) whose own network you want to mine for warm-intro candidates. LinkedIn's People search lets you filter by "Connections of X" — that returns the intersection of (your 1st+2nd-degree graph) ∩ (X's 1st-degree connections). For a parent or close colleague the overlap is usually broad enough to be useful.

Don't use this for full graph dumps — use the Voyager profile API (`/voyager/api/identity/dash/profiles?q=memberIdentity`) for that. This skill is for **targeted relevance search across someone else's network.**
## The URL pattern

```
https://www.linkedin.com/search/results/people/
?connectionOf=%5B%22<MEMBER_URN>%22%5D
&network=%5B%22F%22%2C%22S%22%5D
&keywords=<KEYWORD>
&page=<N>
&origin=MEMBER_PROFILE_CANNED_SEARCH
```

- `connectionOf` is a JSON-array-encoded list of member URN strings (just one URN in practice). Format: `ACoAA...` — get it from the target's `/in/<handle>/` profile page (see "Getting the member URN" below).
- `network=["F","S"]` = 1st + 2nd degree (relative to the *asker's* graph). Drop `"F"` to exclude direct connections; drop `"S"` to scope to direct connections only.
- `keywords=<KEYWORD>` is a free-text filter applied across name, headline, current company, past company, and education. The usual single-keyword caveats apply — narrow keywords beat broad ones.
- `page=<N>` is 1-indexed. LinkedIn caps People search at **100 pages × 10 results = 1000 results max per query**. Plan keyword sweeps so each keyword returns <1000 hits.
- `origin=MEMBER_PROFILE_CANNED_SEARCH` is what LinkedIn sets when the user clicks "See all" on a profile's connections — using it identifies the search as a legitimate UI flow and reduces friction.
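The percent-encoded JSON can be generated rather than hand-assembled. A minimal sketch of building the search URL; `people_search_url` and the sample URN are illustrative, not part of any LinkedIn API:

```python
import json
import urllib.parse

def people_search_url(member_urn, keyword, page=1):
    """Build a connectionOf People-search URL with percent-encoded JSON params."""
    base = "https://www.linkedin.com/search/results/people/"
    params = {
        # json.dumps with tight separators gives '["ACoAA..."]' / '["F","S"]'
        "connectionOf": json.dumps([member_urn], separators=(",", ":")),
        "network": json.dumps(["F", "S"], separators=(",", ":")),
        "keywords": keyword,
        "page": str(page),
        "origin": "MEMBER_PROFILE_CANNED_SEARCH",
    }
    # quote() with safe="" percent-encodes [ ] " , as %5B %5D %22 %2C
    query = "&".join(f"{k}={urllib.parse.quote(v, safe='')}" for k, v in params.items())
    return base + "?" + query

url = people_search_url("ACoAAExample", "logistics", page=2)
print(url)
```

Generating the brackets and quotes this way avoids the bare-bracket failure mode described under "Watch out for".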
## Getting the member URN

Open the target's profile (`/in/<handle>/`) while logged in, then run:
```python
js("""(() => {
  // The URN is embedded in many places. Most reliable:
  // look for 'urn:li:fsd_profile:ACoAA...' in any DOM attribute.
  const m = document.body.innerHTML.match(/urn:li:fsd_profile:([\\w-]+)/);
  return m ? m[1] : null;
})()""")
```
Returns the URN string (`ACoAA...` plus ~30 characters). Use that as `<MEMBER_URN>` in the search URL.
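Before plugging the value into the search URL, a quick shape check catches the case where the regex captured something other than the bare ID. The length bounds below are an assumption based on observed URNs, not a documented format:

```python
import re

def looks_like_member_urn(urn):
    """Heuristic shape check: 'ACoAA' prefix plus URL-safe characters (assumed bounds)."""
    return bool(urn) and re.fullmatch(r"ACoAA[\w-]{20,60}", urn) is not None

print(looks_like_member_urn("ACoAAabcdefghijklmnopqrstuvwx"))
print(looks_like_member_urn("urn:li:fsd_profile:ACoAA123"))  # full URN, not the bare ID
```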
## Result-card extraction — the key DOM pattern

LinkedIn's 2026 People-search DOM uses obfuscated CSS-module class names (e.g. `_83309bd4 _6e63fa0b d343d86c`) that rotate per build. **Don't selector-hunt for cards** — go straight to the profile anchors and use their `innerText` to read the whole card payload.

Each result card has at least one anchor pointing to `/in/<handle>/` whose `innerText` contains the full card payload joined by `\n`:
```
<Name> • <Degree>
<Headline>
<Location>
Connect
[Past: ...] | [Summary: ...] | [Current: ...]   (optional)
<Mutuals>   e.g. "<Mutual Name> and N other mutual connections"
```

Multiple anchors point at the same profile (image link + name link + headline link); dedupe by `URL.pathname`.

The minimum filter:
```python
items = js("""(() => {
  const anchors = Array.from(document.querySelectorAll('a[href*="/in/"]'));
  const out = [];
  const seen = new Set();
  for (const a of anchors) {
    const u = new URL(a.href);
    if (!u.pathname.match(/^\\/in\\/[^/]+\\/?$/)) continue;  // strict /in/<handle> only
    if (!a.innerText.includes('•')) continue;                // skip thumbnail-only links
    const href = u.pathname.replace(/\\/$/, '');
    if (seen.has(href)) continue;
    seen.add(href);
    out.push({href, text: a.innerText});
  }
  return out;
})()""")
```
Then parse `text` line by line. The first line is `"<Name> • <Degree>"` (split on `•`). The line containing `"mutual connection"` is the mutuals line. The lines `"Connect"` / `"Message"` / `"Follow"` are button labels — skip them. Lines starting with `"Past:"` / `"Summary:"` / `"Current:"` are auto-generated keyword highlights — useful for relevance scoring. Lines starting with `"Select all"` / `"Add N people to folk"` are list controls — skip them.
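Those line rules can be sketched as a standalone parser. This version also re-joins cards that render the `• <Degree>` bullet on its own line (a known variant, see "Watch out for" below); the field names and the sample payload are illustrative:

```python
def parse_card_text(text):
    """Parse a result-card innerText payload per the line rules above."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return None
    # Some cards render "• <Degree>" on its own line — re-join before splitting.
    if "•" not in lines[0] and len(lines) > 1 and lines[1].startswith("•"):
        lines[0] = lines[0] + " " + lines.pop(1)
    name, _, degree = (p.strip() for p in lines[0].partition("•"))
    card = {"name": name, "degree": degree, "headline": "", "location": "",
            "mutuals": "", "highlights": []}
    if len(lines) > 1:
        card["headline"] = lines[1]
    for l in lines[2:]:
        if l in ("Connect", "Message", "Follow") or l.startswith(("Select all", "Add ")):
            continue  # button labels and list controls
        if "mutual connection" in l.lower():
            card["mutuals"] = l
        elif l.startswith(("Past:", "Summary:", "Current:")):
            card["highlights"].append(l)  # keyword-highlight lines
        elif not card["location"]:
            card["location"] = l
    return card

sample = ("Jane Doe • 2nd\nSupply-chain lead at AcmeCorp\nNairobi, Kenya\n"
          "Connect\nPast: logistics manager\nJohn Smith and 3 other mutual connections")
print(parse_card_text(sample))
```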
## Pagination

Scroll-loading won't help — LinkedIn paginates server-side. After landing on `&page=1`, navigate to `&page=2`, and so on. Each page returns at most 10 results.

Stopping conditions:

- A page returns 0 new (deduped) results → keyword exhausted, move on
- A page returns fewer than ~5 cards → last page of results
- `page=100` reached → LinkedIn's cap; if the keyword matches more than 1000 hits you need a tighter keyword

Sleep ~2 s per page load via `time.sleep(2)` after `wait_for_load`, and ~1 s between keywords. LinkedIn does not aggressively rate-limit People search at this pace from a real logged-in session.
## Anti-bot

**Use a real signed-in Chrome session, not headless.** LinkedIn fingerprints heavily. The `feedback_real_chrome_beats_headless_for_anti_bot` lesson applies — `start_remote_daemon` with a clean profile gets you 1-3 results and then a soft block. The user's actual Chrome, with all cookies, extensions, and a real navigator fingerprint, sails through.

No `stealth_patch()` is needed on real Chrome — it is only useful in headless/cloud mode.
## Tab management gotcha

`new_tab(url)` creates a new tab, but `switch_tab()` is required to make subsequent `js()` / `goto()` calls target it. If the user manually switches tabs between calls, the daemon may follow their focus — list the tabs and explicitly switch back:

```python
tabs = list_tabs(include_chrome=False)
linkedin_tab = next((t for t in tabs if 'linkedin.com/search' in t.get('url', '')), None)
if linkedin_tab:
    switch_tab(linkedin_tab['targetId'])
```
## Keyword strategy

Hit rate per keyword varies wildly. Sample patterns from a real run against a target with a regional emerging-markets business network (39 keywords, ~380 unique results, ~10 min wall time):

| Keyword type | Example | Typical hits |
|---|---|---|
| **Geographic country** | a target-region country name | 50-80 |
| **Geographic city** | a target-region city name | 0-20 |
| **Broad industry** | `engineering`, `trading`, `logistics` | 50-100 (high noise) |
| **Specific company** | a named target company | 0-3 (high precision) |
| **Industry term** | `manufacturing`, `procurement`, `mining` | 5-15 |
| **Niche term** | a specific product or technology | 0 (target may not have these) |

**Geo-keywords yield the best per-query precision for finding "people in X country who do Y."** Company-name keywords are the highest precision but often return 0, because an exact employer match is rare in a non-overlapping network. Industry terms sit in the middle — broad enough to surface candidates, narrow enough to filter noise.
## End-to-end runnable script

Self-contained. Replace `CONN_OF` with the target's URN (see "Getting the member URN" above) and set `TARGET_LABEL` + `KEYWORDS` for your search. Run via `browser-harness < scrape.py` from a real-Chrome session signed into LinkedIn. Re-runnable and resume-safe — it dedupes against prior output and only appends.
```python
import time, json, urllib.parse
from pathlib import Path

# === EDIT THESE ===
CONN_OF = "ACoAA<...REPLACE WITH TARGET URN...>"
TARGET_LABEL = "target_label"  # used in output filename
KEYWORDS = [
    # broad industry
    "engineering", "trading", "logistics", "manufacturing",
    # named companies you care about
    "AcmeCorp", "TargetCo",
    # geographies
    "Country1", "City1",
]
MAX_PAGES_PER_KEYWORD = 10  # LinkedIn caps at 100; 10 is usually plenty per keyword
# === END EDIT ===

BASE = (f"https://www.linkedin.com/search/results/people/"
        f"?connectionOf=%5B%22{CONN_OF}%22%5D"
        f"&network=%5B%22F%22%2C%22S%22%5D"
        f"&origin=MEMBER_PROFILE_CANNED_SEARCH")

OUT = Path(f"/tmp/{TARGET_LABEL}_connections.jsonl")
OUT.parent.mkdir(parents=True, exist_ok=True)

# Resume-safe: reload hrefs already written by a previous run.
seen = set()
if OUT.exists():
    for line in OUT.read_text().splitlines():
        try:
            r = json.loads(line)
            seen.add(r.get("href", ""))
        except Exception:
            pass

def parse_card(text, href):
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return None
    # Some cards render "• <Degree>" on its own line — re-join before splitting.
    if "•" not in lines[0] and len(lines) > 1 and lines[1].startswith("•"):
        lines[0] = lines[0] + " " + lines.pop(1)
    parts = lines[0].split("•")
    name = parts[0].strip()
    degree = parts[1].strip() if len(parts) > 1 else ""
    headline = lines[1] if len(lines) > 1 else ""
    location, mutuals, summary = "", "", ""
    for l in lines[2:]:
        if l in ("Connect", "Message", "Follow") or l.startswith("Select all") or l.startswith("Add "):
            continue
        if "mutual connection" in l.lower() or " is a mutual" in l.lower():
            mutuals = l
        elif l.startswith(("Past:", "Summary:", "Current:")):
            summary += (" | " + l) if summary else l
        elif not location:
            location = l
    return {"href": href, "name": name, "degree": degree, "headline": headline,
            "location": location, "summary": summary, "mutuals": mutuals}

def scrape_current_page():
    # Nudge lazy-rendered cards into the DOM before reading anchors.
    for _ in range(3):
        js("window.scrollBy(0, 1200)")
        time.sleep(0.6)
    js("window.scrollTo(0, document.body.scrollHeight)"); time.sleep(1.2)
    js("window.scrollTo(0, 0)"); time.sleep(0.4)
    return js("""(() => {
      const anchors = Array.from(document.querySelectorAll('a[href*="/in/"]'));
      const out = [], seen = new Set();
      for (const a of anchors) {
        const u = new URL(a.href);
        if (!u.pathname.match(/^\\/in\\/[^/]+\\/?$/)) continue;
        if (!a.innerText.includes('•')) continue;
        const href = u.pathname.replace(/\\/$/, '');
        if (seen.has(href)) continue;
        seen.add(href);
        out.push({href, text: a.innerText});
      }
      return out;
    })()""") or []

for kw in KEYWORDS:
    url = BASE + "&keywords=" + urllib.parse.quote(kw)
    goto(url); wait_for_load(timeout=20); time.sleep(2.5)
    for page in range(1, MAX_PAGES_PER_KEYWORD + 1):
        if page > 1:
            goto(url + f"&page={page}"); wait_for_load(timeout=20); time.sleep(2)
        cards = scrape_current_page()
        if not cards:
            break
        new = 0
        with OUT.open("a") as f:
            for c in cards:
                if c["href"] in seen:
                    continue
                rec = parse_card(c["text"], c["href"])
                if not rec:
                    continue
                rec["keyword"] = kw
                f.write(json.dumps(rec, ensure_ascii=True) + "\n")
                seen.add(c["href"]); new += 1
        print(f"  [{kw}] page {page}: {len(cards)} cards, {new} new")
        if new == 0 and page > 1:
            break
        if len(cards) < 5:
            break
        time.sleep(1.0)

print(f"DONE — total unique: {len(seen)}, output: {OUT}")
```
## Post-processing — scoring for relevance

Raw output is high-recall, low-precision. Run a scoring pass with regex tiers:

- **Tier 1 (50 pt)** named target companies — exact match to the specific employers you're hunting
- **Tier 2 (25 pt)** industry vocab — domain-relevant role and function keywords
- **Tier 3 (20 pt)** adjacent-industry vocab — named market participants, related sectors
- **Geography (15 pt)** target-region location strings
- **Government (10 pt)** `minister`, `ambassador`, `consul` when paired with geography (if you're hunting through diplomatic/state channels)
- **Noise penalty (−15 pt)** disqualifying terms from the target's adjacent industries (e.g. for a B2B sourcing search, penalize `tourism`, `hotel`, `fashion`, `media`, `advertising`, `software engineer` — these flood broad networks)

Sort descending. Cut at a score threshold matched to your tolerance for noise (15 is reasonable when geography alone qualifies).
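A minimal sketch of such a scoring pass over the JSONL output. The tier vocabularies, point values beyond the tier structure described above, and record fields are placeholders to substitute with your own lists:

```python
import json
import re
from pathlib import Path

# Hypothetical tier vocabularies — replace with your own target lists.
TIERS = [
    (50, re.compile(r"\b(AcmeCorp|TargetCo)\b", re.I)),                  # named companies
    (25, re.compile(r"\b(procurement|sourcing|supply chain)\b", re.I)),  # industry vocab
    (15, re.compile(r"\b(Kenya|Nairobi)\b", re.I)),                      # geography
    (-15, re.compile(r"\b(tourism|hotel|fashion)\b", re.I)),             # noise penalty
]

def score(rec):
    """Sum tier points over the record's free-text fields."""
    blob = " ".join(rec.get(k, "") for k in ("headline", "location", "summary"))
    return sum(pts for pts, rx in TIERS if rx.search(blob))

def rank(jsonl_path, cutoff=15):
    """Score every record in the scrape output, sort descending, cut at threshold."""
    recs = [json.loads(l) for l in Path(jsonl_path).read_text().splitlines() if l.strip()]
    scored = sorted(((score(r), r) for r in recs), key=lambda t: t[0], reverse=True)
    return [(s, r["name"]) for s, r in scored if s >= cutoff]
```

Each regex fires at most once per record here; if you want repeated mentions to accumulate, swap `search` for `findall` and multiply.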
## What this does NOT give you

- **Email addresses.** Profile anchors don't carry them. Use `voyager_scrape_2026-04-25.py` per handle for full profile dumps if you need positions, education, etc.
- **3rd-degree connections of the target.** The `network` parameter is relative to the asker's graph, not the target's. To enumerate the target's full 1st-degree network you'd need to log in as the target.
- **Hidden profiles.** People who set their connections list to private won't appear regardless of the filter.
## Watch out for

- **The URL must use percent-encoded JSON in `connectionOf` and `network`.** `%5B` = `[`, `%5D` = `]`, `%22` = `"`. Bare brackets will silently 400 or redirect to an empty result page.
- **Profile-card anchor `innerText` includes trailing `Connect` / `Message` / `Follow` button labels.** Filter them out during parsing.
- **The first profile anchor on a card has the full payload; subsequent anchors for the same `/in/` handle have just the name.** Dedupe by pathname before parsing.
- **Some profiles render `<Name>\n• <Degree>` with the bullet on a separate line.** Split on `•`, not on `\n`, for the first parse.
**Review (P1):** `parse_card` mishandles cards where `• <Degree>` is rendered on a separate line, causing field misalignment (degree/headline parsed incorrectly).