- JD Skills: List of required skills from the job description (JD).
  - Each skill may have an optional importance weight (default: equal weight if not specified).
- Resume Skills: List of skills extracted from the candidate's resume.
  - Optionally, each skill may have a proficiency level (years of experience, seniority, certifications).
- Embedding Model: Pre-trained model (HuggingFace / sentence-transformers) used to compute semantic similarity between skills.
- Hyperparameters:
  - `threshold`: minimum similarity to consider a match valid (e.g., 0.65–0.7)
  - `bonus_alpha`: weight of the bonus score for resume-only relevant skills (0.1–0.3)
  - Optional: `max_bonus_cap` to limit the effect of extra skills.

For scoring purposes, classify skills into categories:
| Category | Description |
|---|---|
| A. JD-only skills | Skills required by JD but missing in resume. Should penalize. |
| B. Matched skills | Skills present in both JD and resume (direct or semantic match). Weighted positively. |
| C. Resume-only relevant skills | Skills in resume not explicitly listed in JD but semantically related to JD skills. Add small bonus. |
| D. Resume-only irrelevant skills | Skills in resume unrelated to JD skills. Ignore. |
| E. Clustered / Many-to-One skills | Multiple resume skills mapping to a single JD skill (e.g., ML → TensorFlow + scikit-learn). Aggregate similarity. |
- Compute semantic similarity between each JD skill and all resume skills.
- Aggregate matches for clustered skills:
  - Option 1: Max similarity (`max(sim_jd_to_resume)`)
  - Option 2: Soft OR: `1 - ∏(1 - sim(r_i, jd_skill))` → captures multiple contributions but caps at 1.
- Apply similarity threshold:
  - If `sim < threshold`, treat as missing skill → similarity = 0.
- Compute weighted sum of JD skills:

  \[ main\_score = \frac{\sum (jd\_weight \cdot sim)}{\sum jd\_weight} \]

- Missing JD skills reduce `main_score` because their similarity is zero.
- Identify resume skills not mapped to any JD skill.
- For each, compute relevance as max similarity to any JD skill.
- Include only skills with relevance > threshold_bonus (e.g., 0.7).
- Compute bonus as: \[ bonus = \alpha \cdot \text{average(relevant resume-only similarities)} \]
- Add bonus to main_score, capped to prevent inflation.
- Experience weighting: multiply skill similarity by years of experience if available.
- Proficiency scaling: scale similarity by proficiency level (junior/intermediate/senior).
- Skill importance overrides: allow recruiter to mark certain JD skills as mandatory (weight = 1), others as nice-to-have (weight < 1).
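
The aggregation, thresholding, weighted main score, and capped bonus described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the default values for `alpha` and `max_bonus_cap`, and the choice to apply the threshold to the aggregated similarity are assumptions. Similarities are taken as already computed by the embedding model.

```python
from math import prod


def aggregate_similarity(sims, mode="max"):
    """Combine similarities of several resume skills against one JD skill."""
    if not sims:
        return 0.0
    if mode == "max":
        return max(sims)
    if mode == "soft_or":
        # 1 - prod(1 - s): every extra match adds evidence, result stays <= 1.
        return 1.0 - prod(1.0 - s for s in sims)
    raise ValueError(f"unknown aggregation mode: {mode}")


def main_score(jd_weights, jd_sims, threshold=0.65):
    """Weighted mean of per-JD-skill similarities.

    jd_weights: {jd_skill: weight}
    jd_sims:    {jd_skill: aggregated similarity}; absent skills count as 0.
    Assumption: the threshold is applied after aggregation.
    """
    total_weight = sum(jd_weights.values())
    if total_weight == 0:
        raise ValueError("JD skill list is empty or all weights are zero")
    weighted = 0.0
    for skill, weight in jd_weights.items():
        sim = jd_sims.get(skill, 0.0)
        if sim < threshold:
            sim = 0.0  # below threshold: treated as a missing skill
        weighted += weight * sim
    return weighted / total_weight


def bonus_score(resume_only_relevance, alpha=0.2, threshold_bonus=0.7,
                max_bonus_cap=0.1):
    """Bonus from resume-only skills.

    resume_only_relevance: {resume_skill: max similarity to any JD skill}
    alpha and max_bonus_cap defaults are illustrative assumptions.
    """
    relevant = [r for r in resume_only_relevance.values() if r > threshold_bonus]
    if not relevant:
        return 0.0
    bonus = alpha * sum(relevant) / len(relevant)
    return min(bonus, max_bonus_cap)
```

Because the bonus is an average scaled by `alpha` and then capped, a long list of loosely related resume skills cannot raise the score beyond `max_bonus_cap`.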
Return a structured result that is directly usable by the dashboard:
| Field | Description |
|---|---|
| `skill_score` | Final numeric score (0–1 or 0–100). |
| `matched_skills` | Dictionary of JD skills → best matched resume skill(s) with similarity. |
| `missing_skills` | List of JD skills with no sufficient match. |
| `bonus_skills` | Resume-only skills contributing to bonus score. |
| `breakdown` | Optional detailed breakdown: similarity × weight per JD skill. |

- JD skills missing in resume → penalized (score = 0 for that skill).
- Multiple resume skills matching one JD skill → aggregated similarity (max or soft OR).
- Resume-only skills relevant to JD → small bonus (controlled via `alpha`).
- Resume-only irrelevant skills → ignored.
- Variable importance / weighting → JD skills weighted according to importance.
- Low similarity matches → filtered by threshold.
- Experience/proficiency consideration → optional multiplier on similarity.
- Empty resume skills → final skill score = 0.
- Empty JD skills → undefined; can default to 1 or error.
- Skill synonym expansion: expand JD skills to include related concepts (ML → TensorFlow, scikit-learn, PyTorch).
- Knowledge graph integration: identify related skills automatically.
- Gap analysis for career coaching: highlight missing JD skills to the candidate.
| Skill | Weight |
|---|---|
| Python | 0.3 |
| Machine Learning | 0.4 |
| AWS | 0.2 |
| Data Visualization | 0.1 |
- Python
- TensorFlow
- scikit-learn
- Docker

- Matched / Clustered Skills (JD skill present in resume or related skills)
  - Machine Learning → TensorFlow, scikit-learn
  - Python → Python
- JD-only skills missing
  - AWS
  - Data Visualization
- Resume-only relevant skills
  - Docker (related to AWS/DevOps)
- Resume-only irrelevant skills
  - None in this example

- Python → Python = 1.0 (direct match)
- Machine Learning → TensorFlow, scikit-learn = aggregate similarity 0.82 (soft OR of both)
- AWS → No match → 0 (missing skill)
- Data Visualization → No match → 0 (missing skill)
Resume-only skill (Docker) has relevance to AWS = 0.6 → below bonus threshold → ignored for bonus.
Weighted sum of JD skills:

\[
SkillScore_{main} = (0.3 \times 1.0) + (0.4 \times 0.82) + (0.2 \times 0) + (0.1 \times 0) = 0.3 + 0.328 + 0 + 0 = 0.628
\]

Divide by total weight (1.0) → 0.628
- No bonus because Docker similarity to JD skills < 0.7 threshold.
- Bonus = 0

\[
SkillScore = MainScore + Bonus = 0.628 + 0 = 0.628
\]

Final Score: 62.8%
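
The arithmetic of this example can be checked with a short script. The weights, per-skill similarities, Docker relevance, and the 0.7 bonus threshold are taken from the example; the value of `alpha` is an assumption and has no effect here because the bonus is zero.

```python
# Values from the worked example above.
jd_weights = {"Python": 0.3, "Machine Learning": 0.4,
              "AWS": 0.2, "Data Visualization": 0.1}
jd_sims = {"Python": 1.0, "Machine Learning": 0.82,
           "AWS": 0.0, "Data Visualization": 0.0}
docker_relevance = 0.6      # Docker's best similarity to any JD skill (AWS)
threshold_bonus = 0.7
alpha = 0.2                 # assumed; unused when no skill clears the threshold

main = sum(w * jd_sims[s] for s, w in jd_weights.items()) / sum(jd_weights.values())
bonus = alpha * docker_relevance if docker_relevance > threshold_bonus else 0.0
skill_score = main + bonus

print(f"{skill_score:.3f}")  # 0.628
```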
{
"skill_score": 0.628,
"matched_skills": {
"Python": ["Python"],
"Machine Learning": ["TensorFlow", "scikit-learn"]
},
"missing_skills": ["AWS", "Data Visualization"],
"bonus_skills": []
}

| Case | Example | How Handled |
|---|---|---|
| JD skill missing in resume | AWS | similarity=0 → penalizes score, appears in missing_skills |
| JD skill matched (direct) | Python | similarity=1 → full weight applied |
| JD skill matched via cluster | Machine Learning → TensorFlow + scikit-learn | aggregate similarity applied |
| Resume-only relevant skill | Docker | similarity < threshold → ignored for bonus (if ≥ threshold, small bonus added) |
| Resume-only irrelevant skill | (none) | ignored |
| Weighted JD skills | All skills have weights | used in main score calculation |
This example clearly shows how:
- Missing JD skills reduce the score.
- Clustered skills are handled together.
- Resume-only relevant skills can optionally add small bonus.
- Final score is a weighted combination — not a simple average.
- Undirected Edges are Insufficient: You cannot rely on undirected co-occurrence. "PyTorch" implies "Python", but "Python" does not imply "PyTorch". If a JD asks for Python and I have PyTorch, I should match. If a JD asks for PyTorch and I only have Python, I should fail. Your current graph treats them as equals.
- The "Capability Layer" is Operational Debt: Manually maintaining "capability tags" (analytics, infra, etc.) for 10,000+ rapidly changing skills is impossible. You will drown in maintenance. The graph structure itself must solve this.
- Ambiguous Scoring: "Penalize" and "Reward" are too vague. You need a normalized mathematical framework, otherwise, your scores will drift (e.g., a candidate with 100 irrelevant skills might outscore a focused candidate simply by accumulating tiny "Category C" bonuses).
We are moving from a Co-occurrence Graph (undirected, "A is related to B") to an Implication Graph (directed, "B implies proficiency in A").
We do not store a monolithic adjacency matrix. We store a Knowledge Graph with two specific node types and embedding support.
Each node represents a Skill.
{
"id": "skill_123",
"canonical_name": "PostgreSQL",
"type": "SKILL",
"cluster_id": 12, // From offline community detection (e.g., "Relational DBs")
"popularity_score": 0.85, // 0 to 1 (Global frequency)
"embedding": [0.12, -0.45, ...] // S-BERT vector of the skill description/context
}
We store two types of edges. This is critical for the "Directionality" problem.
| Edge Type | Direction | Meaning |
|---|---|---|
| CO_OCCUR | Undirected | "People often have both" |
| IMPLIES | Directed (A → B) | "Knowing A strongly implies knowing B" |
Example:
- `PyTorch` -> IMPLIES (0.95) -> `Python`
- `Python` -> IMPLIES (0.05) -> `PyTorch` (Weak edge, pruned)
- `React` <-> CO_OCCUR <-> `Node.js` (Strong ecosystem overlap, but one doesn't strictly imply the other)

We automate the "Capability" check using vectors and conditional probability, removing the manual tagging need.
Step 1: Metric Calculation. For every pair of skills $(A, B)$ in the resume corpus:
- Co-occurrence: Jaccard Index.
- Implication: Calculate the conditional probabilities $P(B \mid A)$ and $P(A \mid B)$.
- If $P(B \mid A)$ is high (> 0.7) and $P(A \mid B)$ is low, create a directed IMPLIES edge from A to B.
- If both are roughly equal and high, create a CO_OCCUR edge.
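
A minimal sketch of Step 1, assuming the corpus is available as one set of skill names per resume. The `low` cutoff of 0.3 and the reading of "roughly equal and high" as "both above `high`" are assumptions, since the text only fixes the 0.7 threshold. Edges are returned as hypothetical `(source, type, target, weight)` tuples.

```python
from collections import Counter
from itertools import combinations


def build_edges(resumes, high=0.7, low=0.3):
    """Derive IMPLIES / CO_OCCUR edges from skill co-occurrence counts.

    resumes: iterable of sets of skill names (one set per resume).
    Returns a list of (source, edge_type, target, weight) tuples.
    """
    skill_count = Counter()
    pair_count = Counter()
    for skills in resumes:
        skills = set(skills)
        skill_count.update(skills)
        pair_count.update(frozenset(p) for p in combinations(sorted(skills), 2))

    edges = []
    for pair, both in pair_count.items():
        a, b = sorted(pair)
        p_b_given_a = both / skill_count[a]   # P(B | A)
        p_a_given_b = both / skill_count[b]   # P(A | B)
        jaccard = both / (skill_count[a] + skill_count[b] - both)
        if p_b_given_a > high and p_a_given_b < low:
            edges.append((a, "IMPLIES", b, p_b_given_a))
        elif p_a_given_b > high and p_b_given_a < low:
            edges.append((b, "IMPLIES", a, p_a_given_b))
        elif p_b_given_a > high and p_a_given_b > high:
            edges.append((a, "CO_OCCUR", b, jaccard))
    return edges
```

With a corpus in which every PyTorch resume also lists Python but only a quarter of Python resumes list PyTorch, this yields a single `PyTorch` IMPLIES `Python` edge and no edge in the reverse direction.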
Step 2: Semantic Guardrails (The "Anti-Redis-Analytics" Check) Before saving an edge, compute the Cosine Similarity between the embeddings of Skill A and Skill B.
- If `EdgeWeight` is high but `VectorSimilarity` is low (e.g., "Java" and "Recruiting" often appear together in HR tech resumes), PRUNE THE EDGE. This automatically filters buzzword noise without manual tags.
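
The guardrail can be sketched as a filter over candidate edges. The cutoffs `min_weight` and `min_cosine` are assumed values (the text does not specify them), edges are hypothetical `(source, type, target, weight)` tuples, and the embeddings are plain lists of floats standing in for S-BERT vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def prune_edges(edges, embeddings, min_weight=0.7, min_cosine=0.3):
    """Drop edges that are statistically strong but semantically unrelated.

    edges:      list of (source, edge_type, target, weight)
    embeddings: {skill_name: vector}
    """
    kept = []
    for src, kind, dst, weight in edges:
        sim = cosine_similarity(embeddings[src], embeddings[dst])
        if weight >= min_weight and sim < min_cosine:
            continue  # high co-occurrence, low semantic similarity: noise
        kept.append((src, kind, dst, weight))
    return kept
```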
Step 3: Community Detection
Run Louvain/Leiden on the graph. Store the ClusterID on every node. This replaces your "Capability Layer." If Node A and Node B are in different clusters, they are functionally distinct.
This is the exact logical flow to code.
- $S_{JD}$: Set of skills in the job description (weighted by explicit "Required" vs "Nice to have").
- $S_{Res}$: Set of skills in the resume.
We don't just look for the JD skills. We look for what they imply and what they are part of.
For each JD skill $j \in S_{JD}$:
- Fetch $j$ from the graph.
- Expansion A (Alternatives): Fetch neighbors of $j$ connected via strong CO_OCCUR edges (e.g., JD asks for "AWS", graph suggests "Azure" is a valid alternative/context).
- Expansion B (Children): Fetch nodes that IMPLY $j$ (e.g., JD asks for "Python"; graph knows "Django" implies "Python").

This creates a localized subgraph $G_{JD}$.
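
The expansion step can be sketched with two plain dictionaries standing in for the graph store. The storage layout (`implies` keyed by the implying skill, `co_occur` stored symmetrically) and the `min_co_occur` cutoff for a "strong" edge are assumptions made for illustration.

```python
def expand_jd_skill(skill, implies, co_occur, min_co_occur=0.6):
    """Return the local neighbourhood of one JD skill.

    implies:  {child: {parent: weight}}, meaning child IMPLIES parent
    co_occur: {skill: {neighbor: weight}}, stored in both directions
    """
    # Expansion A: strong CO_OCCUR neighbours act as alternatives / context.
    alternatives = {n for n, w in co_occur.get(skill, {}).items()
                    if w >= min_co_occur}
    # Expansion B: any node with an IMPLIES edge pointing at the JD skill.
    children = {child for child, parents in implies.items() if skill in parents}
    return {"skill": skill, "alternatives": alternatives, "children": children}


def expand_jd(jd_skills, implies, co_occur, min_co_occur=0.6):
    """Build the localized subgraph as {jd_skill: neighbourhood}."""
    return {j: expand_jd_skill(j, implies, co_occur, min_co_occur)
            for j in jd_skills}
```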
Iterate through every resume skill $r \in S_{Res}$ and map it against the JD subgraph $G_{JD}$:

| Relationship of resume skill $r$ to JD skill $j$ | Classification | Scoring Logic |
|---|---|---|
| $r = j$ | Direct Match | Max Score (1.0) |
| $r \rightarrow j$ (IMPLIES) | Deep Match | High Score (1.0) (e.g., JD: Python, Res: PyTorch) |
| $j \rightarrow r$ (IMPLIES) | Broad Match | Partial Score (0.5) (e.g., JD: PyTorch, Res: Python) |
| $r \leftrightarrow j$ (CO_OCCUR) | Adjacent | Low Score (0.2–0.4) based on edge weight |
| Same Cluster, No Edge | Thematic | Tiny Bonus (0.05) |
| Different Cluster | Irrelevant | 0.0 |
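
The classification table translates into a cascade of checks. In this sketch the dictionary layout and the linear mapping of a CO_OCCUR edge weight onto the 0.2–0.4 band are assumptions; the tier scores themselves come from the table.

```python
def match_score(resume_skill, jd_skill, implies, co_occur, cluster_of):
    """Classify one resume skill against one JD skill.

    implies:    {child: {parent: weight}}, meaning child IMPLIES parent
    co_occur:   {skill: {neighbor: weight}}
    cluster_of: {skill: cluster_id}
    Returns (classification, score).
    """
    if resume_skill == jd_skill:
        return "direct", 1.0
    if jd_skill in implies.get(resume_skill, {}):
        return "deep", 1.0            # resume skill implies the JD skill
    if resume_skill in implies.get(jd_skill, {}):
        return "broad", 0.5           # JD skill implies the resume skill
    weight = co_occur.get(jd_skill, {}).get(resume_skill)
    if weight is not None:
        # Assumed mapping: edge weight in [0, 1] -> score in [0.2, 0.4].
        return "adjacent", 0.2 + 0.2 * min(max(weight, 0.0), 1.0)
    a, b = cluster_of.get(resume_skill), cluster_of.get(jd_skill)
    if a is not None and a == b:
        return "thematic", 0.05
    return "irrelevant", 0.0
```

The order of the checks matters: a pair that is connected by both an IMPLIES and a CO_OCCUR edge is scored by the stronger IMPLIES tier.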
We need a unified score, not just buckets.
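Let $w_j$ be the weight of JD skill $j$ (e.g., 1.0 for required, 0.5 for optional). Let $m_j$ be the best match score for JD skill $j$ found in the resume. The score is the weighted average of the best matches:

\[ Score = \frac{\sum_j w_j \cdot m_j}{\sum_j w_j} \]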
Category A (Missing Critical Skills): Handled by the denominator. If you miss a high-weight skill, your maximum possible score drops significantly.
- Refinement: Apply a Non-Linear Penalty. If coverage of "Required" skills < 50%, multiply final score by 0.5.
Category B (Matches): Handled by "Direct Match" and "Deep Match" logic (match score = 1.0).
Category C (Relevant Bonus): Handled by "Adjacent" logic. If JD wants "Postgres" and you have "MySQL" (strong co-occur/cluster match), you get 0.3 points. This boosts you over a candidate with nothing, but keeps you below a perfect match.
Category D (Irrelevant): Filtered naturally. If a skill isn't in $G_{JD}$ (the subgraph) and isn't in the same cluster, it contributes 0 to the numerator.
Category E (Many-to-One):
We use a Max() function per JD skill bucket.
- JD asks for "AWS".
- Resume has "EC2", "S3", "Lambda".
- All three imply "AWS".
- We do not sum them (which would explode the score). We take the maximum of their match scores. You can saturate the "AWS" requirement, but you cannot exceed it to compensate for missing "Python".
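
Putting the pieces together, the unified score with per-skill `max` aggregation and the non-linear penalty can be sketched as below. The definition of "coverage" (a required skill counts as covered when its best match reaches `covered_at`) is an assumption; the text only states the 50% rule and the 0.5 multiplier.

```python
def unified_score(jd_weights, match_scores, required,
                  min_coverage=0.5, penalty=0.5, covered_at=0.5):
    """Weighted share of JD requirements met, in [0, 1].

    jd_weights:   {jd_skill: weight}
    match_scores: {jd_skill: [score of every resume skill mapped to it]}
    required:     set of JD skills marked "Required"
    """
    total = sum(jd_weights.values())
    if total == 0:
        raise ValueError("JD skill list is empty or all weights are zero")

    # Many-to-one: take the best match per JD skill instead of summing.
    best = {j: max(match_scores.get(j, []), default=0.0) for j in jd_weights}
    score = sum(jd_weights[j] * best[j] for j in jd_weights) / total

    # Non-linear penalty when too few required skills are covered.
    if required:
        covered = sum(1 for j in required if best.get(j, 0.0) >= covered_at)
        if covered / len(required) < min_coverage:
            score *= penalty
    return score
```

For a JD with equally weighted "AWS" and "Python", a resume listing "EC2", "S3", and "Lambda" saturates the AWS bucket at 1.0 but still scores 0.5 overall, because the three matches are not summed.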
- Fixed Directionality: "Deep Matches" (I have PyTorch, you asked for Python) are now mathematically distinct from "Broad Matches".
- Removed Manual Tags: Used Vectors + Community Detection to automate the "Context/Capability" layer.
- Bounded Scoring: The score is a rigorous percentage (0–100%) representing "Percentage of Job Requirements Met", rather than an arbitrary integer.
- Implicit vs Explicit: We distinguish between explicitly asked skills and implicit gap filling using the IMPLIES edge type.
This system is defensible because every match is traceable to a specific edge in the graph, and the graph is built on statistically significant real-world data, not manual rules.