De-duplicating nodes via similarity & community detection #119

Use something along the lines of the `ingredient de-duplicating pipeline` demonstrated in a Neo4j tutorial that uses the BBC Good Food ingredients.

### Pipeline

- [ ] download the Good Food dataset (scraping required!) or something equivalent
- [ ] normal NLP preprocessing steps
	- [ ] character encodings
	- [ ] tokenization
	- [ ] stemming (plurals)
	- [ ] stopwords / length etc.
- [ ] connect tokens to an ingredient
	- ingredient: `cherry tomato` => parts: `cherry` and `tomato`
- [ ] Use string distance to create similarity edges
	- [ ] sorensenDiceSimilarity ??
- [ ] Use phonetic similarity to create similarity edges
	- [ ] doubleMetaphone ??
- [ ] Run a community detection algorithm (like Louvain) to cluster similar ingredients together
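To make the steps above concrete, here is a minimal stdlib-only Python sketch, not the tutorial's implementation. It assumes a hypothetical stopword list and a naive plural stripper, computes a set-based Sørensen-Dice score on character bigrams, and groups names by connected components over the similarity edges as a simple stand-in for Louvain. The phonetic step (double metaphone) is left out because it needs a third-party library. All function names and the `0.8` threshold are illustrative assumptions.

```python
import re
import unicodedata
from itertools import combinations

# Hypothetical stopword list, only for illustration.
STOPWORDS = {"of", "the", "a", "and", "fresh"}


def stem(token):
    # Naive plural stripping; a real stemmer would handle more cases.
    if token.endswith("oes"):
        return token[:-2]
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(name):
    # Fold accents to ASCII, lowercase, tokenize, drop stopwords and
    # very short tokens, then strip plurals.
    text = unicodedata.normalize("NFKD", name)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    tokens = re.findall(r"[a-z]+", text)
    return [stem(t) for t in tokens if t not in STOPWORDS and len(t) > 2]


def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    # Set-based Sorensen-Dice coefficient on character bigrams.
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def cluster(names, threshold=0.8):
    # Add a similarity edge when the Dice score reaches the threshold,
    # then merge connected names with union-find (stand-in for Louvain).
    keys = {n: " ".join(normalize(n)) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dice(keys[a], keys[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


if __name__ == "__main__":
    names = ["cherry tomato", "Cherry tomatoes", "cherry tomatos", "olive oil"]
    for group in cluster(names):
        print(group)
```

Connected components merge anything reachable through a chain of edges, so on a larger dataset a modularity-based algorithm such as Louvain should give tighter clusters than this stand-in.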

