Use something along the lines of the `ingredient de-duplicating pipeline` demonstrated in a Neo4j tutorial using the BBC Good Food ingredients.

### Pipeline

- [ ] download the Good Food dataset (scraping required!), or something equivalent
- [ ] normal NLP preprocessing steps
  - [ ] character encodings
  - [ ] tokenization
  - [ ] stemming (plurals)
  - [ ] stopwords / length etc.
- [ ] connect tokens to an ingredient
  - ingredient: `cherry tomato` => parts: `cherry` and `tomato`
- [ ] Use string distance to create similarity edges
  - [ ] sorensenDiceSimilarity ??
- [ ] Use phonetic similarity to create similarity edges
  - [ ] doubleMetaphone ??
- [ ] Run a community detection algorithm (like Louvain) to cluster similar ingredients together
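
To get a feel for the steps before wiring anything into Neo4j, here is a minimal pure-Python sketch of the pipeline. It is not the tutorial's implementation: the names `STOPWORDS`, `naive_stem`, `tokens`, `dice` and `cluster`, the stopword list and the `0.8` threshold are all assumptions made for illustration. The Sørensen-Dice score is a simple variant over sets of character bigrams, the phonetic step is left out, and plain connected components (union-find) stand in for Louvain.

```python
import re
import unicodedata
from itertools import combinations

# Hypothetical stopword list; tune it for the real dataset.
STOPWORDS = {"of", "and", "the", "a", "fresh"}


def normalize(text):
    """Fold accented characters to ASCII and lowercase the text."""
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii").lower()


def naive_stem(token):
    """Crude plural stripping; a real stemmer would be more robust."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"   # cherries -> cherry
    if token.endswith("oes") and len(token) > 4:
        return token[:-2]         # tomatoes -> tomato
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]         # onions -> onion
    return token


def tokens(ingredient):
    """Split an ingredient into stemmed parts, dropping stopwords and 1-char tokens."""
    words = re.findall(r"[a-z]+", normalize(ingredient))
    return [naive_stem(w) for w in words if w not in STOPWORDS and len(w) > 1]


def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Sorensen-Dice coefficient over character-bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def cluster(ingredients, threshold=0.8):
    """Link ingredients whose normalized forms are similar, then group them.

    Connected components are used here as a simple stand-in for Louvain.
    """
    keys = {name: " ".join(tokens(name)) for name in ingredients}
    parent = {name: name for name in ingredients}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(ingredients, 2):
        if dice(keys[a], keys[b]) >= threshold:  # similarity edge
            parent[find(a)] = find(b)

    groups = {}
    for name in ingredients:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    print(tokens("Cherry Tomatoes"))
    print(cluster(["cherry tomato", "Cherry Tomatoes", "plain flour"]))
```

Comparing every pair is quadratic in the number of ingredients, which is fine for a quick experiment but is one reason to move the similarity edges and the community detection into the graph database for the full dataset.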