Expert_ML_RAG_API/DATA_SOURCES.md at main · postybaloney/Expert_ML_RAG_API

Data was collected primarily from GitHub and Medium. I also used ChatGPT to generate a small sample for StackOverflow intake, but that was not the primary focus. I compiled a list of machine learning keywords that I considered relevant and used it to automate searching for experts on GitHub and Medium.

I mainly used Selenium to automate collecting expert usernames from GitHub and Medium. The full code for this process is in the scrapper.py file. For GitHub, navigating directly to the search URL ("https://github.com/search?q={keyword}&type=users&s=followers&o=desc") takes the webdriver straight to the users results, sorted by followers in descending order, which significantly reduced navigation time and sped up collection. For Medium, the page cannot be reached without logging in, so there is a one-time login step that requires a preexisting Medium account, access to its email address, and knowing to choose the code option for validation. After that, the automation collects the usernames, which the scripts in collector.py use to create the expert JSON files.
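As a minimal sketch of how the GitHub search URL above could be filled in per keyword: the template string is the one quoted in the text, while `github_search_url` and the `keywords` list are hypothetical names used only for illustration and are not taken from scrapper.py.

```python
from urllib.parse import quote_plus

# URL template quoted in the text above.
GITHUB_USER_SEARCH = (
    "https://github.com/search?q={keyword}&type=users&s=followers&o=desc"
)


def github_search_url(keyword: str) -> str:
    """Return the user-search URL for one keyword (hypothetical helper)."""
    # quote_plus encodes spaces and reserved characters so that multi-word
    # keywords such as "machine learning" form a valid query string.
    return GITHUB_USER_SEARCH.format(keyword=quote_plus(keyword))


# Hypothetical keyword list; the real list lives in the project code.
keywords = ["machine learning", "pytorch"]
urls = [github_search_url(k) for k in keywords]
```

Each resulting URL could then be passed to the Selenium webdriver's `get()` method, so the browser lands on the sorted user results without any intermediate clicks.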

Both Medium and GitHub enforce rate limits. I stayed under them by pausing with time.sleep(10) in github_users() and time.sleep(15) in medium_users(). A lower time.sleep() value resulted in either a rate limit exception page on GitHub or a Cloudflare authentication procedure on Medium.
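The pacing described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in scrapper.py: `collect_with_delay`, `fetch`, and the injectable `sleep` parameter are hypothetical, and only the 10-second and 15-second delay values come from the text.

```python
import time
from typing import Callable, Dict, Iterable, List


def collect_with_delay(
    keywords: Iterable[str],
    fetch: Callable[[str], List[str]],
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[str]]:
    """Call fetch(keyword) for each keyword, pausing between requests.

    `fetch` stands in for whatever loads a results page and returns the
    usernames found on it (hypothetical; not the project's function).
    """
    results: Dict[str, List[str]] = {}
    for index, keyword in enumerate(keywords):
        if index > 0:
            # Pause before every request after the first so that
            # consecutive page loads stay under the site's rate limit.
            sleep(delay_seconds)
        results[keyword] = fetch(keyword)
    return results
```

With this shape, GitHub collection would use `delay_seconds=10` and Medium collection `delay_seconds=15`, matching the values reported above. Passing `sleep` as a parameter is only a convenience that lets the pacing be checked without actually waiting.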
