Add ELOG scraper and deployment for FNAL dCache ELOG using FNAL Ollama server#456
Conversation
- Add `ElogScraper` to crawl ELOG logbooks (pagination, entry parsing, structured metadata extraction: tech, category, node, incident_date, etc.)
- Support an explicit `elog-<url>` prefix in input lists for unambiguous ELOG URL detection alongside the existing heuristic (`_is_elog_url`); see the sketch below
- Update the cms-comp-ops agent prompt with ELOG tool guidance: use the `tech:` field for person queries, cite the `url` metadata (not internal hashes), clarify that `[N]` are result indices rather than ELOG entry numbers, and note the 5-result limit
- Add `examples/deployments/basic-ollama-fnal` with a config targeting ollama.fnal.gov and the FNAL dCache ELOG as a data source

Assisted by Claude Sonnet 4.6
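A rough sketch of the prefix-vs-heuristic classification (only `_is_elog_url` and the `elog-` prefix come from this PR; the helper below and its exact behavior are illustrative):

```python
ELOG_PREFIX = "elog-"

def classify_input_line(line: str) -> tuple[str, bool]:
    """Return (url, is_elog) for one line of an input list.

    An explicit 'elog-<url>' prefix marks the URL unambiguously;
    otherwise fall back to a heuristic in the spirit of _is_elog_url.
    """
    line = line.strip()
    if line.startswith(ELOG_PREFIX):
        return line[len(ELOG_PREFIX):], True
    # Hypothetical heuristic: ELOG instances usually carry 'elog'
    # somewhere in the host or path.
    return line, "elog" in line.lower()
```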
Pull request overview
Adds first-class scraping support for ELOG logbooks and wires it into the existing ScraperManager, along with an example FNAL/Ollama deployment configuration and agent guidance.
Changes:
- Extend `ScraperManager` to detect `elog-` URLs (and simple heuristics) and run an ELOG collection step.
- Introduce the `ElogScraper` integration to crawl ELOG index pages, discover entries, and persist each entry as a `ScrapedResource`.
- Add an example deployment config, input lists, and prompt/agent guidance updates for CMS Comp Ops usage.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/data_manager/collectors/scrapers/scraper_manager.py` | Adds ELOG config parsing, URL classification, and a new `collect_elog` collection path. |
| `src/data_manager/collectors/scrapers/integrations/elog_scraper.py` | New requests/BeautifulSoup-based crawler for ELOG pagination + entry extraction. |
| `examples/deployments/basic-ollama-fnal/miscellanea.list` | Example input list content (mostly commented). |
| `examples/deployments/basic-ollama-fnal/dcache-elog.list` | Example ELOG-prefixed logbook URL input list. |
| `examples/deployments/basic-ollama-fnal/config.yaml` | Example deployment config enabling ELOG scraping options and referencing input lists. |
| `examples/deployments/basic-ollama-fnal/condense.prompt` | Adds a condense prompt template for the example deployment. |
| `examples/deployments/basic-ollama-fnal/agent.prompt` | Adds an example agent prompt. |
| `examples/agents/cms-comp-ops.md` | Adds guidance for using metadata tools with ELOG-derived fields/URLs. |
From `src/data_manager/collectors/scrapers/scraper_manager.py`:

```python
def collect_elog(self, persistence: PersistenceService, extra_urls: Optional[List[str]] = None) -> int:
    """Collect all entries from configured ELOG logbooks."""
    # None default instead of a mutable []; copy so the caller's list
    # is never mutated.
    urls_to_scrape: List[str] = list(extra_urls or [])
    if self.elog_enabled and self.elog_config.get("url"):
        urls_to_scrape.append(self.elog_config["url"])

    if not urls_to_scrape:
        return 0

    total = 0
    for url in urls_to_scrape:
        # Per-logbook config: shared ELOG options with the URL swapped in.
        cfg = {**self.elog_config, "url": url}
        scraper = ElogScraper(cfg)
        # ... (remainder of the loop not shown in this review hunk)
```
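Two details worth noting in the signature above: a literal `[]` default would be shared across calls (the classic Python mutable-default pitfall), so `Optional[List[str]] = None` with `list(extra_urls or [])` is the safe spelling; and guarding on `self.elog_config.get("url")` keeps a missing `url` key from appending `None` to the scrape list.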
From `src/data_manager/collectors/scrapers/integrations/elog_scraper.py`:

```python
self.base_url = config.get("url", "").rstrip("/") + "/"
self.max_entries: Optional[int] = config.get("max_entries")
# SSL verification defaults to off (lab ELOG instances often use
# self-signed certificates); suppress the resulting urllib3 warnings.
self.verify_ssl = config.get("verify_ssl", False)
self._session = requests.Session()
if not self.verify_ssl:
    import urllib3
    urllib3.disable_warnings()
```

```python
# entry_time is already captured in meta via the hidden form inputs,
# so no additional parsing of the entry text is needed here.
```
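For reference, the session/verify pair is typically threaded through each request along these lines (the `_get` helper below is illustrative, not from the PR):

```python
def _get(self, url: str) -> requests.Response:
    # The Session gives connection reuse and cookie handling;
    # verify_ssl is honored on every request.
    resp = self._session.get(url, timeout=30, verify=self.verify_ssl)
    resp.raise_for_status()
    return resp
```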
From `condense.prompt`:

```
# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
#
# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
# {chat_history}
```
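A minimal body satisfying that contract could look like the following (illustrative only, not the shipped file's content):

```
Condense the conversation below into a brief summary that preserves
technical details, decisions made, and open questions.

{chat_history}
```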
From `config.yaml`:

```yaml
# Basic configuration file for an Archi deployment
local:
  enabled: true
  base_url: https://ollama.fnal.gov  # make sure this matches your ollama server URL!
  mode: ollama  # uses LangChain's ChatOllama class; the other option, openai_compat, uses LangChain's ChatOpenAI class
```
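Given the options `ElogScraper` reads (`url`, `max_entries`, `verify_ssl`) and the enabled flag checked in `ScraperManager`, the ELOG portion of such a config plausibly looks like this (the top-level key name and the example URL are assumptions):

```yaml
elog:
  enabled: true
  url: https://elog.example.fnal.gov/dCache  # hypothetical logbook URL
  max_entries: 500    # optional cap per crawl; omit to fetch everything
  verify_ssl: false   # many lab ELOG instances use self-signed certs
```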
From `elog_scraper.py`:

```python
def _discover_entry_urls(self) -> list[str]:
    """Return deduplicated entry URLs collected from all index pages."""
    seen: set[str] = set()
    result: list[str] = []
    # ... (rest of the method not shown in this review hunk)
```
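The `seen`/`result` pair is the standard order-preserving dedup idiom; the body presumably continues along these lines (both page-iteration helpers are hypothetical):

```python
for index_url in self._iter_index_pages():                  # hypothetical
    for entry_url in self._extract_entry_links(index_url):  # hypothetical
        if entry_url not in seen:
            seen.add(entry_url)
            result.append(entry_url)
return result
```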
I left some initial comments and I'm testing it out using FNAL's Storage Archi instance. However, I think a big part that is missing is the base-config.yaml and setting up the config manager so that the e-log configuration is propagated there as well.
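Concretely, base-config.yaml would presumably grow defaults mirroring what `ElogScraper` reads, something like (the structure is only a suggestion):

```yaml
elog:
  enabled: false   # off by default; deployments opt in
  url: ""
  max_entries: null
  verify_ssl: false
```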
Thank you, Juan Pablo. I am taking a look at these comments.
The e-log scraping functionality is working as expected. As this was moved to the dev branch, I think it's okay to leave the example deployment and agent as they are. However, if we later want to move this to the main branch, I would suggest removing the example deployment, as it mostly overlaps with other deployments and the only relevant changes are highly specific.
Since the CMS CompOps agent is not going to be using this tool, I would move it to another example agent for fetching the logs.
I would remove this condense.prompt altogether as it's not really being used. Also change the deployment example name from "basic-ollama-fnal" to something a little bit more specific to the e-log use case.
You can remove this list from this example as I don't think it's relevant.
This example config should only cover the ELOG instance, IMO. Or, in any case, craft an example deployment for the FNAL storage use case but remove the unnecessary pieces from it: redmine, jira, miscellanea.list, etc.