For example command, add option for YAML result file #120

tanmay-9 wants to merge 5 commits into qlever-dev:main

Conversation
example command, add option for YAML result file
hannahbast
left a comment
@tanmay-9 Thank you for this. I have now had the chance to look at this and play around with it and leave a first round of comments.
| from pathlib import Path |
| from typing import Any |
| from rdflib import Graph |
Please explain what you need `Graph` for. Anyway, as long as it's in there, you have to include rdflib in the dependencies in the pyproject.toml.
As I am storing the headers and results of a SPARQL query in the output YAML file, for the text/turtle accept header (CONSTRUCT and DESCRIBE queries) I am using rdflib.Graph to get the results in `?s ?p ?o` form.
I had added rdflib to the pyproject.toml in the web-app pull request but forgot about it here. I will make that change.
| from ruamel.yaml import YAML | ||
| from ruamel.yaml.scalarstring import LiteralScalarString |
Why do you import ruamel.yaml and not the seemingly more standard yaml?
ruamel.yaml supports YAML 1.2, preserves formatting, and seems to be more actively maintained. I have worked with it before and continued using it here. But if it's not needed, I can switch to PyYAML as well.
| subparser.add_argument( | ||
| "--generate-output-file", | ||
| action="store_true", | ||
| default=False, | ||
| help="Generate output file in the 'output' directory", | ||
| ) | ||
| subparser.add_argument( | ||
| "--backend-name", | ||
| default=None, | ||
| help="Name for the backend that would be used in performance comparison", | ||
| ) | ||
| subparser.add_argument( | ||
| "--output-basename", | ||
| default=None, | ||
| help="Name for the dataset that would be used in performance comparison", | ||
| ) |
This should be one option, whose argument is the basename of the output file, for example --output-yml dblp.qlever
As far as I can see, the only other place where you need one of these arguments separately is when you determine whether the endpoint is a QLever endpoint, in order to determine the media type for the output. But there is an option --accept for that already. Note that producing application/qlever-results+json is more expensive (because it contains more data), so a user should be aware of this, especially when comparing to other engines; otherwise the comparison is unfair.
I put in the --backend-name and --output-basename arguments because then I can generate the output file as f"{args.output_basename}.{args.backend_name}.results.yaml". The evaluation web app can then read these files directly from the output folder, extract the names of the dataset and engine, and display the comparison results.
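A minimal sketch of how the single-option suggestion above could still satisfy this need: with one argument like `dblp.qlever`, the dataset and engine names can be recovered by splitting the basename, so the two separate options would not be required. The function name is hypothetical, not from the actual code.

```python
def split_output_basename(basename: str) -> tuple[str, str]:
    """Split an --output-yml argument like 'dblp.qlever' into
    (dataset, engine). Hypothetical helper for illustration."""
    dataset, _, engine = basename.partition(".")
    return dataset, engine

dataset, engine = split_output_basename("dblp.qlever")
# Same filename pattern as the existing two-option approach.
output_file = f"{dataset}.{engine}.results.yaml"
```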
Currently, when an output file needs to be generated, I automatically use the TSV format to get results for non-QLever backends, the qlever-results+json format for QLever backends (so that runtime execution-tree information can be displayed in the web app), and the text/turtle format for CONSTRUCT and DESCRIBE queries. I do this so that I can read the SPARQL query results easily and put them in the output YAML file. But yes, I think I will have to change this logic if a standardized comparison is to be done between various engines.
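For illustration, the selection logic described above as a standalone sketch. The function name is hypothetical and the query-kind detection is simplified (it ignores leading PREFIX declarations); it is not the actual implementation.

```python
def pick_accept_header(query: str, is_qlever: bool) -> str:
    """Choose the Accept header as described in the comment above.
    Simplified: assumes the query starts with its keyword."""
    kind = query.strip().split(None, 1)[0].upper()
    if kind in ("CONSTRUCT", "DESCRIBE"):
        return "text/turtle"
    if is_qlever:
        # Richer output (execution tree), but more expensive to produce.
        return "application/qlever-results+json"
    return "text/tab-separated-values"
```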
| record["runtime_info"] = json.loads(runtime_info_str) | ||
| record["runtime_info"]["client_time"] = client_time | ||
| record["headers"] = headers | ||
| record["results"] = results |
Please remind me what the point of having the results in the YAML output is?
I see the MAX_RESULT_SIZE in the code. This should be an option. And depending on your answer to the first question, it might be better to have no results (or very few results) in the output by default.
It is just something that comes from the evaluation project: the output YAML files in it had the query results in them. So for my project, it was discussed that it would be nice to be able to display the results of the query in the web app as well. This way we can also verify whether the query results are correct (which can also be done with just the COUNT). I agree that this adds a fair bit of complexity to the example-queries command and, for simple benchmarking purposes, doesn't add much.
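A minimal sketch of the suggestion to turn `MAX_RESULT_SIZE` into an option: expose it as a command-line argument and slice the result rows before they go into the YAML record. The option name, default, and variable names are illustrative assumptions, not the actual code.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical option replacing the hard-coded MAX_RESULT_SIZE constant;
# a default of 0 would mean "store no results", per the suggestion above.
parser.add_argument(
    "--max-result-size", type=int, default=0,
    help="Maximum number of result rows stored in the YAML output "
         "(0 = store no results)",
)
args = parser.parse_args(["--max-result-size", "2"])

results = [["a"], ["b"], ["c"]]
stored = results[: args.max_result_size]
```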
@tanmay-9 And another question: I tried to use the YAML files produced with the new …

I have changed the structure of the generated YAML file with …
Pull Request Overview
This PR adds support for YAML input files to read queries and enables output file generation for performance comparisons using YAML. The changes include new command-line arguments for specifying query and output file options, new helper functions for parsing YAML and constructing output records, and an update to project dependencies.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/qlever/commands/example_queries.py | Added new arguments; introduced functions to handle YAML queries and output. |
| pyproject.toml | Updated dependencies to include ruamel.yaml and rdflib. |
Comments suppressed due to low confidence (2)
src/qlever/commands/example_queries.py:331
- The variable 'is_qlever' is used before it is initialized. Please initialize it (e.g., set is_qlever to False) before using it in the conditional assignment.
is_qlever = is_qlever or "qlever" in args.backend_name.lower()
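A minimal sketch of the fix Copilot suggests: initialize `is_qlever` before the conditional update. The stand-in value for `args.backend_name` is illustrative.

```python
# Initialize before use, as suggested; previously the name could be
# read before assignment.
is_qlever = False

backend_name = "QLever-backend"  # illustrative stand-in for args.backend_name
if backend_name is not None:
    is_qlever = is_qlever or "qlever" in backend_name.lower()
```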
src/qlever/commands/example_queries.py:212
- The return type in the 'parse_queries_file' function is annotated as dict[str, list[str, str]], but the function actually returns a dictionary with the 'queries' key mapping to a list of dictionaries. Consider updating the type annotation to dict[str, list[dict[str, str]]].
def parse_queries_file(queries_file: str) -> dict[str, list[str, str]]:
{queries: [{query: str, sparql: str}]}
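For illustration, the corrected annotation Copilot proposes, with a return value in the documented `{queries: [{query: str, sparql: str}]}` shape. The function body here is a hypothetical stand-in, not the real `parse_queries_file`.

```python
def parse_queries_file_example() -> dict[str, list[dict[str, str]]]:
    """Illustrative stand-in returning the documented shape:
    the 'queries' key maps to a list of dicts, so the annotation
    dict[str, list[dict[str, str]]] matches it."""
    return {"queries": [{"query": "All papers", "sparql": "SELECT ..."}]}

parsed = parse_queries_file_example()
```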