Record ETag and Last-Modified for each URL #76
ararslan left a comment
I don't know how useful my review here actually is but here it is. It makes sense to me to include these values. Does the JSON schema need to be updated for the inclusion?
local response = nothing
try
    response = HTTP.head(url)
catch
    error("Encountered error when making HEAD request to URL: $url")
end
I assume the idea behind intercepting an exception thrown by HTTP.head is to be able to ensure the URL is logged—is that right? As written, you lose what the actual error was. Instead, you could do something like this:
response = try
    HTTP.head(url)
catch
    @error "Encountered error when making HEAD request to URL: $url"
    rethrow()
end
That said, errors from HTTP.jl generally do tell you what the URL was, so you could alternatively just do:
response = HTTP.head(url)
I believe the original error still appears in the stacktrace, right? It'll be something like [our error] "caused by" [original error].
Oh does it? I remember at some point the output from some errors doubled in length but it was never clear to me why or what makes it do that.
Let me double-check locally to make sure.
Yeah, I tested on Julia 1.10, and the original error is still shown.
julia> try
           sqrt(-1)
       catch
           error("Encountered an error: [my debugging info]")
       end
ERROR: Encountered an error: [my debugging info]
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] top-level scope
   @ REPL[1]:4

caused by: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:686 [inlined]
 [3] sqrt(x::Int64)
   @ Base.Math ./math.jl:1578
 [4] top-level scope
   @ REPL[1]:2
As to why I want to throw my own error, you're right, it's so that I can see the full URL easily.
If I just do the call to HTTP.head(), here's what I get:
julia> import HTTP
julia> HTTP.head("https://example.com/foo/bar/baz")
ERROR: HTTP.Exceptions.StatusError(404, "HEAD", "/foo/bar/baz", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Date: Fri, 01 May 2026 02:29:11 GMT
Content-Type: text/html
Connection: keep-alive
Server: cloudflare
Age: 8057
cf-cache-status: HIT
Content-Encoding: gzip
CF-RAY: 9f4b5b6c49bbcc9b-BOS
""")
Stacktrace: [elided]
So the error only shows the path (/foo/bar/baz), and I have to go back and look in the code to remind myself what the host was, which is annoying.
    file_dict["etag"] = headinfo.etag
end
if !isnothing(headinfo.last_modified)
    file_dict["last-modified"] = headinfo.last_modified
For ease of downstream comparison, should we parse this into a DateTime and write it out as ISO-8601?
If we want to handle all of the formats mentioned in RFC 9110, this should work:
let fmts = [dateformat"e, d u y H:M:S \G\M\T",  # IMF-fixdate (RFC 5322)
            dateformat"E, d-u-y H:M:S \G\M\T",  # RFC 850
            dateformat"e u d H:M:S y"]          # ANSI C asctime()
    global function parse_http_date(dt)
        dt = replace(dt, r"\s+" => " ")  # asctime left-pads days with space instead of 0
        for fmt in fmts
            x = tryparse(DateTime, dt, fmt)
            x !== nothing && return x
        end
        throw(ArgumentError("date is not in a recognized format: $dt"))
    end
end
But I think the RFC 5322 format is what we can expect to receive from any server that doesn't think it's currently 1994.
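For illustration, here is a minimal sketch of the parse-then-serialize idea for the IMF-fixdate case only. The constant name `IMF_FIXDATE` and the variable names are made up for this example, and the sample timestamp is the one used in RFC 9110, not a value from this repository.

```julia
using Dates

# IMF-fixdate, the preferred HTTP date format in RFC 9110.
# (`IMF_FIXDATE` is a hypothetical name chosen for this sketch.)
const IMF_FIXDATE = dateformat"e, d u y H:M:S \G\M\T"

# Sample timestamp from RFC 9110, used here purely as an example.
raw = "Sun, 06 Nov 1994 08:49:37 GMT"

# Parse the header text into a DateTime.
parsed = DateTime(raw, IMF_FIXDATE)

# `string` on a DateTime yields ISO-8601-style text.
iso = string(parsed)  # "1994-11-06T08:49:37"
```

Note that a `DateTime` carries no time zone, so the value would have to be understood as UTC by convention.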
Yeah, I thought about parsing it, but I don't think we'll ever actually do date comparisons or arithmetic. I.e. my plan is to just see if the value of Last-Modified has changed, instead of comparing the timestamp to now().
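A rough sketch of what that "has the value changed" check could look like, treating both headers as opaque strings. The function name `is_stale` and its policy (prefer ETag, fall back to Last-Modified, treat missing data as stale) are assumptions for illustration; only the `"etag"` / `"last-modified"` keys and the `etag` / `last_modified` fields come from the diff.

```julia
# Hypothetical helper: `recorded` is a Dict shaped like `file_dict` in the diff,
# `current` is anything with `etag` and `last_modified` fields (either may be `nothing`).
function is_stale(recorded::AbstractDict, current)
    old_etag = get(recorded, "etag", nothing)
    old_lm = get(recorded, "last-modified", nothing)

    # Prefer the ETag when both the record and the new response have one.
    if !isnothing(old_etag) && !isnothing(current.etag)
        return old_etag != current.etag
    end

    # Otherwise fall back to a plain string comparison of Last-Modified.
    if !isnothing(old_lm) && !isnothing(current.last_modified)
        return old_lm != current.last_modified
    end

    # Nothing comparable on both sides: assume stale to be safe.
    return true
end

# Example with made-up values:
recorded = Dict("etag" => "\"abc123\"")
unchanged = (etag = "\"abc123\"", last_modified = nothing)
changed = (etag = "\"def456\"", last_modified = nothing)
```

Because the comparison is plain string equality, no date parsing is needed for this use case.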
Thank you for doing this! I think this will be quite helpful, especially since a recent from-scratch rebuild I did timed out at 6 hours, and it wasn't even that close to being done.
Co-authored-by: Alex Arslan <ararslan@comcast.net>
This is needed for #67. We need the ETag and Last-Modified to determine whether the information we have is fresh or stale.
Cross-ref: #51
hat-tip: @ararslan, who suggested using ETag and Last-Modified, because we use S3.
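As a sketch of the recording step, the snippet below pulls the two values out of a list of response headers. It is not the PR's actual code: the function name `extract_headinfo` and the sample header values are hypothetical, and the headers are modeled as plain `name => value` pairs so the example runs without a network request.

```julia
# Hypothetical helper: look up ETag and Last-Modified in a collection of
# `name => value` header pairs, matching names case-insensitively.
# Returns a NamedTuple whose fields are `nothing` when a header is absent.
function extract_headinfo(headers)
    function find(name)
        for (k, v) in headers
            lowercase(k) == lowercase(name) && return String(v)
        end
        return nothing
    end
    return (etag = find("ETag"), last_modified = find("Last-Modified"))
end

# Made-up headers for illustration only.
headers = ["Content-Type" => "application/octet-stream",
           "etag" => "\"abc123\"",
           "Last-Modified" => "Sun, 06 Nov 1994 08:49:37 GMT"]

headinfo = extract_headinfo(headers)
```

Returning `nothing` for a missing header fits the `!isnothing(headinfo.last_modified)` guard shown in the diff.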