Based on the corrupted data, here is the list of pages with corrupted categories:
WITH wappalyzer AS (
  SELECT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)
SELECT
  technology,
  category,
  COUNT(DISTINCT page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST(technologies) AS technology,
  UNNEST(technology.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
  AND wappalyzer.category IS NULL
GROUP BY 1, 2
ORDER BY category ASC
The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.
A few cases:
One observation: in most of these cases, only the values within detected_technologies hold correct data (the keys are also affected).
Maybe we should switch to using it for the BigQuery data?
For example:
technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
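To make the idea easier to try out, here is a minimal self-contained sketch of the same transformation. The `build_technologies` helper and the sample payload are hypothetical. The payload only assumes the shape implied by the snippet above: each value carries `name`, `categories` (a list of objects with `name`), and `version`, while the key may be corrupted.

```python
def build_technologies(detected_technologies):
    """Build the technologies rows from the dict values only, ignoring the keys."""
    return [
        {
            "technology": technology["name"],
            "categories": [category["name"] for category in technology["categories"]],
            "info": [technology["version"]],
        }
        for technology in detected_technologies.values()
    ]


# Hypothetical sample: the key is corrupted, but the value is intact.
sample = {
    "undefined": {
        "name": "jQuery",
        "categories": [{"name": "JavaScript libraries"}],
        "version": "3.7.1",
    }
}

rows = build_technologies(sample)
```

Because the keys are never read, a corrupted key such as `"undefined"` would not leak into the output under these assumptions.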