More corrupted values in technology detection data #29

Here is a query that lists the pages with corrupted category values (categories that do not exist in Wappalyzer):
```sql
WITH wappalyzer AS (
  -- All category names known to Wappalyzer
  SELECT DISTINCT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)

SELECT
  -- `tech` is a struct, so select its name field to keep GROUP BY valid
  tech.technology,
  category,
  COUNT(DISTINCT page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST(technologies) AS tech,
  UNNEST(tech.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
  -- Keep only categories with no match in Wappalyzer
  AND wappalyzer.category IS NULL
GROUP BY 1, 2
ORDER BY category ASC
```

The detection itself seems to work fine. It looks like the page context is interfering with some built-in objects again.
Maybe we could avoid relying on any values that the page can affect.

A few cases:
- https://newcar.one2car.com/search ([capitalised](https://webpagetest.httparchive.org/result/241210_GV_2/))
- https://ascf.amorepacific.co.kr/ ([whitespace removed](https://webpagetest.httparchive.org/result/241210_YX_3/))
- https://advancement.shu.edu/get-involved/events-calendar.html ([replaced with HTML](https://webpagetest.httparchive.org/jsonResult.php?test=241210_3H_5&pretty=1))
- https://www.gmi.go.kr/ (lowercase with dashes)
- https://iot.lostnfound.com/en/functions/ (replaced with `undefined`)
- etc
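
The patterns above could probably be flagged automatically by comparing each reported category against the known Wappalyzer category names. A rough sketch follows; the function name `classify_category`, the labels, and the sample values are hypothetical and are not taken from the pages listed.

```python
import re


def classify_category(value, known_categories):
    """Return a rough label for how `value` differs from the known category names."""
    if value in known_categories:
        return "ok"
    if value == "undefined":
        return "replaced with undefined"
    if "<" in value and ">" in value:
        return "replaced with HTML"
    for known in known_categories:
        # Same letters, different casing
        if value.lower() == known.lower():
            return "case changed"
        # Known name with all whitespace stripped
        if value == re.sub(r"\s+", "", known):
            return "whitespace removed"
        # Known name lowercased, whitespace replaced by dashes
        if value == re.sub(r"\s+", "-", known.lower()):
            return "lowercase with dashes"
    return "unknown"


# Hypothetical known categories and corrupted values, for illustration only
known = {"JavaScript frameworks", "CMS"}
for sample in ["JAVASCRIPT FRAMEWORKS", "JavaScriptframeworks", "javascript-frameworks"]:
    print(sample, "->", classify_category(sample, known))
```

This only labels values that are a simple transformation of a known name; anything else falls through to `"unknown"` and would still need a manual look.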

One observation: in most of these cases, only the values within `detected_technologies` contain correct data (the keys are corrupted too).
Maybe we should switch to those values for the BigQuery data?
For example:
```python
technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
```
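
To make the idea easier to try out, here is a self-contained variant of the snippet above. It assumes `detected_technologies` is a dict whose values carry `name`, `categories` (a list of objects with a `name`), and `version`, as the snippet implies. The function name `build_technologies`, the handling of a missing version, and the sample input are assumptions for illustration.

```python
def build_technologies(detected_technologies):
    """Build technology records from the dict values only; the keys are ignored."""
    return [
        {
            "technology": tech["name"],
            "categories": [cat["name"] for cat in tech.get("categories", [])],
            # Assumption: a missing or empty version yields an empty info list
            "info": [tech["version"]] if tech.get("version") else [],
        }
        for tech in detected_technologies.values()
    ]


# Hypothetical input: the key is corrupted, the value is intact
sample = {
    "undefined": {
        "name": "jQuery",
        "categories": [{"name": "JavaScript libraries"}],
        "version": "3.6.0",
    }
}
print(build_technologies(sample))
```

Because only `.values()` is read, a corrupted key such as `undefined` never reaches the output.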

