Merged
37 changes: 20 additions & 17 deletions ENV.md
@@ -22,6 +22,7 @@ Please ensure these are properly defined in a `.env` file in the root directory.
| `DISCORD_WEBHOOK_URL` | The URL for the Discord webhook used for notifications | `abc123` |
| `HUGGINGFACE_INFERENCE_API_KEY` | The API key required for accessing the Hugging Face Inference API. | `abc123` |
| `HUGGINGFACE_HUB_TOKEN` | The API key required for uploading to the PDAP HuggingFace account via Hugging Face Hub API. | `abc123` |
| `INTERNET_ARCHIVE_S3_KEYS` | Keys used for saving a URL to the Internet Archives. | `abc123:gpb0dk` |



@@ -32,25 +33,27 @@ Task flags are used to enable/disable certain tasks. They are set to `1` to enab

The following flags are available:

| Flag | Description |
|---------------------------------------|--------------------------------------------------------|
| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. |
| `URL_HTML_TASK_FLAG` | URL HTML scraping task. |
| `URL_RECORD_TYPE_TASK_FLAG` | Automatically assigns Record Types to URLs. |
| Flag | Description |
|-------------------------------------|--------------------------------------------------------|
| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. |
| `URL_HTML_TASK_FLAG` | URL HTML scraping task. |
| `URL_RECORD_TYPE_TASK_FLAG` | Automatically assigns Record Types to URLs. |
| `URL_AGENCY_IDENTIFICATION_TASK_FLAG` | Automatically assigns and suggests Agencies for URLs. |
| `URL_SUBMIT_APPROVED_TASK_FLAG` | Submits approved URLs to the Data Sources App. |
| `URL_MISC_METADATA_TASK_FLAG` | Adds misc metadata to URLs. |
| `URL_404_PROBE_TASK_FLAG` | Probes URLs for 404 errors. |
| `URL_AUTO_RELEVANCE_TASK_FLAG` | Automatically assigns Relevances to URLs. |
| `URL_PROBE_TASK_FLAG` | Probes URLs for web metadata. |
| `URL_ROOT_URL_TASK_FLAG` | Extracts and links Root URLs to URLs. |
| `SYNC_AGENCIES_TASK_FLAG` | Synchronize agencies from Data Sources App. |
| `SYNC_DATA_SOURCES_TASK_FLAG` | Synchronize data sources from Data Sources App. |
| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
| `URL_SUBMIT_APPROVED_TASK_FLAG` | Submits approved URLs to the Data Sources App. |
| `URL_MISC_METADATA_TASK_FLAG` | Adds misc metadata to URLs. |
| `URL_404_PROBE_TASK_FLAG` | Probes URLs for 404 errors. |
| `URL_AUTO_RELEVANCE_TASK_FLAG` | Automatically assigns Relevances to URLs. |
| `URL_PROBE_TASK_FLAG` | Probes URLs for web metadata. |
| `URL_ROOT_URL_TASK_FLAG` | Extracts and links Root URLs to URLs. |
| `SYNC_AGENCIES_TASK_FLAG` | Synchronize agencies from Data Sources App. |
| `SYNC_DATA_SOURCES_TASK_FLAG` | Synchronize data sources from Data Sources App. |
| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
| `POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG` | Populates the backlog snapshot. |
| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
| `IA_SAVE_TASK_FLAG` | Saves URLs to Internet Archives. |
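
As a rough illustration of how such flags might be read at startup, the sketch below treats `1` as enabled and any other value as disabled. The helper name `is_task_enabled` and its `default` parameter are hypothetical and not part of this project. How an unset flag is handled is not stated in the text above, so the default used here is an assumption.

```python
import os


def is_task_enabled(flag_name: str, default: bool = True) -> bool:
    """Return True when the environment variable `flag_name` is set to "1".

    `default` is returned when the variable is unset. That fallback is an
    assumption made for this sketch, not documented project behavior.
    """
    raw = os.environ.get(flag_name)
    if raw is None:
        return default
    return raw.strip() == "1"


# Example values, set here only so the sketch is self-contained.
os.environ["IA_SAVE_TASK_FLAG"] = "1"
os.environ["IA_PROBE_TASK_FLAG"] = "0"

print(is_task_enabled("IA_SAVE_TASK_FLAG"))
print(is_task_enabled("IA_PROBE_TASK_FLAG"))
```

Keeping the fallback as an explicit parameter makes the "unset" case a visible decision at each call site rather than a hidden convention.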



## Foreign Data Wrapper (FDW)
@@ -0,0 +1,43 @@
"""Add internet archives upload task

Revision ID: 8a70ee509a74
Revises: 2a7192657354
Create Date: 2025-08-17 18:30:18.353605

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

from src.util.alembic_helpers import id_column, url_id_column, created_at_column

# revision identifiers, used by Alembic.
revision: str = '8a70ee509a74'
down_revision: Union[str, None] = '2a7192657354'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None

IA_PROBE_METADATA_TABLE_NAME_OLD = "urls_internet_archive_metadata"
IA_PROBE_METADATA_TABLE_NAME_NEW = "url_internet_archives_probe_metadata"

IA_UPLOAD_METADATA_TABLE_NAME = "url_internet_archives_save_metadata"

def upgrade() -> None:
    _create_internet_archive_save_metadata_table()
    op.rename_table(IA_PROBE_METADATA_TABLE_NAME_OLD, IA_PROBE_METADATA_TABLE_NAME_NEW)


def downgrade() -> None:
    op.drop_table(IA_UPLOAD_METADATA_TABLE_NAME)
    op.rename_table(IA_PROBE_METADATA_TABLE_NAME_NEW, IA_PROBE_METADATA_TABLE_NAME_OLD)

def _create_internet_archive_save_metadata_table() -> None:
    op.create_table(
        IA_UPLOAD_METADATA_TABLE_NAME,
        id_column(),
        url_id_column(),
        created_at_column(),
        sa.Column('last_uploaded_at', sa.DateTime(), nullable=False, server_default=sa.text('now()')),
    )


This file was deleted.

@@ -1,6 +1,5 @@
from src.db.models.impl.url.internet_archives.probe.pydantic import URLInternetArchiveMetadataPydantic
from src.external.internet_archives.models.ia_url_mapping import InternetArchivesURLMapping
from src.db.models.impl.flag.checked_for_ia.pydantic import FlagURLCheckedForInternetArchivesPydantic
from src.db.models.impl.url.ia_metadata.pydantic import URLInternetArchiveMetadataPydantic
from src.util.url_mapper import URLMapper


@@ -14,7 +14,7 @@
from src.db.enums import TaskType
from src.db.models.impl.flag.checked_for_ia.pydantic import FlagURLCheckedForInternetArchivesPydantic
from src.db.models.impl.url.error_info.pydantic import URLErrorPydanticInfo
from src.db.models.impl.url.ia_metadata.pydantic import URLInternetArchiveMetadataPydantic
from src.db.models.impl.url.internet_archives.probe.pydantic import URLInternetArchiveMetadataPydantic
from src.external.internet_archives.client import InternetArchivesClient
from src.external.internet_archives.models.ia_url_mapping import InternetArchivesURLMapping
from src.util.url_mapper import URLMapper
14 changes: 14 additions & 0 deletions src/core/tasks/scheduled/impl/internet_archives/save/filter.py
@@ -0,0 +1,14 @@
from src.core.tasks.scheduled.impl.internet_archives.save.models.mapping import URLInternetArchivesSaveResponseMapping
from src.core.tasks.scheduled.impl.internet_archives.save.models.subset import IASaveURLMappingSubsets


def filter_save_responses(
    resp_mappings: list[URLInternetArchivesSaveResponseMapping]
) -> IASaveURLMappingSubsets:
    subsets = IASaveURLMappingSubsets()
    for resp_mapping in resp_mappings:
        if resp_mapping.response.has_error:
            subsets.error.append(resp_mapping.response)
        else:
            subsets.success.append(resp_mapping.response)
    return subsets

18 changes: 18 additions & 0 deletions src/core/tasks/scheduled/impl/internet_archives/save/mapper.py
@@ -0,0 +1,18 @@
from src.core.tasks.scheduled.impl.internet_archives.save.models.entry import InternetArchivesSaveTaskEntry


class URLToEntryMapper:

    def __init__(self, entries: list[InternetArchivesSaveTaskEntry]):
        self._url_to_entry: dict[str, InternetArchivesSaveTaskEntry] = {
            entry.url: entry for entry in entries
        }

    def get_is_new(self, url: str) -> bool:
        return self._url_to_entry[url].is_new

    def get_url_id(self, url: str) -> int:
        return self._url_to_entry[url].url_id

    def get_all_urls(self) -> list[str]:
        return list(self._url_to_entry.keys())
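
A minimal usage sketch of the lookup pattern in `URLToEntryMapper`, using a plain dataclass `Entry` as a hypothetical stand-in for `InternetArchivesSaveTaskEntry` so that it runs without the project's imports. Because the mapper is keyed by URL string, a later entry with the same URL would replace an earlier one, and looking up an unknown URL raises `KeyError`.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for InternetArchivesSaveTaskEntry."""
    url: str
    url_id: int
    is_new: bool


class URLToEntryMapper:
    """Same lookup logic as the class in the diff, retyped for the stand-in."""

    def __init__(self, entries: list[Entry]):
        # Keyed by URL; a later duplicate URL replaces the earlier entry.
        self._url_to_entry: dict[str, Entry] = {
            entry.url: entry for entry in entries
        }

    def get_is_new(self, url: str) -> bool:
        return self._url_to_entry[url].is_new

    def get_url_id(self, url: str) -> int:
        return self._url_to_entry[url].url_id

    def get_all_urls(self) -> list[str]:
        return list(self._url_to_entry.keys())


mapper = URLToEntryMapper([
    Entry(url="https://example.com/a", url_id=1, is_new=True),
    Entry(url="https://example.com/b", url_id=2, is_new=False),
])

print(mapper.get_all_urls())
print(mapper.get_url_id("https://example.com/b"))
print(mapper.get_is_new("https://example.com/a"))
```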
@@ -0,0 +1,15 @@
from pydantic import BaseModel

from src.db.dtos.url.mapping import URLMapping


class InternetArchivesSaveTaskEntry(BaseModel):
    url: str
    url_id: int
    is_new: bool

    def to_url_mapping(self) -> URLMapping:
        return URLMapping(
            url_id=self.url_id,
            url=self.url
        )

@@ -0,0 +1,8 @@
from pydantic import BaseModel

from src.external.internet_archives.models.save_response import InternetArchivesSaveResponseInfo


class URLInternetArchivesSaveResponseMapping(BaseModel):
    url: str
    response: InternetArchivesSaveResponseInfo

@@ -0,0 +1,8 @@
from pydantic import BaseModel

from src.core.tasks.scheduled.impl.internet_archives.save.models.mapping import URLInternetArchivesSaveResponseMapping


class IASaveURLMappingSubsets(BaseModel):
    error: list[URLInternetArchivesSaveResponseMapping] = []
    success: list[URLInternetArchivesSaveResponseMapping] = []

134 changes: 134 additions & 0 deletions src/core/tasks/scheduled/impl/internet_archives/save/operator.py
@@ -0,0 +1,134 @@
from src.core.tasks.mixins.link_urls import LinkURLsMixin
from src.core.tasks.mixins.prereq import HasPrerequisitesMixin
from src.core.tasks.scheduled.impl.internet_archives.save.filter import filter_save_responses
from src.core.tasks.scheduled.impl.internet_archives.save.mapper import URLToEntryMapper
from src.core.tasks.scheduled.impl.internet_archives.save.models.entry import InternetArchivesSaveTaskEntry
from src.core.tasks.scheduled.impl.internet_archives.save.models.mapping import URLInternetArchivesSaveResponseMapping
from src.core.tasks.scheduled.impl.internet_archives.save.models.subset import IASaveURLMappingSubsets
from src.core.tasks.scheduled.impl.internet_archives.save.queries.get import \
    GetURLsForInternetArchivesSaveTaskQueryBuilder
from src.core.tasks.scheduled.impl.internet_archives.save.queries.prereq import \
    MeetsPrerequisitesForInternetArchivesSaveQueryBuilder
from src.core.tasks.scheduled.impl.internet_archives.save.queries.update import \
    UpdateInternetArchivesSaveMetadataQueryBuilder
from src.core.tasks.scheduled.templates.operator import ScheduledTaskOperatorBase
from src.db.client.async_ import AsyncDatabaseClient
from src.db.enums import TaskType
from src.db.models.impl.url.error_info.pydantic import URLErrorPydanticInfo
from src.db.models.impl.url.internet_archives.save.pydantic import URLInternetArchiveSaveMetadataPydantic
from src.external.internet_archives.client import InternetArchivesClient
from src.external.internet_archives.models.save_response import InternetArchivesSaveResponseInfo


class InternetArchivesSaveTaskOperator(
    ScheduledTaskOperatorBase,
    HasPrerequisitesMixin,
    LinkURLsMixin
):

    def __init__(
        self,
        adb_client: AsyncDatabaseClient,
        ia_client: InternetArchivesClient
    ):
        super().__init__(adb_client)
        self.ia_client = ia_client

    async def meets_task_prerequisites(self) -> bool:
        return await self.adb_client.run_query_builder(
            MeetsPrerequisitesForInternetArchivesSaveQueryBuilder()
        )

    @property
    def task_type(self) -> TaskType:
        return TaskType.IA_SAVE

    async def inner_task_logic(self) -> None:
        entries: list[InternetArchivesSaveTaskEntry] = await self._get_valid_urls()
        mapper = URLToEntryMapper(entries)
        url_ids = [entry.url_id for entry in entries]
        await self.link_urls_to_task(url_ids=url_ids)

        # Save all to internet archives and get responses
        resp_mappings: list[URLInternetArchivesSaveResponseMapping] = await self._save_all_to_internet_archives(
            mapper.get_all_urls()
        )

        # Separate errors from successful saves
        subsets: IASaveURLMappingSubsets = filter_save_responses(resp_mappings)

        # Save errors
        await self._add_errors_to_db(mapper, responses=subsets.error)

        # Save successful saves that are new archive entries
        await self._save_new_saves_to_db(mapper, ia_mappings=subsets.success)

        # Save successful saves that are existing archive entries
        await self._save_existing_saves_to_db(mapper, ia_mappings=subsets.success)

    async def _save_all_to_internet_archives(self, urls: list[str]) -> list[URLInternetArchivesSaveResponseMapping]:
        resp_mappings: list[URLInternetArchivesSaveResponseMapping] = []
        for url in urls:
            resp: InternetArchivesSaveResponseInfo = await self.ia_client.save_to_internet_archives(url)
            mapping = URLInternetArchivesSaveResponseMapping(
                url=url,
                response=resp
            )
            resp_mappings.append(mapping)
        return resp_mappings

    async def _get_valid_urls(self) -> list[InternetArchivesSaveTaskEntry]:
        return await self.adb_client.run_query_builder(
            GetURLsForInternetArchivesSaveTaskQueryBuilder()
        )

    async def _add_errors_to_db(
        self,
        mapper: URLToEntryMapper,
        responses: list[InternetArchivesSaveResponseInfo]
    ) -> None:
        error_info_list: list[URLErrorPydanticInfo] = []
        for response in responses:
            url_id = mapper.get_url_id(response.url)
            url_error_info = URLErrorPydanticInfo(
                url_id=url_id,
                error=response.error,
                task_id=self.task_id
            )
            error_info_list.append(url_error_info)
        await self.adb_client.bulk_insert(error_info_list)

    async def _save_new_saves_to_db(
        self,
        mapper: URLToEntryMapper,
        ia_mappings: list[URLInternetArchivesSaveResponseMapping]
    ) -> None:
        insert_objects: list[URLInternetArchiveSaveMetadataPydantic] = []
        for ia_mapping in ia_mappings:
            is_new = mapper.get_is_new(ia_mapping.url)
            if not is_new:
                continue
            insert_object = URLInternetArchiveSaveMetadataPydantic(
                url_id=mapper.get_url_id(ia_mapping.url),
            )
            insert_objects.append(insert_object)
        await self.adb_client.bulk_insert(insert_objects)

    async def _save_existing_saves_to_db(
        self,
        mapper: URLToEntryMapper,
        ia_mappings: list[URLInternetArchivesSaveResponseMapping]
    ) -> None:
        url_ids: list[int] = []
        for ia_mapping in ia_mappings:
            is_new = mapper.get_is_new(ia_mapping.url)
            if is_new:
                continue
            url_ids.append(mapper.get_url_id(ia_mapping.url))
        await self.adb_client.run_query_builder(
            UpdateInternetArchivesSaveMetadataQueryBuilder(
                url_ids=url_ids
            )
        )
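
The control flow of the operator can be sketched in isolation: each URL is saved one at a time, and each result is then routed to one of three outcomes (error, new save to insert, existing save to update). Everything named below (`FakeSaveResponse`, `FakeArchiveClient`, `route_saves`) is a hypothetical stand-in that performs no network or database I/O. The real `InternetArchivesClient` and response model are not shown in this diff, so their shape here is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSaveResponse:
    """Hypothetical stand-in for the save response; error is None on success."""
    url: str
    error: Optional[str] = None


class FakeArchiveClient:
    """Hypothetical client that fails for any URL containing 'bad'."""

    async def save_to_internet_archives(self, url: str) -> FakeSaveResponse:
        await asyncio.sleep(0)  # yield control, as a real request would
        if "bad" in url:
            return FakeSaveResponse(url=url, error="save failed")
        return FakeSaveResponse(url=url)


async def route_saves(
    client: FakeArchiveClient,
    is_new_by_url: dict[str, bool],
) -> tuple[list[str], list[str], list[str]]:
    """Save URLs sequentially and split them into three outcome lists."""
    errors: list[str] = []
    to_insert: list[str] = []
    to_update: list[str] = []
    for url, is_new in is_new_by_url.items():
        resp = await client.save_to_internet_archives(url)
        if resp.error is not None:
            errors.append(url)      # would be recorded as a URL error
        elif is_new:
            to_insert.append(url)   # would become a new metadata row
        else:
            to_update.append(url)   # would update an existing metadata row
    return errors, to_insert, to_update


errors, to_insert, to_update = asyncio.run(route_saves(
    FakeArchiveClient(),
    {
        "https://example.com/new": True,
        "https://example.com/old": False,
        "https://example.com/bad": True,
    },
))
print(errors, to_insert, to_update)
```

Awaiting each save in turn, as the operator does, keeps requests strictly sequential. Whether that is intended as rate limiting is not stated in the diff.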

@@ -0,0 +1,29 @@
from typing import Sequence

from sqlalchemy import RowMapping
from sqlalchemy.ext.asyncio import AsyncSession

from src.core.tasks.scheduled.impl.internet_archives.save.models.entry import InternetArchivesSaveTaskEntry
from src.core.tasks.scheduled.impl.internet_archives.save.queries.shared.get_valid_entries import \
    IA_SAVE_VALID_ENTRIES_QUERY
from src.db.helpers.session import session_helper as sh
from src.db.queries.base.builder import QueryBuilderBase


class GetURLsForInternetArchivesSaveTaskQueryBuilder(QueryBuilderBase):

    async def run(self, session: AsyncSession) -> list[InternetArchivesSaveTaskEntry]:
        query = (
            IA_SAVE_VALID_ENTRIES_QUERY
            # Limit to 15, which is the maximum number of URLs that can be saved at once.
            .limit(15)
        )

        db_mappings: Sequence[RowMapping] = await sh.mappings(session, query=query)
        return [
            InternetArchivesSaveTaskEntry(
                url_id=mapping["id"],
                url=mapping["url"],
                is_new=mapping["is_new"],
            ) for mapping in db_mappings
        ]
