Conversation
@thaismdr could you check the dictionary test? There were errors in the test custom_dictionary_coverage_br_inep_enem__questionario_socioeconomico
```python
import pandas as pd
```

```python
warnings.filterwarnings("ignore")
```
From what I saw, the to_partitions function is the same as the one in the pipelines repository (pipelines/utils.py -> to_partitions). If so, import it directly instead of copy-pasting it into each table's processing script. That way we avoid unnecessary code duplication :)
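For reference, the helper being duplicated is a hive-style partition writer. A minimal sketch of the idea (illustrative only, not the actual pipelines/utils.py implementation; the signature mirrors the to_partitions calls quoted later in this review):

```python
import tempfile
from pathlib import Path

import pandas as pd


def to_partitions(data: pd.DataFrame, partition_columns: list[str], savepath: str) -> None:
    """Write one data.csv per unique combination of partition columns (key=value dirs)."""
    for combo in data[partition_columns].drop_duplicates().to_dict("records"):
        # Build the key=value/key=value/... directory for this combination
        subdir = Path(savepath).joinpath(*(f"{k}={v}" for k, v in combo.items()))
        subdir.mkdir(parents=True, exist_ok=True)
        mask = pd.Series(True, index=data.index)
        for column, value in combo.items():
            mask &= data[column].eq(value)  # match each column to its own value
        data.loc[mask].drop(columns=partition_columns).to_csv(
            subdir / "data.csv", index=False
        )


outdir = tempfile.mkdtemp()
df = pd.DataFrame({"ano": [2024, 2024, 2025], "nota": [700.0, 650.5, 710.0]})
to_partitions(df, ["ano"], outdir)  # creates ano=2024/data.csv and ano=2025/data.csv
```

Importing the shared implementation keeps all table scripts on one tested code path.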
@thaismdr were you able to validate the tests? Do you need any help?
📝 Walkthrough

A new ENEM dataset pipeline is introduced with three SQL models that normalize and partition staging source data into BigQuery tables, complemented by Python ingestion scripts that read raw CSV from external sources, apply transformations, and write partitioned outputs. Comprehensive schema definitions with validation tests and documentation are also added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GDrive as Google Drive<br/>(Raw CSV)
    participant PythonETL as Python ETL<br/>(participantes.py,<br/>resultados.py)
    participant PartStorage as Partitioned Storage<br/>(Hive-style)
    participant Staging as Staging Source<br/>(br_inep_enem_staging)
    participant SQLModels as dbt SQL Models
    participant BQ as BigQuery<br/>(br_inep_enem)
    GDrive->>PythonETL: Read CSV (PARTICIPANTES_2024.csv,<br/>RESULTADOS_2024.csv) in chunks
    PythonETL->>PythonETL: Transform & coerce columns<br/>to standardized types
    PythonETL->>PartStorage: Write partitioned by ano<br/>(key=value/data.csv or .parquet)
    PartStorage->>Staging: Data loaded into<br/>staging tables
    Staging->>SQLModels: Source data available<br/>(br_inep_enem_staging.*)
    SQLModels->>SQLModels: safe_cast columns<br/>to target types
    SQLModels->>SQLModels: Apply partitioning<br/>config (ano range 2024–2025)
    SQLModels->>BQ: Materialized output<br/>(participantes, resultados,<br/>questionario_socioeconomico_2024)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@models/br_inep_enem/br_inep_enem__dicionario.sql`:
- Line 3: The test spec br_inep_enem__questionario_socioeconomico_2024 uses
uppercase column names Q001–Q023 but the SQL model defines them as lowercase
q001–q023; update the test specification to use lowercase column names (q001
through q023) so custom_dictionary_coverage, which matches nome_coluna
case-sensitively, will find the dictionary entries; ensure all references in
that spec to Q001–Q023 are changed to q001–q023.
In `@models/br_inep_enem/code/resultados.py`:
- Around line 89-91: The module currently uses a hard-coded global
caminho_leitura and calls read_csv_enem() on import, causing an immediate
ingestion; refactor by converting read_csv_enem into a typed function signature
(e.g., def read_csv_enem(caminho_leitura: str, caminho_saida: str) -> None) with
a Google-style docstring, remove reliance on the module-level caminho_leitura
variable, make the progress counter a local variable inside read_csv_enem, and
move the actual invocation into an if __name__ == "__main__": block that parses
or passes concrete paths; update any other functions referenced in this flow to
include type hints as needed.
- Around line 65-76: The CSV branch currently always appends to an existing
partition file (file_filter_save_path) via df_filter.to_csv(..., mode="a",
header=not file_filter_save_path.exists()), which causes duplicate rows on
retries; change the write strategy in the CSV path when file_type == "csv" to
either (a) remove/clear the existing file_filter_save_path once at the start of
the run before any chunk writes, or (b) write chunks into a fresh temporary
directory/temporary file and only atomically move/publish the final data.csv to
file_filter_save_path after the full load completes; apply the same fix to the
identical helper in models/br_inep_enem/code/participantes.py and ensure usage
of df_filter and file_filter_save_path is preserved when switching from append
to overwrite/publish.
- Around line 53-58: The current multi-column partition filter using
DataFrame.isin(...).all(axis=1) in the df_filter construction incorrectly
matches values regardless of column mapping; change it to build a column-wise
boolean mask by iterating over filter_combination.items() and combining
(data[col] == value) with & so each column is matched to its specific value
(refer to df_filter, data, filter_combination). Also add missing return type
annotations and a short docstring to read_csv_enem() and move the module-level
invocation into an if __name__ == "__main__": block so the module can be safely
imported.
In `@models/br_inep_enem/readme.md`:
- Around line 5-13: The headings "Testes realizados" and "Mudanças na
organização dos dados" are currently H3 (###) after the document H1, triggering
MD001; change those headings to H2 (##) so they follow the top-level heading
correctly. Locate the lines containing "### Testes realizados" and "### Mudanças
na organização dos dados" in the README and replace the triple-# with double-#
for each heading to resolve the markdownlint MD001 warning.
In `@models/br_inep_enem/schema.yml`:
- Around line 321-322: Update the description for the field id_escola in the
schema to state that values are masked for schools with fewer than 10
participants (i.e., not a raw INEP identifier) so downstream consumers don't
treat it as a reliable join key; edit the description text for the id_escola
field in models/br_inep_enem/schema.yml to explicitly mention
masking/obfuscation policy and any implications for joins or uniqueness.
- Around line 284-295: The descriptions for the fields id_municipio_prova and
sigla_uf_prova are incorrect (they currently say the school location) and will
publish wrong catalog metadata; update the description text for
id_municipio_prova and sigla_uf_prova to clearly state they refer to the
municipality and state where the prova (test/exam) took place (e.g., "ID do
município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla da unidade da
federação onde a prova foi realizada"), leaving the existing relationship tests
to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.
- Around line 431-456: The custom_dictionary_coverage block currently lists
columns as Q001–Q023 but the model and 2023 sibling use lowercase names
(q001–q023), causing the coverage macro
(tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update the
columns_covered_by_dictionary entries in the custom_dictionary_coverage mapping
to use the lowercase names (q001 through q023) so they match the model and the
dictionary lookup (leave dictionary_model: ref('br_inep_enem__dicionario')
unchanged).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e229039c-1a33-4169-9a13-6801f0d3e149
📒 Files selected for processing (8)
- models/br_inep_enem/br_inep_enem__dicionario.sql
- models/br_inep_enem/br_inep_enem__participantes.sql
- models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
- models/br_inep_enem/br_inep_enem__resultados.sql
- models/br_inep_enem/code/participantes.py
- models/br_inep_enem/code/resultados.py
- models/br_inep_enem/readme.md
- models/br_inep_enem/schema.yml
```diff
@@ -1,5 +1,6 @@
 {{ config(alias="dicionario", schema="br_inep_enem") }}

+-- atualizado em fev-2026. Acrescenta o dicionários das tabelas de 2024
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's explore the repository structure
find . -type f -name "*.sql" | grep -E "(br_inep_enem|dicionario)" | head -20
```

Repository: basedosdados/pipelines
Length of output: 1262

🏁 Script executed:

```shell
# Check the actual content of the file in question
cat -n models/br_inep_enem/br_inep_enem__dicionario.sql
```

Length of output: 579

🏁 Script executed:

```shell
# Search for validation rules or tests related to dictionary coverage
rg "custom_dictionary_coverage_br_inep_enem__questionario_socioeconomico" --type sql --type yaml --type py
```

Length of output: 48

🏁 Script executed:

```shell
# Look for the staging data source referenced in the model
rg "br_inep_enem_staging\.dicionario" --type sql -A 5 -B 5
```

Length of output: 667

🏁 Script executed:

```shell
# Search for validation rules in YAML files
rg "custom_dictionary_coverage|questionario_socioeconomico" --type yaml -A 3 -B 3
```

Length of output: 37563

🏁 Script executed:

```shell
# Look for dbt tests or validations related to br_inep_enem
find . -type f \( -name "*.yml" -o -name "*.yaml" \) | xargs grep -l "br_inep_enem" | head -10
```

Length of output: 116

🏁 Script executed:

```shell
# Search for any mention of dictionary coverage tests
rg "dictionary_coverage|dicionario.*coverage" --type yaml --type sql -i
```

Length of output: 4128

🏁 Script executed:

```shell
# Check the schema.yml or test files in br_inep_enem directory
ls -la models/br_inep_enem/
```

Length of output: 3160

🏁 Script executed:

```shell
# Find and display the test configuration for questionario_socioeconomico_2024
grep -A 15 "name: br_inep_enem__questionario_socioeconomico_2024" models/br_inep_enem/schema.yml | head -30
```

Length of output: 546

🏁 Script executed:

```shell
# Look at the test definition itself
cat tests-dbt/generic/custom_dictionary_coverage.sql
```

Length of output: 2262

🏁 Script executed:

```shell
# Check what columns are actually in the 2024 questionario model
cat -n models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
```

Length of output: 1342

🏁 Script executed:

```shell
# Look at the full schema.yml section for questionario_socioeconomico_2024 to see all tested columns
grep -A 50 "name: br_inep_enem__questionario_socioeconomico_2024" models/br_inep_enem/schema.yml | grep -A 45 "custom_dictionary_coverage:"
```

Length of output: 1772

🏁 Script executed:

```shell
# Check the 2023 version for comparison
grep -A 50 "name: br_inep_enem__questionario_socioeconomico_2023" models/br_inep_enem/schema.yml | grep -A 45 "custom_dictionary_coverage:"
```

Length of output: 1691

🏁 Script executed:

```shell
# Check if there are any seed files or data updates for the dictionary
find . -type f \( -name "*.csv" -o -name "*.yml" \) | grep -i "dict\|seed" | head -20
```

Length of output: 48

🏁 Script executed:

```shell
# Look for staging references or source definitions
rg "br_inep_enem_staging" --type yaml --type sql -B 2 -A 2 | head -40
```

Length of output: 3007

🏁 Script executed:

```shell
# Check if there's a sources.yml or similar that defines br_inep_enem_staging
find . -name "sources.yml" -o -name "*staging*" -type f | head -20
```

Length of output: 101

🏁 Script executed:

```shell
# Search for any .github workflows or CI config that might show test results
find .github -name "*.yml" -o -name "*.yaml" | head -10
```

Length of output: 382

🏁 Script executed:

```shell
# Check dbt_project.yml or other configs that might show test execution
cat dbt_project.yml | head -50
```

Length of output: 2117

🏁 Script executed:

```shell
# Search for any recent changes or comments about dictionary coverage failures
git log --oneline --all -20 2>/dev/null | head -20
```

Length of output: 1124

🏁 Script executed:

```shell
# Look for any documentation about the custom_dictionary_coverage test
rg "dictionary.*coverage|coverage.*dictionary" --type md -B 2 -A 2
```

Length of output: 48

🏁 Script executed:

```shell
# Check the actual column name casing in schema.yml for 2024 test more carefully
sed -n '/name: br_inep_enem__questionario_socioeconomico_2024/,/name: br_inep_enem__questionario_socioeconomico_2023/p' models/br_inep_enem/schema.yml | grep -A 30 "custom_dictionary_coverage:"
```

Length of output: 844

🏁 Script executed:

```shell
# Check if there are any recent PRs or branches with updates to the staging data
git branch -a 2>/dev/null | grep -i "enem\|dict\|2024" | head -10
```

Length of output: 48

🏁 Script executed:

```shell
# Check the 2023 questionario SQL model for column naming
cat -n models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2023.sql | head -30
```

Length of output: 1317

🏁 Script executed:

```shell
# Check if there's a pattern - look at what case the older models use
rg "safe_cast\(q[0-9]" models/br_inep_enem/ | head -10
```

Length of output: 995

🏁 Script executed:

```shell
# Verify the exact case used in 2023 test spec vs 2024
echo "=== 2023 test spec ===" && \
sed -n '/name: br_inep_enem__questionario_socioeconomico_2023/,/name: br_inep_enem__questionario_socioeconomico_2024/p' models/br_inep_enem/schema.yml | grep -A 30 "custom_dictionary_coverage:" | head -35 && \
echo -e "\n=== 2024 test spec ===" && \
sed -n '/name: br_inep_enem__questionario_socioeconomico_2024/,/columns:/p' models/br_inep_enem/schema.yml | grep -A 30 "custom_dictionary_coverage:"
```

Length of output: 813

🏁 Script executed:

```shell
# Get just the 2024 test spec clearly
awk '/name: br_inep_enem__questionario_socioeconomico_2024/,/name: [a-z_]*$/ {print NR": "$0}' models/br_inep_enem/schema.yml | grep -A 35 "custom_dictionary_coverage:" | head -40
```

Length of output: 819

🏁 Script executed:

```shell
# Alternative: extract from line numbers
grep -n "name: br_inep_enem__questionario_socioeconomico_2024" models/br_inep_enem/schema.yml
```

Length of output: 126

🏁 Script executed:

```shell
# Confirm the exact line numbers and compare
sed -n '405,430p' models/br_inep_enem/schema.yml | grep -A 25 "custom_dictionary_coverage:"
```

Length of output: 48

🏁 Script executed:

```shell
# Check if the dictionary test is sensitive to column name casing
cat tests-dbt/generic/custom_dictionary_coverage.sql | grep -A 5 "nome_coluna"
```

Length of output: 403

🏁 Script executed:

```shell
# Check the dictionary contents to see what column names it has for questionario tables
# Look for patterns in older questionario entries
rg "questionario_socioeconomico" models/br_inep_enem/schema.yml | grep "nome_coluna\|Q0\|q0" | head -20
```

Length of output: 48

🏁 Script executed:

```shell
# Get a clearer picture - show the exact test definition lines for both 2023 and 2024
sed -n '405,460p' models/br_inep_enem/schema.yml
```

Length of output: 1884

🏁 Script executed:

```shell
# Check the git diff or changes to see if dictionario entries were added
git diff HEAD~1..HEAD models/br_inep_enem/br_inep_enem__dicionario.sql 2>/dev/null || echo "No git history available"
```

Length of output: 90

🏁 Script executed:

```shell
# Look at what columns the 2023 model test actually specifies to confirm the pattern
sed -n '400,425p' models/br_inep_enem/schema.yml
```

Length of output: 1279

🏁 Script executed:

```shell
# Get the 2023 section line by line to see the exact column names in its test
grep -n "name: br_inep_enem__questionario_socioeconomico_2023" models/br_inep_enem/schema.yml
```

Length of output: 126

🏁 Script executed:

```shell
# Show that range
sed -n '365,430p' models/br_inep_enem/schema.yml
```

Length of output: 3317

🏁 Script executed:

```shell
# Show the 2023 model test definition
sed -n '516,560p' models/br_inep_enem/schema.yml
```

Length of output: 1557
Fix column name case mismatch in 2024 questionario test specification.
The br_inep_enem__questionario_socioeconomico_2024 test spec references uppercase column names (Q001–Q023) but the SQL model defines them in lowercase (q001–q023). The custom_dictionary_coverage test searches the dictionary for an exact case match on nome_coluna, so it will fail to find entries. Match the 2023 pattern: use lowercase column names (q001–q023) in the test specification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/br_inep_enem__dicionario.sql` at line 3, The test spec
br_inep_enem__questionario_socioeconomico_2024 uses uppercase column names
Q001–Q023 but the SQL model defines them as lowercase q001–q023; update the test
specification to use lowercase column names (q001 through q023) so
custom_dictionary_coverage, which matches nome_coluna case-sensitively, will
find the dictionary entries; ensure all references in that spec to Q001–Q023 are
changed to q001–q023.
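The mismatch is easy to see in isolation. A toy sketch of a case-sensitive coverage check (the dictionary values below are invented for illustration, not real nome_coluna rows):

```python
# Column names as the dictionary stores them (lowercase, matching the SQL model)
dictionary_columns = {"q001", "q002", "q003"}

spec_2024 = ["Q001", "Q002", "Q003"]  # current (uppercase) test spec
spec_2023 = ["q001", "q002", "q003"]  # lowercase convention used by the 2023 spec

# An exact-string comparison finds nothing for the uppercase spec
missing_2024 = [c for c in spec_2024 if c not in dictionary_columns]
missing_2023 = [c for c in spec_2023 if c not in dictionary_columns]
```

With exact matching, every uppercase entry reads as "not covered", which is precisely the test failure reported above.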
```python
df_filter = data.loc[
    data[filter_combination.keys()]
    .isin(filter_combination.values())
    .all(axis=1),
    :,
]
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Reproduces the current multi-column partition filter behavior.
python - <<'PY'
import pandas as pd
df = pd.DataFrame(
    {"ano": [2024, 1], "mes": [1, 2024], "row": ["expected", "wrong"]}
)
fc = {"ano": 2024, "mes": 1}
current = df.loc[
    df[list(fc.keys())].isin(list(fc.values())).all(axis=1),
    "row",
].tolist()
expected = df.loc[
    df["ano"].eq(2024) & df["mes"].eq(1),
    "row",
].tolist()
print("current_filter:", current)
print("expected_filter:", expected)
PY
```

Repository: basedosdados/pipelines
Length of output: 184

🏁 Script executed:

```shell
# Locate the file
fd "resultados.py" --type f
```

Length of output: 104

🏁 Script executed:

```shell
# If found, check the file content around lines 53-58
cat -n models/br_inep_enem/code/resultados.py | head -100
```

Length of output: 4020

🏁 Script executed:

```shell
# Continue reading the file to see the complete read_csv_enem function
cat -n models/br_inep_enem/code/resultados.py | tail -150
```

Length of output: 7192
Fix multi-column partition filter to match by column-value pair, not by value set membership.
DataFrame.isin(filter_combination.values()).all(axis=1) matches any row where all values exist somewhere in the value set, ignoring which value belongs to which column. For example, with {"ano": 2024, "mes": 1}, it will incorrectly match ano=1, mes=2024. Although the current call uses only one partition column (line 240), the function advertises multi-column support (docstring example), making this a latent bug. Replace with column-matched equality:
🔧 Suggested fix

```diff
-        df_filter = data.loc[
-            data[filter_combination.keys()]
-            .isin(filter_combination.values())
-            .all(axis=1),
-            :,
-        ]
+        mask = pd.Series(True, index=data.index)
+        for column, value in filter_combination.items():
+            mask &= data[column].eq(value)
+        df_filter = data.loc[mask, :]
```

Additionally, add missing return type annotations and a docstring to read_csv_enem() per guidelines (lines 94–244), and move the module-level function call (line 246) into an if __name__ == "__main__": block to allow safe imports.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
mask = pd.Series(True, index=data.index)
for column, value in filter_combination.items():
    mask &= data[column].eq(value)
df_filter = data.loc[mask, :]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/code/resultados.py` around lines 53 - 58, The current
multi-column partition filter using DataFrame.isin(...).all(axis=1) in the
df_filter construction incorrectly matches values regardless of column mapping;
change it to build a column-wise boolean mask by iterating over
filter_combination.items() and combining (data[col] == value) with & so each
column is matched to its specific value (refer to df_filter, data,
filter_combination). Also add missing return type annotations and a short
docstring to read_csv_enem() and move the module-level invocation into an if
__name__ == "__main__": block so the module can be safely imported.
```python
if file_type == "csv":
    # append data to csv
    file_filter_save_path = Path(filter_save_path) / "data.csv"
    df_filter.to_csv(
        file_filter_save_path,
        sep=",",
        encoding="utf-8",
        na_rep="",
        index=False,
        mode="a",
        header=not file_filter_save_path.exists(),
    )
```
Avoid append-only writes to a stable output path.
Because Line 74 always appends to the existing partition file, rerunning or retrying this loader duplicates every chunk that was already written. Please clear the destination once per run or write into a fresh temp directory and publish only after the load finishes. The identical helper in models/br_inep_enem/code/participantes.py has the same corruption risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/code/resultados.py` around lines 65 - 76, The CSV branch
currently always appends to an existing partition file (file_filter_save_path)
via df_filter.to_csv(..., mode="a", header=not file_filter_save_path.exists()),
which causes duplicate rows on retries; change the write strategy in the CSV
path when file_type == "csv" to either (a) remove/clear the existing
file_filter_save_path once at the start of the run before any chunk writes, or
(b) write chunks into a fresh temporary directory/temporary file and only
atomically move/publish the final data.csv to file_filter_save_path after the
full load completes; apply the same fix to the identical helper in
models/br_inep_enem/code/participantes.py and ensure usage of df_filter and
file_filter_save_path is preserved when switching from append to
overwrite/publish.
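One way to make the chunked CSV writes idempotent is option (b): accumulate chunks in a temp file and atomically publish it. The sketch below assumes all chunks for a partition are written in one run; publish_partition and its arguments are hypothetical names for illustration, not the script's current API.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def publish_partition(chunks: list[pd.DataFrame], destination: Path) -> None:
    """Append chunks to a temp file, then atomically publish it as the partition file."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".csv.tmp")
    os.close(fd)
    tmp_path = Path(tmp_name)
    for i, chunk in enumerate(chunks):
        # Header only on the first chunk; later chunks append rows
        chunk.to_csv(tmp_path, index=False, mode="a", header=(i == 0))
    os.replace(tmp_path, destination)  # atomic rename: a retry replaces, never appends


outdir = Path(tempfile.mkdtemp())
chunks = [pd.DataFrame({"ano": [2024]}), pd.DataFrame({"ano": [2024]})]
publish_partition(chunks, outdir / "ano=2024" / "data.csv")
publish_partition(chunks, outdir / "ano=2024" / "data.csv")  # rerun does not duplicate rows
```

Because os.replace is atomic on POSIX filesystems, a crashed or retried run can never leave a half-appended data.csv behind.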
```python
caminho_leitura = (
    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
)
```
Make this an explicit script entrypoint.
The hard-coded /content/drive/... paths plus the bare read_csv_enem() call at Line 246 mean any import triggers a full local ingest. Please pass the paths into a typed read_csv_enem(...)->None, move the invocation under if __name__ == "__main__":, and keep the progress counter local.
♻️ Suggested refactor

```diff
-valor = 0
-
-caminho_leitura = (
-    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
-)
-
-
-def read_csv_enem():
-    global valor
-    for df in pd.read_csv(
-        caminho_leitura + "RESULTADOS_2024.csv",
+DEFAULT_INPUT_PATH = (
+    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
+    "RESULTADOS_2024.csv"
+)
+DEFAULT_OUTPUT_PATH = "/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/"
+
+
+def read_csv_enem(input_path: str, output_path: str) -> None:
+    """Read the ENEM resultados CSV and write partitioned files.
+
+    Args:
+        input_path: Source CSV path.
+        output_path: Root directory for partitioned output.
+    """
+    for chunk_number, df in enumerate(
+        pd.read_csv(
+            input_path,
             sep=";",
             encoding="latin1",
             chunksize=100000,
-    ):
-        valor = valor + 1
-        print(valor)
+        ),
+        start=1,
+    ):
+        print(chunk_number)
+        # ... existing transformation code ...
         to_partitions(
             data=df_lista,
             partition_columns=["ano"],
-            savepath="/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/",
+            savepath=output_path,
             file_type="csv",
         )
-read_csv_enem()
+if __name__ == "__main__":
+    read_csv_enem(DEFAULT_INPUT_PATH, DEFAULT_OUTPUT_PATH)
```

As per coding guidelines, **/*.py: Add type hints and docstrings for Python functions following Google Style.
Also applies to: 94-103, 238-246
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/code/resultados.py` around lines 89 - 91, The module
currently uses a hard-coded global caminho_leitura and calls read_csv_enem() on
import, causing an immediate ingestion; refactor by converting read_csv_enem
into a typed function signature (e.g., def read_csv_enem(caminho_leitura: str,
caminho_saida: str) -> None) with a Google-style docstring, remove reliance on
the module-level caminho_leitura variable, make the progress counter a local
variable inside read_csv_enem, and move the actual invocation into an if
__name__ == "__main__": block that parses or passes concrete paths; update any
other functions referenced in this flow to include type hints as needed.
```markdown
### Testes realizados

---
* Sobre a tabela ``resultados``
**Relação id_escola (Relationship: ID Escola)**
O código real da escola foi substituído por uma máscara quando a instituição possui menos de 10 participantes no exame. Como consequência, o campo de identificação da escola nem sempre corresponde ao código oficial, inviabilizando a validação direta dessa chave.

---
### Mudanças na organização dos dados
```
Use ## for these section headings.
After the H1 on Line 1, jumping straight to H3 triggers the current markdownlint MD001 warning.
📝 Suggested fix

```diff
-### Testes realizados
+## Testes realizados
-### Mudanças na organização dos dados
+## Mudanças na organização dos dados
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
## Testes realizados

---
* Sobre a tabela ``resultados``
**Relação id_escola (Relationship: ID Escola)**
O código real da escola foi substituído por uma máscara quando a instituição possui menos de 10 participantes no exame. Como consequência, o campo de identificação da escola nem sempre corresponde ao código oficial, inviabilizando a validação direta dessa chave.

---
## Mudanças na organização dos dados
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
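MD001 is a purely mechanical rule; a throwaway checker along these lines (a sketch, not markdownlint itself) shows why the H1→H3 jump is flagged:

```python
import re


def md001_violations(markdown: str) -> list[str]:
    """Return ATX headings that jump more than one level past the previous heading."""
    violations = []
    previous_level = 0
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s", line)
        if not match:
            continue
        level = len(match.group(1))
        if previous_level and level > previous_level + 1:
            violations.append(line.strip())
        previous_level = level
    return violations


bad = "# br_inep_enem\n\n### Testes realizados\n"
good = "# br_inep_enem\n\n## Testes realizados\n"
```

Demoting the two section headings to H2 makes each increment exactly one level, which is all the rule checks.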
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/readme.md` around lines 5 - 13, The headings "Testes
realizados" and "Mudanças na organização dos dados" are currently H3 (###) after
the document H1, triggering MD001; change those headings to H2 (##) so they
follow the top-level heading correctly. Locate the lines containing "### Testes
realizados" and "### Mudanças na organização dos dados" in the README and
replace the triple-# with double-# for each heading to resolve the markdownlint
MD001 warning.
```yaml
- name: id_municipio_prova
  description: ID Município da escola em que estudou - IBGE 7 Dígitos
  tests:
    - relationships:
        to: ref('br_bd_diretorios_brasil__municipio')
        field: id_municipio
- name: sigla_uf_prova
  description: Sigla da unidade da federação da escola em que estudou
  tests:
    - relationships:
        to: ref('br_bd_diretorios_brasil__uf')
        field: sigla
```
Correct the prova geography descriptions.
id_municipio_prova and sigla_uf_prova are documented as school-location fields. That will publish the wrong catalog metadata for these columns.
📝 Proposed fix

```diff
 - name: id_municipio_prova
-  description: ID Município da escola em que estudou - IBGE 7 Dígitos
+  description: ID Município da cidade da prova - IBGE 7 Dígitos
   tests:
     - relationships:
         to: ref('br_bd_diretorios_brasil__municipio')
         field: id_municipio
 - name: sigla_uf_prova
-  description: Sigla da unidade da federação da escola em que estudou
+  description: Sigla da unidade da federação da prova
   tests:
     - relationships:
         to: ref('br_bd_diretorios_brasil__uf')
         field: sigla
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/schema.yml` around lines 284 - 295, The descriptions for
the fields id_municipio_prova and sigla_uf_prova are incorrect (they currently
say the school location) and will publish wrong catalog metadata; update the
description text for id_municipio_prova and sigla_uf_prova to clearly state they
refer to the municipality and state where the prova (test/exam) took place
(e.g., "ID do município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla
da unidade da federação onde a prova foi realizada"), leaving the existing
relationship tests to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.
```yaml
- name: id_escola
  description: ID Escola - Inep
```
Document that id_escola is masked.
This field is no longer a plain INEP identifier: this PR masks schools with fewer than 10 participants. Leaving the description generic makes downstream joins look valid when they are intentionally obfuscated.
📝 Proposed fix

```diff
 - name: id_escola
-  description: ID Escola - Inep
+  description: ID Escola - Inep. Valores são mascarados para escolas com menos de 10 participantes, conforme regra de LGPD.
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
- name: id_escola
  description: ID Escola - Inep. Valores são mascarados para escolas com menos de 10 participantes, conforme regra de LGPD.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/schema.yml` around lines 321 - 322, Update the
description for the field id_escola in the schema to state that values are
masked for schools with fewer than 10 participants (i.e., not a raw INEP
identifier) so downstream consumers don't treat it as a reliable join key; edit
the description text for the id_escola field in models/br_inep_enem/schema.yml
to explicitly mention masking/obfuscation policy and any implications for joins
or uniqueness.
```yaml
- custom_dictionary_coverage:
    columns_covered_by_dictionary:
      - Q001
      - Q002
      - Q003
      - Q004
      - Q005
      - Q006
      - Q007
      - Q008
      - Q009
      - Q010
      - Q011
      - Q012
      - Q013
      - Q014
      - Q015
      - Q016
      - Q017
      - Q018
      - Q019
      - Q020
      - Q021
      - Q022
      - Q023
    dictionary_model: ref('br_inep_enem__dicionario')
```
Align custom_dictionary_coverage with the 2024 column names.
The macro in tests-dbt/generic/custom_dictionary_coverage.sql compares nome_coluna with the exact string listed here. This block uses Q001–Q023, while this model declares q001–q023 and the 2023 sibling follows the same lowercase convention, so the coverage lookup will miss the 2024 dictionary rows and fail on populated values.
🛠️ Proposed fix

```diff
 - custom_dictionary_coverage:
     columns_covered_by_dictionary:
-      - Q001
-      - Q002
-      - Q003
-      - Q004
-      - Q005
-      - Q006
-      - Q007
-      - Q008
-      - Q009
-      - Q010
-      - Q011
-      - Q012
-      - Q013
-      - Q014
-      - Q015
-      - Q016
-      - Q017
-      - Q018
-      - Q019
-      - Q020
-      - Q021
-      - Q022
-      - Q023
+      - q001
+      - q002
+      - q003
+      - q004
+      - q005
+      - q006
+      - q007
+      - q008
+      - q009
+      - q010
+      - q011
+      - q012
+      - q013
+      - q014
+      - q015
+      - q016
+      - q017
+      - q018
+      - q019
+      - q020
+      - q021
+      - q022
+      - q023
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- custom_dictionary_coverage:
    columns_covered_by_dictionary:
      - q001
      - q002
      - q003
      - q004
      - q005
      - q006
      - q007
      - q008
      - q009
      - q010
      - q011
      - q012
      - q013
      - q014
      - q015
      - q016
      - q017
      - q018
      - q019
      - q020
      - q021
      - q022
      - q023
    dictionary_model: ref('br_inep_enem__dicionario')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_inep_enem/schema.yml` around lines 431 - 456, The
custom_dictionary_coverage block currently lists columns as Q001–Q023 but the
model and 2023 sibling use lowercase names (q001–q023), causing the coverage
macro (tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update
the columns_covered_by_dictionary entries in the custom_dictionary_coverage
mapping to use the lowercase names (q001 through q023) so they match the model
and the dictionary lookup (leave dictionary_model:
ref('br_inep_enem__dicionario') unchanged).
Template Pull Requests - Pipeline

PR description:

Technical details:

Updates the ENEM 2024 data structure to reflect INEP's methodological changes motivated by the LGPD. The microdata are no longer a single base and are now organized into multiple tables (participantes, resultados, itens), following the new official release format.

The code of the school where the student completed high school (id_escola) is masked for institutions with fewer than 10 participants (LGPD).

Tests and validations:

Report the tests and validations related to the data/script:

Worth noting about the tests: id_escola was removed from the tests of the resultados table.
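The masking rule described in the PR can be sketched in pandas (the threshold of 10 participants comes from the PR description; the mask token and column layout are illustrative assumptions):

```python
import pandas as pd


def mask_small_schools(
    df: pd.DataFrame, threshold: int = 10, token: str = "mascarado"
) -> pd.DataFrame:
    """Mask id_escola wherever the school has fewer than `threshold` participants."""
    # Per-row participant count of each school
    counts = df.groupby("id_escola")["id_escola"].transform("size")
    out = df.copy()
    out.loc[counts < threshold, "id_escola"] = token
    return out


df = pd.DataFrame({"id_escola": ["12345678"] * 12 + ["87654321"] * 3})
masked = mask_small_schools(df)  # only the 3-participant school is masked
```

This is also why id_escola can no longer serve as a reliable relationship-test key: masked rows intentionally break the join back to the INEP school directory.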
Summary by CodeRabbit
Release Notes
New Features
Documentation