[Data] br_inep_enem #1431

Open

thaismdr wants to merge 39 commits into main from update_enem

Conversation

@thaismdr
Contributor

@thaismdr thaismdr commented Feb 11, 2026

Pull Request Template - Pipeline

PR description:

  • Upload of new ENEM tables (participantes, resultados, questionario_socioeconomico_2024) - 2024 update.

Technical details:

  • Updates the structure of the ENEM 2024 data to reflect INEP's methodological changes motivated by the LGPD (Brazil's data protection law). The microdata are no longer released as a single base and are now organized into multiple tables (participants, results, items), following the new official release format.

  • The code of the school where the participant completed high school (id_escola) is masked for institutions with fewer than 10 participants (LGPD).

  • Main changes to the pipeline/scripts: creation of new tables, following the source's restructuring.

  • Changes to the data and schema: because of the mask that ENEM applies to schools with fewer than 10 students who took the exam, and other problems with id_escola in the Diretórios table, id_escola was removed from the tests on the resultados table.

Tests and validations:

  • Tests and validations related to the data/scripts:

    • Tested locally - dbt test: participantes, resultados, and questionario_socioeconomico_2024 passed.
    • Tested in the Cloud - checked the row count, rows per UF, and the count of unique values of id_inscricao and id_sequencial.

  • Worth noting about the tests: id_escola was removed from the tests on the resultados table.
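The Cloud checks described above amount to a few aggregate queries; in pandas they could be sketched like this (the tiny DataFrame here is purely illustrative, standing in for the real resultados table):

```python
import pandas as pd

# Illustrative data standing in for the real resultados table
df = pd.DataFrame(
    {
        "ano": [2024, 2024, 2024],
        "sigla_uf_prova": ["SP", "SP", "RJ"],
        "id_inscricao": ["a1", "a2", "a3"],
    }
)

# 1) total row count
total_rows = len(df)

# 2) rows per UF
rows_per_uf = df.groupby("sigla_uf_prova").size()

# 3) distinct count of the inscription id; equals the row count
#    when id_inscricao is unique
unique_ids = df["id_inscricao"].nunique()

assert unique_ids == total_rows, "id_inscricao should be unique per row"
print(total_rows, rows_per_uf.to_dict(), unique_ids)
```

The same three aggregates can of course be computed directly in BigQuery with GROUP BY / COUNT(DISTINCT ...) on the published table.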

Summary by CodeRabbit

Release Notes

  • New Features

    • Added three new ENEM datasets: participants, exam results, and socioeconomic questionnaires (2024+).
    • ENEM microdata now distributed across separate tables instead of a single file.
    • School identifiers are masked for institutions with fewer than 10 participants.
  • Documentation

    • Updated guidance on data structure changes effective 2024.
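A minimal sketch of the kind of mask described in the notes above, assuming a pandas DataFrame with an `id_escola` column (the threshold logic is an illustration, not INEP's exact procedure):

```python
import pandas as pd


def mask_small_schools(df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Null out id_escola where the school has fewer than `threshold` rows."""
    # Size of each school's group, broadcast back to every row
    counts = df.groupby("id_escola")["id_escola"].transform("size")
    out = df.copy()
    out.loc[counts < threshold, "id_escola"] = pd.NA
    return out


# School "A" has 12 participants (kept); school "B" has 3 (masked)
df = pd.DataFrame({"id_escola": ["A"] * 12 + ["B"] * 3})
masked = mask_small_schools(df)
print(masked["id_escola"].isna().sum())  # → 3
```

This is also why the PR drops id_escola from the resultados relationship tests: after masking, the column no longer reliably matches the official school directory.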

@thaismdr thaismdr requested a review from a team February 11, 2026 20:58
@thaismdr thaismdr self-assigned this Feb 11, 2026
@thaismdr thaismdr added the table-approve Triggers Table Approve on PR merge label Feb 11, 2026
@folhesgabriel folhesgabriel added the test-dev-model Run DBT tests in the modified models using basedosdados-dev Bigquery Project label Feb 13, 2026
@folhesgabriel
Collaborator

@thaismdr could you check the dictionary test? There were errors in the custom_dictionary_coverage_br_inep_enem__questionario_socioeconomico test

import warnings

import pandas as pd

warnings.filterwarnings("ignore")

Collaborator

@folhesgabriel folhesgabriel Feb 13, 2026


From what I can see, the to_partitions function is identical to the one we already have in the pipelines repository (pipelines/utils.py -> to_partitions). If so, import it directly instead of copy-pasting it into each table's processing script. That way we avoid unnecessary code duplication :)
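For reference, a Hive-style `key=value` partition writer of the kind this comment refers to follows the general pattern below. This is a rough sketch of the pattern only, not the actual implementation in pipelines/utils.py:

```python
from pathlib import Path

import pandas as pd


def to_partitions(
    data: pd.DataFrame, partition_columns: list[str], savepath: str
) -> None:
    """Write one CSV per unique combination of partition column values,
    using Hive-style key=value directories (e.g. ano=2024/data.csv)."""
    for keys, group in data.groupby(partition_columns):
        if not isinstance(keys, tuple):  # single-column groupby may yield scalars
            keys = (keys,)
        subdir = Path(savepath).joinpath(
            *[f"{col}={val}" for col, val in zip(partition_columns, keys)]
        )
        subdir.mkdir(parents=True, exist_ok=True)
        # Partition columns are encoded in the path, so drop them from the file
        group.drop(columns=partition_columns).to_csv(
            subdir / "data.csv", index=False
        )
```

Keeping one shared implementation like this in the utils module is exactly the de-duplication being suggested.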

@tricktx
Contributor

tricktx commented Mar 10, 2026

@thaismdr were you able to validate the tests? Do you need any help?

@coderabbitai

coderabbitai Bot commented Mar 27, 2026

📝 Walkthrough


A new ENEM dataset pipeline is introduced with three SQL models that normalize and partition staging source data into BigQuery tables, complemented by Python ingestion scripts that read raw CSV from external sources, apply transformations, and write partitioned outputs. Comprehensive schema definitions with validation tests and documentation are also added.

Changes

Cohort / File(s) — Summary

• SQL Data Models
  models/br_inep_enem/br_inep_enem__participantes.sql, models/br_inep_enem/br_inep_enem__resultados.sql, models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
  New dbt models sourcing from staging tables with safe_cast type normalization (identifiers/categories to string, years/grades to numeric types) and BigQuery partitioning by ano (2024–2025 range). Metadata labels include project_id=basedosdados and tema=educacao.

• Python Data Ingestion & Partitioning
  models/br_inep_enem/code/participantes.py, models/br_inep_enem/code/resultados.py
  New modules implementing CSV ingestion from Google Drive with chunked reading, column rename/coercion, and a to_partitions() utility for writing Hive-style partitioned output (CSV in append mode or Parquet with gzip compression) by ano. Executes immediately on import.

• Schema & Model Definitions
  models/br_inep_enem/schema.yml
  Added three model declarations with comprehensive column definitions, uniqueness/relationship tests (cross-referencing br_bd_diretorios_brasil__municipio and br_bd_diretorios_brasil__uf), and custom_dictionary_coverage validation against the ENEM dictionary.

• Metadata & Documentation
  models/br_inep_enem/readme.md, models/br_inep_enem/br_inep_enem__dicionario.sql
  Added documentation clarifying the 2024 structural shift from a single base to multi-table distribution (participants, results, socioeconomic questionnaire), a note on id_escola masking for institutions with fewer than 10 participants, and a dictionary update comment (Feb-2026).

Sequence Diagram(s)

sequenceDiagram
    participant GDrive as Google Drive<br/>(Raw CSV)
    participant PythonETL as Python ETL<br/>(participantes.py,<br/>resultados.py)
    participant PartStorage as Partitioned Storage<br/>(Hive-style)
    participant Staging as Staging Source<br/>(br_inep_enem_staging)
    participant SQLModels as dbt SQL Models
    participant BQ as BigQuery<br/>(br_inep_enem)

    GDrive->>PythonETL: Read CSV (PARTICIPANTES_2024.csv,<br/>RESULTADOS_2024.csv) in chunks
    PythonETL->>PythonETL: Transform & coerce columns<br/>to standardized types
    PythonETL->>PartStorage: Write partitioned by ano<br/>(key=value/data.csv or .parquet)
    PartStorage->>Staging: Data loaded into<br/>staging tables
    
    Staging->>SQLModels: Source data available<br/>(br_inep_enem_staging.*)
    SQLModels->>SQLModels: safe_cast columns<br/>to target types
    SQLModels->>SQLModels: Apply partitioning<br/>config (ano range 2024–2025)
    SQLModels->>BQ: Materialized output<br/>(participantes, resultados,<br/>questionario_socioeconomico_2024)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 Hop hop, the data flows
From drives to partitions, our CSV glows
Cast and partition, by ano we go,
ENEM tables bloom in BigQuery's glow! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title '[Data] br_inep_enem' is concise, follows the repository's naming convention, and directly identifies the main change (new ENEM data tables for 2024).
  • Description check ✅ — The description covers the key sections from the template: PR objective (new ENEM tables), technical details (data structure changes, LGPD masking), and testing/validation. However, the Risk/Mitigation and Dependencies sections are incomplete or missing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/br_inep_enem/br_inep_enem__dicionario.sql`:
- Line 3: The test spec br_inep_enem__questionario_socioeconomico_2024 uses
uppercase column names Q001–Q023 but the SQL model defines them as lowercase
q001–q023; update the test specification to use lowercase column names (q001
through q023) so custom_dictionary_coverage, which matches nome_coluna
case-sensitively, will find the dictionary entries; ensure all references in
that spec to Q001–Q023 are changed to q001–q023.

In `@models/br_inep_enem/code/resultados.py`:
- Around line 89-91: The module currently uses a hard-coded global
caminho_leitura and calls read_csv_enem() on import, causing an immediate
ingestion; refactor by converting read_csv_enem into a typed function signature
(e.g., def read_csv_enem(caminho_leitura: str, caminho_saida: str) -> None) with
a Google-style docstring, remove reliance on the module-level caminho_leitura
variable, make the progress counter a local variable inside read_csv_enem, and
move the actual invocation into an if __name__ == "__main__": block that parses
or passes concrete paths; update any other functions referenced in this flow to
include type hints as needed.
- Around line 65-76: The CSV branch currently always appends to an existing
partition file (file_filter_save_path) via df_filter.to_csv(..., mode="a",
header=not file_filter_save_path.exists()), which causes duplicate rows on
retries; change the write strategy in the CSV path when file_type == "csv" to
either (a) remove/clear the existing file_filter_save_path once at the start of
the run before any chunk writes, or (b) write chunks into a fresh temporary
directory/temporary file and only atomically move/publish the final data.csv to
file_filter_save_path after the full load completes; apply the same fix to the
identical helper in models/br_inep_enem/code/participantes.py and ensure usage
of df_filter and file_filter_save_path is preserved when switching from append
to overwrite/publish.
- Around line 53-58: The current multi-column partition filter using
DataFrame.isin(...).all(axis=1) in the df_filter construction incorrectly
matches values regardless of column mapping; change it to build a column-wise
boolean mask by iterating over filter_combination.items() and combining
(data[col] == value) with & so each column is matched to its specific value
(refer to df_filter, data, filter_combination). Also add missing return type
annotations and a short docstring to read_csv_enem() and move the module-level
invocation into an if __name__ == "__main__": block so the module can be safely
imported.

In `@models/br_inep_enem/readme.md`:
- Around line 5-13: The headings "Testes realizados" and "Mudanças na
organização dos dados" are currently H3 (###) after the document H1, triggering
MD001; change those headings to H2 (##) so they follow the top-level heading
correctly. Locate the lines containing "### Testes realizados" and "### Mudanças
na organização dos dados" in the README and replace the triple-# with double-#
for each heading to resolve the markdownlint MD001 warning.

In `@models/br_inep_enem/schema.yml`:
- Around line 321-322: Update the description for the field id_escola in the
schema to state that values are masked for schools with fewer than 10
participants (i.e., not a raw INEP identifier) so downstream consumers don't
treat it as a reliable join key; edit the description text for the id_escola
field in models/br_inep_enem/schema.yml to explicitly mention
masking/obfuscation policy and any implications for joins or uniqueness.
- Around line 284-295: The descriptions for the fields id_municipio_prova and
sigla_uf_prova are incorrect (they currently say the school location) and will
publish wrong catalog metadata; update the description text for
id_municipio_prova and sigla_uf_prova to clearly state they refer to the
municipality and state where the prova (test/exam) took place (e.g., "ID do
município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla da unidade da
federação onde a prova foi realizada"), leaving the existing relationship tests
to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.
- Around line 431-456: The custom_dictionary_coverage block currently lists
columns as Q001–Q023 but the model and 2023 sibling use lowercase names
(q001–q023), causing the coverage macro
(tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update the
columns_covered_by_dictionary entries in the custom_dictionary_coverage mapping
to use the lowercase names (q001 through q023) so they match the model and the
dictionary lookup (leave dictionary_model: ref('br_inep_enem__dicionario')
unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e229039c-1a33-4169-9a13-6801f0d3e149

📥 Commits

Reviewing files that changed from the base of the PR and between 1183f22 and e8c9a07.

📒 Files selected for processing (8)
  • models/br_inep_enem/br_inep_enem__dicionario.sql
  • models/br_inep_enem/br_inep_enem__participantes.sql
  • models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
  • models/br_inep_enem/br_inep_enem__resultados.sql
  • models/br_inep_enem/code/participantes.py
  • models/br_inep_enem/code/resultados.py
  • models/br_inep_enem/readme.md
  • models/br_inep_enem/schema.yml

@@ -1,5 +1,6 @@
{{ config(alias="dicionario", schema="br_inep_enem") }}

-- atualizado em fev-2026. Acrescenta o dicionários das tabelas de 2024

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain (collapsed): a series of repository verification scripts inspected models/br_inep_enem/schema.yml, the questionario_socioeconomico SQL models, and tests-dbt/generic/custom_dictionary_coverage.sql, confirming that the 2024 test spec lists uppercase Q001–Q023 while the SQL model and the 2023 sibling spec use lowercase q001–q023.


Fix column name case mismatch in 2024 questionario test specification.

The br_inep_enem__questionario_socioeconomico_2024 test spec references uppercase column names (Q001–Q023) but the SQL model defines them in lowercase (q001–q023). The custom_dictionary_coverage test searches the dictionary for an exact case match on nome_coluna, so it will fail to find entries. Match the 2023 pattern: use lowercase column names (q001–q023) in the test specification.
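The mismatch is easy to demonstrate with a case-sensitive comparison, which is how custom_dictionary_coverage matches nome_coluna (the column names below are taken from the finding above):

```python
spec_columns = {f"Q{i:03d}" for i in range(1, 24)}   # Q001..Q023 (test spec)
model_columns = {f"q{i:03d}" for i in range(1, 24)}  # q001..q023 (SQL model)

# Case-sensitive match, as the coverage macro performs on nome_coluna:
print(spec_columns & model_columns)  # → set(): nothing matches

# Lowercasing the spec restores full coverage:
print({c.lower() for c in spec_columns} == model_columns)  # → True
```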


Comment on lines +53 to +58
            df_filter = data.loc[
                data[filter_combination.keys()]
                .isin(filter_combination.values())
                .all(axis=1),
                :,
            ]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Reproduces the current multi-column partition filter behavior.
python - <<'PY'
import pandas as pd

df = pd.DataFrame(
    {"ano": [2024, 1], "mes": [1, 2024], "row": ["expected", "wrong"]}
)
fc = {"ano": 2024, "mes": 1}

current = df.loc[
    df[list(fc.keys())].isin(list(fc.values())).all(axis=1),
    "row",
].tolist()
expected = df.loc[
    df["ano"].eq(2024) & df["mes"].eq(1),
    "row",
].tolist()

print("current_filter:", current)
print("expected_filter:", expected)
PY

Running the script shows the mismatch: current_filter returns both the "expected" and the "wrong" row, while expected_filter returns only "expected".

(Collapsed follow-up scripts located models/br_inep_enem/code/resultados.py and inspected the full read_csv_enem flow.)


Fix multi-column partition filter to match by column-value pair, not by value set membership.

DataFrame.isin(filter_combination.values()).all(axis=1) matches any row where all values exist somewhere in the value set, ignoring which value belongs to which column. For example, with {"ano": 2024, "mes": 1}, it will incorrectly match ano=1, mes=2024. Although the current call uses only one partition column (line 240), the function advertises multi-column support (docstring example), making this a latent bug. Replace with column-matched equality:

🔧 Suggested fix
-            df_filter = data.loc[
-                data[filter_combination.keys()]
-                .isin(filter_combination.values())
-                .all(axis=1),
-                :,
-            ]
+            mask = pd.Series(True, index=data.index)
+            for column, value in filter_combination.items():
+                mask &= data[column].eq(value)
+            df_filter = data.loc[mask, :]

Additionally, add missing return type annotations and docstring to read_csv_enem() per guidelines (lines 94–244), and move the module-level function call (line 246) to a if __name__ == "__main__": block to allow safe imports.


Comment on lines +65 to +76
            if file_type == "csv":
                # append data to csv
                file_filter_save_path = Path(filter_save_path) / "data.csv"
                df_filter.to_csv(
                    file_filter_save_path,
                    sep=",",
                    encoding="utf-8",
                    na_rep="",
                    index=False,
                    mode="a",
                    header=not file_filter_save_path.exists(),
                )

⚠️ Potential issue | 🔴 Critical

Avoid append-only writes to a stable output path.

Because Line 74 always appends to the existing partition file, rerunning or retrying this loader duplicates every chunk that was already written. Please clear the destination once per run or write into a fresh temp directory and publish only after the load finishes. The identical helper in models/br_inep_enem/code/participantes.py has the same corruption risk.
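The temp-and-publish strategy could be sketched as follows; the function name and the chunk iterable are illustrative, not the PR's actual code:

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable

import pandas as pd


def write_partition_atomically(
    chunks: Iterable[pd.DataFrame], final_path: Path
) -> None:
    """Append chunks to a temporary file, then atomically publish it as data.csv.

    A run that fails midway leaves any previously published data.csv untouched,
    so retries cannot duplicate rows.
    """
    final_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".csv")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        first = True
        for chunk in chunks:
            chunk.to_csv(tmp_path, mode="a", header=first, index=False)
            first = False
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    finally:
        tmp_path.unlink(missing_ok=True)  # no-op after a successful publish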


Comment on lines +89 to +91
caminho_leitura = (
    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
)

⚠️ Potential issue | 🟠 Major

Make this an explicit script entrypoint.

The hard-coded /content/drive/... paths plus the bare read_csv_enem() call at Line 246 mean any import triggers a full local ingest. Please pass the paths into a typed read_csv_enem(...)->None, move the invocation under if __name__ == "__main__":, and keep the progress counter local.

♻️ Suggested refactor
-valor = 0
-
-caminho_leitura = (
-    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
-)
-
-
-def read_csv_enem():
-    global valor
-    for df in pd.read_csv(
-        caminho_leitura + "RESULTADOS_2024.csv",
+DEFAULT_INPUT_PATH = (
+    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
+    "RESULTADOS_2024.csv"
+)
+DEFAULT_OUTPUT_PATH = "/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/"
+
+
+def read_csv_enem(input_path: str, output_path: str) -> None:
+    """Read the ENEM resultados CSV and write partitioned files.
+
+    Args:
+        input_path: Source CSV path.
+        output_path: Root directory for partitioned output.
+    """
+    for chunk_number, df in enumerate(
+        pd.read_csv(
+            input_path,
             sep=";",
             encoding="latin1",
             chunksize=100000,
-    ):
-        valor = valor + 1
-        print(valor)
+        ),
+        start=1,
+    ):
+        print(chunk_number)
+        # ... existing transformation code ...
         to_partitions(
             data=df_lista,
             partition_columns=["ano"],
-            savepath="/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/",
+            savepath=output_path,
             file_type="csv",
         )
 
 
-read_csv_enem()
+if __name__ == "__main__":
+    read_csv_enem(DEFAULT_INPUT_PATH, DEFAULT_OUTPUT_PATH)

As per coding guidelines, **/*.py: Add type hints and docstrings for Python functions following Google Style.

Also applies to: 94-103, 238-246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/code/resultados.py` around lines 89 - 91, The module
currently uses a hard-coded global caminho_leitura and calls read_csv_enem() on
import, causing an immediate ingestion; refactor by converting read_csv_enem
into a typed function signature (e.g., def read_csv_enem(caminho_leitura: str,
caminho_saida: str) -> None) with a Google-style docstring, remove reliance on
the module-level caminho_leitura variable, make the progress counter a local
variable inside read_csv_enem, and move the actual invocation into an if
__name__ == "__main__": block that parses or passes concrete paths; update any
other functions referenced in this flow to include type hints as needed.
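As a quick illustration of the `enumerate`-based pattern the refactor proposes, here is a minimal, self-contained sketch. The CSV text, column names, and function name are hypothetical stand-ins, not the pipeline's real data or API; the point is only that the progress counter stays local and the read is parameterized:

```python
import io

import pandas as pd

# Hypothetical stand-in for the ENEM CSV: semicolon-separated, as in the pipeline.
CSV_TEXT = "ano;nota\n2024;500\n2024;610\n2024;700\n"


def read_csv_chunked(buffer: io.StringIO, chunksize: int) -> list[int]:
    """Read a CSV in chunks, keeping the progress counter local.

    Args:
        buffer: File-like object holding the CSV text.
        chunksize: Number of rows per chunk.

    Returns:
        Row counts per chunk, in order.
    """
    rows_per_chunk = []
    for chunk_number, df in enumerate(
        pd.read_csv(buffer, sep=";", chunksize=chunksize), start=1
    ):
        print(f"chunk {chunk_number}: {len(df)} rows")
        rows_per_chunk.append(len(df))
    return rows_per_chunk


if __name__ == "__main__":
    counts = read_csv_chunked(io.StringIO(CSV_TEXT), chunksize=2)
    print(counts)  # [2, 1]
```

With no module-level side effects, importing the module (e.g. from tests) no longer triggers an ingest.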

Comment on lines +5 to +13
### Testes realizados

---
* Sobre a tabela ``resultados``
**Relação id_escola (Relationship: ID Escola)**
O código real da escola foi substituído por uma máscara quando a instituição possui menos de 10 participantes no exame. Como consequência, o campo de identificação da escola nem sempre corresponde ao código oficial, inviabilizando a validação direta dessa chave.

---
### Mudanças na organização dos dados
⚠️ Potential issue | 🟡 Minor

Use ## for these section headings.

After the H1 on Line 1, jumping straight to H3 triggers the current markdownlint MD001 warning.

📝 Suggested fix
```diff
-### Testes realizados
+## Testes realizados

-### Mudanças na organização dos dados
+## Mudanças na organização dos dados
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/readme.md` around lines 5 - 13, The headings "Testes
realizados" and "Mudanças na organização dos dados" are currently H3 (###) after
the document H1, triggering MD001; change those headings to H2 (##) so they
follow the top-level heading correctly. Locate the lines containing "### Testes
realizados" and "### Mudanças na organização dos dados" in the README and
replace the triple-# with double-# for each heading to resolve the markdownlint
MD001 warning.

Comment on lines +284 to +295
- name: id_municipio_prova
description: ID Município da escola em que estudou - IBGE 7 Dígitos
tests:
- relationships:
to: ref('br_bd_diretorios_brasil__municipio')
field: id_municipio
- name: sigla_uf_prova
description: Sigla da unidade da federação da escola em que estudou
tests:
- relationships:
to: ref('br_bd_diretorios_brasil__uf')
field: sigla

⚠️ Potential issue | 🟡 Minor

Correct the prova geography descriptions.

id_municipio_prova and sigla_uf_prova are documented as school-location fields. That will publish the wrong catalog metadata for these columns.

📝 Proposed fix
```diff
       - name: id_municipio_prova
-        description: ID Município da escola em que estudou - IBGE 7 Dígitos
+        description: ID Município da cidade da prova - IBGE 7 Dígitos
         tests:
           - relationships:
               to: ref('br_bd_diretorios_brasil__municipio')
               field: id_municipio
       - name: sigla_uf_prova
-        description: Sigla da unidade da federação da escola em que estudou
+        description: Sigla da unidade da federação da prova
         tests:
           - relationships:
               to: ref('br_bd_diretorios_brasil__uf')
               field: sigla
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 284 - 295, The descriptions for
the fields id_municipio_prova and sigla_uf_prova are incorrect (they currently
say the school location) and will publish wrong catalog metadata; update the
description text for id_municipio_prova and sigla_uf_prova to clearly state they
refer to the municipality and state where the prova (test/exam) took place
(e.g., "ID do município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla
da unidade da federação onde a prova foi realizada"), leaving the existing
relationship tests to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.

Comment on lines +321 to +322
- name: id_escola
description: ID Escola - Inep

⚠️ Potential issue | 🟡 Minor

Document that id_escola is masked.

This field is no longer a plain INEP identifier: this PR masks schools with fewer than 10 participants. Leaving the description generic makes downstream joins look valid when they are intentionally obfuscated.

📝 Proposed fix
```diff
       - name: id_escola
-        description: ID Escola - Inep
+        description: ID Escola - Inep. Valores são mascarados para escolas com menos de 10 participantes, conforme regra de LGPD.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 321 - 322, Update the
description for the field id_escola in the schema to state that values are
masked for schools with fewer than 10 participants (i.e., not a raw INEP
identifier) so downstream consumers don't treat it as a reliable join key; edit
the description text for the id_escola field in models/br_inep_enem/schema.yml
to explicitly mention masking/obfuscation policy and any implications for joins
or uniqueness.
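To see why a masked key breaks the relationship test, here is a hedged pandas sketch. The school IDs, the `"*****"` mask token, and the table shapes are all assumptions for illustration; the real mask format used by INEP may differ:

```python
import pandas as pd

# Hypothetical data: a directory of real INEP school IDs and a results table
# where one school ID has been replaced by an assumed mask token "*****".
diretorio = pd.DataFrame({"id_escola": ["11000010", "11000028"]})
resultados = pd.DataFrame({"id_escola": ["11000010", "*****", "11000028"]})

# A left merge with indicator surfaces the rows that no longer match the
# directory, which is exactly what a dbt relationships test would flag.
merged = resultados.merge(diretorio, on="id_escola", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
print(len(orphans))  # 1
```

This is why the PR removes `id_escola` from the relationship tests on `resultados`: the orphan rows are expected by design, not a data-quality failure.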

Comment on lines +431 to +456
- custom_dictionary_coverage:
columns_covered_by_dictionary:
- Q001
- Q002
- Q003
- Q004
- Q005
- Q006
- Q007
- Q008
- Q009
- Q010
- Q011
- Q012
- Q013
- Q014
- Q015
- Q016
- Q017
- Q018
- Q019
- Q020
- Q021
- Q022
- Q023
dictionary_model: ref('br_inep_enem__dicionario')

⚠️ Potential issue | 🟠 Major

Align custom_dictionary_coverage with the 2024 column names.

The macro in tests-dbt/generic/custom_dictionary_coverage.sql compares nome_coluna with the exact string listed here. This block uses Q001–Q023, while this model declares q001–q023 and the 2023 sibling follows the same lowercase convention, so the coverage lookup will miss the 2024 dictionary rows and fail on populated values.

🛠️ Proposed fix
```diff
       - custom_dictionary_coverage:
           columns_covered_by_dictionary:
-            - Q001
-            - Q002
-            - Q003
-            - Q004
-            - Q005
-            - Q006
-            - Q007
-            - Q008
-            - Q009
-            - Q010
-            - Q011
-            - Q012
-            - Q013
-            - Q014
-            - Q015
-            - Q016
-            - Q017
-            - Q018
-            - Q019
-            - Q020
-            - Q021
-            - Q022
-            - Q023
+            - q001
+            - q002
+            - q003
+            - q004
+            - q005
+            - q006
+            - q007
+            - q008
+            - q009
+            - q010
+            - q011
+            - q012
+            - q013
+            - q014
+            - q015
+            - q016
+            - q017
+            - q018
+            - q019
+            - q020
+            - q021
+            - q022
+            - q023
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 431 - 456, The
custom_dictionary_coverage block currently lists columns as Q001–Q023 but the
model and 2023 sibling use lowercase names (q001–q023), causing the coverage
macro (tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update
the columns_covered_by_dictionary entries in the custom_dictionary_coverage
mapping to use the lowercase names (q001 through q023) so they match the model
and the dictionary lookup (leave dictionary_model:
ref('br_inep_enem__dicionario') unchanged).
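The underlying mechanics are just case-sensitive string matching. A minimal sketch (the dictionary contents below are hypothetical, standing in for the nome_coluna values in `br_inep_enem__dicionario`):

```python
# Dictionary rows store lowercase column names; membership checks on
# strings are case-sensitive, so uppercase declarations never match.
dictionary_columns = {"q001", "q002", "q003"}  # hypothetical dictionary rows

declared_upper = ["Q001", "Q002", "Q003"]
declared_lower = ["q001", "q002", "q003"]

missed = [c for c in declared_upper if c not in dictionary_columns]
matched = [c for c in declared_lower if c in dictionary_columns]
print(missed)   # ['Q001', 'Q002', 'Q003']
print(matched)  # ['q001', 'q002', 'q003']
```

Normalizing the schema.yml list to lowercase makes the coverage test compare like with like.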


Labels

* `table-approve`: Triggers Table Approve on PR merge
* `test-dev-model`: Run DBT tests in the modified models using the basedosdados-dev BigQuery project
