[Data] br_inep_enem #1431

Open

thaismdr wants to merge 39 commits into main from update_enem

Conversation

@thaismdr
Contributor

@thaismdr thaismdr commented Feb 11, 2026

Pull Request Template - Pipeline

PR description:

  • Upload of new ENEM tables (participantes, resultados, questionario_socioeconomico_2024) - 2024 update.

Technical details:

  • Updates the structure of the ENEM 2024 data to reflect INEP's methodological changes motivated by the LGPD (Brazil's data protection law). The microdata are no longer released as a single base and are now organized into multiple tables (participants, results, items), following the new official release format.

  • The code of the school where the participant completed high school (id_escola) is masked for institutions with fewer than 10 participants (LGPD).

  • Main changes to the pipeline/scripts: creation of new tables, following the source's restructuring.

  • Changes to the data and schema: because of the mask that ENEM applies to schools with fewer than 10 students who took the exam, and other problems with id_escola in the Diretórios table, id_escola was removed from the tests on the resultados table.

Tests and validations:

  • Tests and validations related to the data/scripts:

    • Tested locally - dbt test: participantes, resultados, and questionario_socioeconomico_2024 passed.
    • Tested in the Cloud - checked the row count, rows per UF, and the count of unique values of id_inscricao and id_sequencial.

  • Worth noting about the tests: id_escola was removed from the tests on the resultados table.
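The Cloud checks described above amount to a few aggregate queries; in pandas they could be sketched like this (the tiny DataFrame here is purely illustrative, standing in for the real resultados table):

```python
import pandas as pd

# Illustrative data standing in for the real resultados table
df = pd.DataFrame(
    {
        "ano": [2024, 2024, 2024],
        "sigla_uf_prova": ["SP", "SP", "RJ"],
        "id_inscricao": ["a1", "a2", "a3"],
    }
)

# 1) total row count
total_rows = len(df)

# 2) rows per UF
rows_per_uf = df.groupby("sigla_uf_prova").size()

# 3) distinct count of the inscription id; equals the row count
#    when id_inscricao is unique
unique_ids = df["id_inscricao"].nunique()

assert unique_ids == total_rows, "id_inscricao should be unique per row"
print(total_rows, rows_per_uf.to_dict(), unique_ids)
```

The same three aggregates can of course be computed directly in BigQuery with GROUP BY / COUNT(DISTINCT ...) on the published table.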

Summary by CodeRabbit

Release Notes

  • New Features

    • Added three new ENEM datasets: participants, exam results, and socioeconomic questionnaires (2024+).
    • ENEM microdata now distributed across separate tables instead of a single file.
    • School identifiers are masked for institutions with fewer than 10 participants.
  • Documentation

    • Updated guidance on data structure changes effective 2024.
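A minimal sketch of the kind of mask described in the notes above, assuming a pandas DataFrame with an `id_escola` column (the threshold logic is an illustration, not INEP's exact procedure):

```python
import pandas as pd


def mask_small_schools(df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Null out id_escola where the school has fewer than `threshold` rows."""
    # Size of each school's group, broadcast back to every row
    counts = df.groupby("id_escola")["id_escola"].transform("size")
    out = df.copy()
    out.loc[counts < threshold, "id_escola"] = pd.NA
    return out


# School "A" has 12 participants (kept); school "B" has 3 (masked)
df = pd.DataFrame({"id_escola": ["A"] * 12 + ["B"] * 3})
masked = mask_small_schools(df)
print(masked["id_escola"].isna().sum())  # → 3
```

This is also why the PR drops id_escola from the resultados relationship tests: after masking, the column no longer reliably matches the official school directory.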

@thaismdr thaismdr requested a review from a team February 11, 2026 20:58
@thaismdr thaismdr self-assigned this Feb 11, 2026
@thaismdr thaismdr added the table-approve Triggers Table Approve on PR merge label Feb 11, 2026
@folhesgabriel folhesgabriel added the test-dev-model Run DBT tests in the modified models using basedosdados-dev Bigquery Project label Feb 13, 2026
@folhesgabriel
Collaborator

@thaismdr could you check the dictionary test? There were errors in the custom_dictionary_coverage_br_inep_enem__questionario_socioeconomico test

import warnings

import pandas as pd

warnings.filterwarnings("ignore")

Collaborator

@folhesgabriel folhesgabriel Feb 13, 2026


From what I can see, the to_partitions function is identical to the one we already have in the pipelines repository (pipelines/utils.py -> to_partitions). If so, import it directly instead of copy-pasting it into each table's processing script. That way we avoid unnecessary code duplication :)
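For reference, a Hive-style `key=value` partition writer of the kind this comment refers to follows the general pattern below. This is a rough sketch of the pattern only, not the actual implementation in pipelines/utils.py:

```python
from pathlib import Path

import pandas as pd


def to_partitions(
    data: pd.DataFrame, partition_columns: list[str], savepath: str
) -> None:
    """Write one CSV per unique combination of partition column values,
    using Hive-style key=value directories (e.g. ano=2024/data.csv)."""
    for keys, group in data.groupby(partition_columns):
        if not isinstance(keys, tuple):  # single-column groupby may yield scalars
            keys = (keys,)
        subdir = Path(savepath).joinpath(
            *[f"{col}={val}" for col, val in zip(partition_columns, keys)]
        )
        subdir.mkdir(parents=True, exist_ok=True)
        # Partition columns are encoded in the path, so drop them from the file
        group.drop(columns=partition_columns).to_csv(
            subdir / "data.csv", index=False
        )
```

Keeping one shared implementation like this in the utils module is exactly the de-duplication being suggested.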

@tricktx
Contributor

tricktx commented Mar 10, 2026

@thaismdr were you able to validate the tests? Do you need any help?

@coderabbitai

coderabbitai Bot commented Mar 27, 2026

📝 Walkthrough


A new ENEM dataset pipeline is introduced with three SQL models that normalize and partition staging source data into BigQuery tables, complemented by Python ingestion scripts that read raw CSV from external sources, apply transformations, and write partitioned outputs. Comprehensive schema definitions with validation tests and documentation are also added.

Changes

Cohort / File(s) — Summary

• SQL Data Models
  models/br_inep_enem/br_inep_enem__participantes.sql, models/br_inep_enem/br_inep_enem__resultados.sql, models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
  New dbt models sourcing from staging tables with safe_cast type normalization (identifiers/categories to string, years/grades to numeric types) and BigQuery partitioning by ano (2024–2025 range). Metadata labels include project_id=basedosdados and tema=educacao.

• Python Data Ingestion & Partitioning
  models/br_inep_enem/code/participantes.py, models/br_inep_enem/code/resultados.py
  New modules implementing CSV ingestion from Google Drive with chunked reading, column rename/coercion, and a to_partitions() utility for writing Hive-style partitioned output (CSV in append mode or Parquet with gzip compression) by ano. Executes immediately on import.

• Schema & Model Definitions
  models/br_inep_enem/schema.yml
  Added three model declarations with comprehensive column definitions, uniqueness/relationship tests (cross-referencing br_bd_diretorios_brasil__municipio and br_bd_diretorios_brasil__uf), and custom_dictionary_coverage validation against the ENEM dictionary.

• Metadata & Documentation
  models/br_inep_enem/readme.md, models/br_inep_enem/br_inep_enem__dicionario.sql
  Added documentation clarifying the 2024 structural shift from a single base to multi-table distribution (participants, results, socioeconomic questionnaire), a note on id_escola masking for institutions with fewer than 10 participants, and a dictionary update comment (Feb-2026).

Sequence Diagram(s)

sequenceDiagram
    participant GDrive as Google Drive<br/>(Raw CSV)
    participant PythonETL as Python ETL<br/>(participantes.py,<br/>resultados.py)
    participant PartStorage as Partitioned Storage<br/>(Hive-style)
    participant Staging as Staging Source<br/>(br_inep_enem_staging)
    participant SQLModels as dbt SQL Models
    participant BQ as BigQuery<br/>(br_inep_enem)

    GDrive->>PythonETL: Read CSV (PARTICIPANTES_2024.csv,<br/>RESULTADOS_2024.csv) in chunks
    PythonETL->>PythonETL: Transform & coerce columns<br/>to standardized types
    PythonETL->>PartStorage: Write partitioned by ano<br/>(key=value/data.csv or .parquet)
    PartStorage->>Staging: Data loaded into<br/>staging tables
    
    Staging->>SQLModels: Source data available<br/>(br_inep_enem_staging.*)
    SQLModels->>SQLModels: safe_cast columns<br/>to target types
    SQLModels->>SQLModels: Apply partitioning<br/>config (ano range 2024–2025)
    SQLModels->>BQ: Materialized output<br/>(participantes, resultados,<br/>questionario_socioeconomico_2024)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 Hop hop, the data flows
From drives to partitions, our CSV glows
Cast and partition, by ano we go,
ENEM tables bloom in BigQuery's glow! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title '[Data] br_inep_enem' is concise, follows the repository's naming convention, and directly identifies the main change (new ENEM data tables for 2024).
  • Description check ✅ — The description covers the key sections from the template: PR objective (new ENEM tables), technical details (data structure changes, LGPD masking), and testing/validation. However, the Risk/Mitigation and Dependencies sections are incomplete or missing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/br_inep_enem/br_inep_enem__dicionario.sql`:
- Line 3: The test spec br_inep_enem__questionario_socioeconomico_2024 uses
uppercase column names Q001–Q023 but the SQL model defines them as lowercase
q001–q023; update the test specification to use lowercase column names (q001
through q023) so custom_dictionary_coverage, which matches nome_coluna
case-sensitively, will find the dictionary entries; ensure all references in
that spec to Q001–Q023 are changed to q001–q023.

In `@models/br_inep_enem/code/resultados.py`:
- Around line 89-91: The module currently uses a hard-coded global
caminho_leitura and calls read_csv_enem() on import, causing an immediate
ingestion; refactor by converting read_csv_enem into a typed function signature
(e.g., def read_csv_enem(caminho_leitura: str, caminho_saida: str) -> None) with
a Google-style docstring, remove reliance on the module-level caminho_leitura
variable, make the progress counter a local variable inside read_csv_enem, and
move the actual invocation into an if __name__ == "__main__": block that parses
or passes concrete paths; update any other functions referenced in this flow to
include type hints as needed.
- Around line 65-76: The CSV branch currently always appends to an existing
partition file (file_filter_save_path) via df_filter.to_csv(..., mode="a",
header=not file_filter_save_path.exists()), which causes duplicate rows on
retries; change the write strategy in the CSV path when file_type == "csv" to
either (a) remove/clear the existing file_filter_save_path once at the start of
the run before any chunk writes, or (b) write chunks into a fresh temporary
directory/temporary file and only atomically move/publish the final data.csv to
file_filter_save_path after the full load completes; apply the same fix to the
identical helper in models/br_inep_enem/code/participantes.py and ensure usage
of df_filter and file_filter_save_path is preserved when switching from append
to overwrite/publish.
- Around line 53-58: The current multi-column partition filter using
DataFrame.isin(...).all(axis=1) in the df_filter construction incorrectly
matches values regardless of column mapping; change it to build a column-wise
boolean mask by iterating over filter_combination.items() and combining
(data[col] == value) with & so each column is matched to its specific value
(refer to df_filter, data, filter_combination). Also add missing return type
annotations and a short docstring to read_csv_enem() and move the module-level
invocation into an if __name__ == "__main__": block so the module can be safely
imported.

In `@models/br_inep_enem/readme.md`:
- Around line 5-13: The headings "Testes realizados" and "Mudanças na
organização dos dados" are currently H3 (###) after the document H1, triggering
MD001; change those headings to H2 (##) so they follow the top-level heading
correctly. Locate the lines containing "### Testes realizados" and "### Mudanças
na organização dos dados" in the README and replace the triple-# with double-#
for each heading to resolve the markdownlint MD001 warning.

In `@models/br_inep_enem/schema.yml`:
- Around line 321-322: Update the description for the field id_escola in the
schema to state that values are masked for schools with fewer than 10
participants (i.e., not a raw INEP identifier) so downstream consumers don't
treat it as a reliable join key; edit the description text for the id_escola
field in models/br_inep_enem/schema.yml to explicitly mention
masking/obfuscation policy and any implications for joins or uniqueness.
- Around line 284-295: The descriptions for the fields id_municipio_prova and
sigla_uf_prova are incorrect (they currently say the school location) and will
publish wrong catalog metadata; update the description text for
id_municipio_prova and sigla_uf_prova to clearly state they refer to the
municipality and state where the prova (test/exam) took place (e.g., "ID do
município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla da unidade da
federação onde a prova foi realizada"), leaving the existing relationship tests
to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.
- Around line 431-456: The custom_dictionary_coverage block currently lists
columns as Q001–Q023 but the model and 2023 sibling use lowercase names
(q001–q023), causing the coverage macro
(tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update the
columns_covered_by_dictionary entries in the custom_dictionary_coverage mapping
to use the lowercase names (q001 through q023) so they match the model and the
dictionary lookup (leave dictionary_model: ref('br_inep_enem__dicionario')
unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e229039c-1a33-4169-9a13-6801f0d3e149

📥 Commits

Reviewing files that changed from the base of the PR and between 1183f22 and e8c9a07.

📒 Files selected for processing (8)
  • models/br_inep_enem/br_inep_enem__dicionario.sql
  • models/br_inep_enem/br_inep_enem__participantes.sql
  • models/br_inep_enem/br_inep_enem__questionario_socioeconomico_2024.sql
  • models/br_inep_enem/br_inep_enem__resultados.sql
  • models/br_inep_enem/code/participantes.py
  • models/br_inep_enem/code/resultados.py
  • models/br_inep_enem/readme.md
  • models/br_inep_enem/schema.yml

@@ -1,5 +1,6 @@
{{ config(alias="dicionario", schema="br_inep_enem") }}

-- atualizado em fev-2026. Acrescenta o dicionários das tabelas de 2024

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain (collapsed): a series of repository verification scripts inspected models/br_inep_enem/schema.yml, the questionario_socioeconomico SQL models, and tests-dbt/generic/custom_dictionary_coverage.sql, confirming that the 2024 test spec lists uppercase Q001–Q023 while the SQL model and the 2023 sibling spec use lowercase q001–q023.


Fix column name case mismatch in 2024 questionario test specification.

The br_inep_enem__questionario_socioeconomico_2024 test spec references uppercase column names (Q001–Q023) but the SQL model defines them in lowercase (q001–q023). The custom_dictionary_coverage test searches the dictionary for an exact case match on nome_coluna, so it will fail to find entries. Match the 2023 pattern: use lowercase column names (q001–q023) in the test specification.
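The mismatch is easy to demonstrate with a case-sensitive comparison, which is how custom_dictionary_coverage matches nome_coluna (the column names below are taken from the finding above):

```python
spec_columns = {f"Q{i:03d}" for i in range(1, 24)}   # Q001..Q023 (test spec)
model_columns = {f"q{i:03d}" for i in range(1, 24)}  # q001..q023 (SQL model)

# Case-sensitive match, as the coverage macro performs on nome_coluna:
print(spec_columns & model_columns)  # → set(): nothing matches

# Lowercasing the spec restores full coverage:
print({c.lower() for c in spec_columns} == model_columns)  # → True
```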


Comment on lines +53 to +58
            df_filter = data.loc[
                data[filter_combination.keys()]
                .isin(filter_combination.values())
                .all(axis=1),
                :,
            ]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Reproduces the current multi-column partition filter behavior.
python - <<'PY'
import pandas as pd

df = pd.DataFrame(
    {"ano": [2024, 1], "mes": [1, 2024], "row": ["expected", "wrong"]}
)
fc = {"ano": 2024, "mes": 1}

current = df.loc[
    df[list(fc.keys())].isin(list(fc.values())).all(axis=1),
    "row",
].tolist()
expected = df.loc[
    df["ano"].eq(2024) & df["mes"].eq(1),
    "row",
].tolist()

print("current_filter:", current)
print("expected_filter:", expected)
PY

Running the script shows the mismatch: current_filter returns both the "expected" and the "wrong" row, while expected_filter returns only "expected".

(Collapsed follow-up scripts located models/br_inep_enem/code/resultados.py and inspected the full read_csv_enem flow.)


Fix multi-column partition filter to match by column-value pair, not by value set membership.

DataFrame.isin(filter_combination.values()).all(axis=1) matches any row where all values exist somewhere in the value set, ignoring which value belongs to which column. For example, with {"ano": 2024, "mes": 1}, it will incorrectly match ano=1, mes=2024. Although the current call uses only one partition column (line 240), the function advertises multi-column support (docstring example), making this a latent bug. Replace with column-matched equality:

🔧 Suggested fix
-            df_filter = data.loc[
-                data[filter_combination.keys()]
-                .isin(filter_combination.values())
-                .all(axis=1),
-                :,
-            ]
+            mask = pd.Series(True, index=data.index)
+            for column, value in filter_combination.items():
+                mask &= data[column].eq(value)
+            df_filter = data.loc[mask, :]

Additionally, add missing return type annotations and docstring to read_csv_enem() per guidelines (lines 94–244), and move the module-level function call (line 246) to a if __name__ == "__main__": block to allow safe imports.


Comment on lines +65 to +76
            if file_type == "csv":
                # append data to csv
                file_filter_save_path = Path(filter_save_path) / "data.csv"
                df_filter.to_csv(
                    file_filter_save_path,
                    sep=",",
                    encoding="utf-8",
                    na_rep="",
                    index=False,
                    mode="a",
                    header=not file_filter_save_path.exists(),
                )

⚠️ Potential issue | 🔴 Critical

Avoid append-only writes to a stable output path.

Because Line 74 always appends to the existing partition file, rerunning or retrying this loader duplicates every chunk that was already written. Please clear the destination once per run or write into a fresh temp directory and publish only after the load finishes. The identical helper in models/br_inep_enem/code/participantes.py has the same corruption risk.
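The temp-and-publish strategy could be sketched as follows; the function name and the chunk iterable are illustrative, not the PR's actual code:

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable

import pandas as pd


def write_partition_atomically(
    chunks: Iterable[pd.DataFrame], final_path: Path
) -> None:
    """Append chunks to a temporary file, then atomically publish it as data.csv.

    A run that fails midway leaves any previously published data.csv untouched,
    so retries cannot duplicate rows.
    """
    final_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".csv")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        first = True
        for chunk in chunks:
            chunk.to_csv(tmp_path, mode="a", header=first, index=False)
            first = False
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    finally:
        tmp_path.unlink(missing_ok=True)  # no-op after a successful publish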


Comment on lines +89 to +91
caminho_leitura = (
    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
)

⚠️ Potential issue | 🟠 Major

Make this an explicit script entrypoint.

The hard-coded /content/drive/... paths plus the bare read_csv_enem() call at Line 246 mean any import triggers a full local ingest. Please pass the paths into a typed read_csv_enem(...)->None, move the invocation under if __name__ == "__main__":, and keep the progress counter local.

♻️ Suggested refactor
-valor = 0
-
-caminho_leitura = (
-    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
-)
-
-
-def read_csv_enem():
-    global valor
-    for df in pd.read_csv(
-        caminho_leitura + "RESULTADOS_2024.csv",
+DEFAULT_INPUT_PATH = (
+    "/content/drive/MyDrive/conjuntos/dados_brutos/br_inep_enem/microdados/"
+    "RESULTADOS_2024.csv"
+)
+DEFAULT_OUTPUT_PATH = "/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/"
+
+
+def read_csv_enem(input_path: str, output_path: str) -> None:
+    """Read the ENEM resultados CSV and write partitioned files.
+
+    Args:
+        input_path: Source CSV path.
+        output_path: Root directory for partitioned output.
+    """
+    for chunk_number, df in enumerate(
+        pd.read_csv(
+            input_path,
             sep=";",
             encoding="latin1",
             chunksize=100000,
-    ):
-        valor = valor + 1
-        print(valor)
+        ),
+        start=1,
+    ):
+        print(chunk_number)
+        # ... existing transformation code ...
         to_partitions(
             data=df_lista,
             partition_columns=["ano"],
-            savepath="/content/drive/MyDrive/conjuntos/br_inep_enem/resultados/",
+            savepath=output_path,
             file_type="csv",
         )
 
 
-read_csv_enem()
+if __name__ == "__main__":
+    read_csv_enem(DEFAULT_INPUT_PATH, DEFAULT_OUTPUT_PATH)

As per coding guidelines, **/*.py: Add type hints and docstrings for Python functions following Google Style.

Also applies to: 94-103, 238-246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/code/resultados.py` around lines 89 - 91, The module
currently uses a hard-coded global caminho_leitura and calls read_csv_enem() on
import, causing an immediate ingestion; refactor by converting read_csv_enem
into a typed function signature (e.g., def read_csv_enem(caminho_leitura: str,
caminho_saida: str) -> None) with a Google-style docstring, remove reliance on
the module-level caminho_leitura variable, make the progress counter a local
variable inside read_csv_enem, and move the actual invocation into an if
__name__ == "__main__": block that parses or passes concrete paths; update any
other functions referenced in this flow to include type hints as needed.
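As a quick illustration of the `enumerate`-based pattern the refactor proposes, here is a minimal, self-contained sketch. The CSV text, column names, and function name are hypothetical stand-ins, not the pipeline's real data or API; the point is only that the progress counter stays local and the read is parameterized:

```python
import io

import pandas as pd

# Hypothetical stand-in for the ENEM CSV: semicolon-separated, as in the pipeline.
CSV_TEXT = "ano;nota\n2024;500\n2024;610\n2024;700\n"


def read_csv_chunked(buffer: io.StringIO, chunksize: int) -> list[int]:
    """Read a CSV in chunks, keeping the progress counter local.

    Args:
        buffer: File-like object holding the CSV text.
        chunksize: Number of rows per chunk.

    Returns:
        Row counts per chunk, in order.
    """
    rows_per_chunk = []
    for chunk_number, df in enumerate(
        pd.read_csv(buffer, sep=";", chunksize=chunksize), start=1
    ):
        print(f"chunk {chunk_number}: {len(df)} rows")
        rows_per_chunk.append(len(df))
    return rows_per_chunk


if __name__ == "__main__":
    counts = read_csv_chunked(io.StringIO(CSV_TEXT), chunksize=2)
    print(counts)  # [2, 1]
```

With no module-level side effects, importing the module (e.g. from tests) no longer triggers an ingest.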

Comment on lines +5 to +13
### Testes realizados

---
* Sobre a tabela ``resultados``
**Relação id_escola (Relationship: ID Escola)**
O código real da escola foi substituído por uma máscara quando a instituição possui menos de 10 participantes no exame. Como consequência, o campo de identificação da escola nem sempre corresponde ao código oficial, inviabilizando a validação direta dessa chave.

---
### Mudanças na organização dos dados
⚠️ Potential issue | 🟡 Minor

Use ## for these section headings.

After the H1 on Line 1, jumping straight to H3 triggers the current markdownlint MD001 warning.

📝 Suggested fix
```diff
-### Testes realizados
+## Testes realizados

-### Mudanças na organização dos dados
+## Mudanças na organização dos dados
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/readme.md` around lines 5 - 13, The headings "Testes
realizados" and "Mudanças na organização dos dados" are currently H3 (###) after
the document H1, triggering MD001; change those headings to H2 (##) so they
follow the top-level heading correctly. Locate the lines containing "### Testes
realizados" and "### Mudanças na organização dos dados" in the README and
replace the triple-# with double-# for each heading to resolve the markdownlint
MD001 warning.

Comment on lines +284 to +295
- name: id_municipio_prova
description: ID Município da escola em que estudou - IBGE 7 Dígitos
tests:
- relationships:
to: ref('br_bd_diretorios_brasil__municipio')
field: id_municipio
- name: sigla_uf_prova
description: Sigla da unidade da federação da escola em que estudou
tests:
- relationships:
to: ref('br_bd_diretorios_brasil__uf')
field: sigla

⚠️ Potential issue | 🟡 Minor

Correct the prova geography descriptions.

id_municipio_prova and sigla_uf_prova are documented as school-location fields. That will publish the wrong catalog metadata for these columns.

📝 Proposed fix
```diff
       - name: id_municipio_prova
-        description: ID Município da escola em que estudou - IBGE 7 Dígitos
+        description: ID Município da cidade da prova - IBGE 7 Dígitos
         tests:
           - relationships:
               to: ref('br_bd_diretorios_brasil__municipio')
               field: id_municipio
       - name: sigla_uf_prova
-        description: Sigla da unidade da federação da escola em que estudou
+        description: Sigla da unidade da federação da prova
         tests:
           - relationships:
               to: ref('br_bd_diretorios_brasil__uf')
               field: sigla
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 284 - 295, The descriptions for
the fields id_municipio_prova and sigla_uf_prova are incorrect (they currently
say the school location) and will publish wrong catalog metadata; update the
description text for id_municipio_prova and sigla_uf_prova to clearly state they
refer to the municipality and state where the prova (test/exam) took place
(e.g., "ID do município onde a prova foi realizada - IBGE 7 Dígitos" and "Sigla
da unidade da federação onde a prova foi realizada"), leaving the existing
relationship tests to ref('br_bd_diretorios_brasil__municipio') and
ref('br_bd_diretorios_brasil__uf') unchanged.

Comment on lines +321 to +322
- name: id_escola
description: ID Escola - Inep

⚠️ Potential issue | 🟡 Minor

Document that id_escola is masked.

This field is no longer a plain INEP identifier: this PR masks schools with fewer than 10 participants. Leaving the description generic makes downstream joins look valid when they are intentionally obfuscated.

📝 Proposed fix
```diff
       - name: id_escola
-        description: ID Escola - Inep
+        description: ID Escola - Inep. Valores são mascarados para escolas com menos de 10 participantes, conforme regra de LGPD.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 321 - 322, Update the
description for the field id_escola in the schema to state that values are
masked for schools with fewer than 10 participants (i.e., not a raw INEP
identifier) so downstream consumers don't treat it as a reliable join key; edit
the description text for the id_escola field in models/br_inep_enem/schema.yml
to explicitly mention masking/obfuscation policy and any implications for joins
or uniqueness.
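To see why a masked key breaks the relationship test, here is a hedged pandas sketch. The school IDs, the `"*****"` mask token, and the table shapes are all assumptions for illustration; the real mask format used by INEP may differ:

```python
import pandas as pd

# Hypothetical data: a directory of real INEP school IDs and a results table
# where one school ID has been replaced by an assumed mask token "*****".
diretorio = pd.DataFrame({"id_escola": ["11000010", "11000028"]})
resultados = pd.DataFrame({"id_escola": ["11000010", "*****", "11000028"]})

# A left merge with indicator surfaces the rows that no longer match the
# directory, which is exactly what a dbt relationships test would flag.
merged = resultados.merge(diretorio, on="id_escola", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
print(len(orphans))  # 1
```

This is why the PR removes `id_escola` from the relationship tests on `resultados`: the orphan rows are expected by design, not a data-quality failure.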

Comment on lines +431 to +456
- custom_dictionary_coverage:
columns_covered_by_dictionary:
- Q001
- Q002
- Q003
- Q004
- Q005
- Q006
- Q007
- Q008
- Q009
- Q010
- Q011
- Q012
- Q013
- Q014
- Q015
- Q016
- Q017
- Q018
- Q019
- Q020
- Q021
- Q022
- Q023
dictionary_model: ref('br_inep_enem__dicionario')

⚠️ Potential issue | 🟠 Major

Align custom_dictionary_coverage with the 2024 column names.

The macro in tests-dbt/generic/custom_dictionary_coverage.sql compares nome_coluna with the exact string listed here. This block uses Q001–Q023, while this model declares q001–q023 and the 2023 sibling follows the same lowercase convention, so the coverage lookup will miss the 2024 dictionary rows and fail on populated values.

🛠️ Proposed fix
```diff
       - custom_dictionary_coverage:
           columns_covered_by_dictionary:
-            - Q001
-            - Q002
-            - Q003
-            - Q004
-            - Q005
-            - Q006
-            - Q007
-            - Q008
-            - Q009
-            - Q010
-            - Q011
-            - Q012
-            - Q013
-            - Q014
-            - Q015
-            - Q016
-            - Q017
-            - Q018
-            - Q019
-            - Q020
-            - Q021
-            - Q022
-            - Q023
+            - q001
+            - q002
+            - q003
+            - q004
+            - q005
+            - q006
+            - q007
+            - q008
+            - q009
+            - q010
+            - q011
+            - q012
+            - q013
+            - q014
+            - q015
+            - q016
+            - q017
+            - q018
+            - q019
+            - q020
+            - q021
+            - q022
+            - q023
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_inep_enem/schema.yml` around lines 431 - 456, The
custom_dictionary_coverage block currently lists columns as Q001–Q023 but the
model and 2023 sibling use lowercase names (q001–q023), causing the coverage
macro (tests-dbt/generic/custom_dictionary_coverage.sql) to miss matches; update
the columns_covered_by_dictionary entries in the custom_dictionary_coverage
mapping to use the lowercase names (q001 through q023) so they match the model
and the dictionary lookup (leave dictionary_model:
ref('br_inep_enem__dicionario') unchanged).
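The underlying mechanics are just case-sensitive string matching. A minimal sketch (the dictionary contents below are hypothetical, standing in for the nome_coluna values in `br_inep_enem__dicionario`):

```python
# Dictionary rows store lowercase column names; membership checks on
# strings are case-sensitive, so uppercase declarations never match.
dictionary_columns = {"q001", "q002", "q003"}  # hypothetical dictionary rows

declared_upper = ["Q001", "Q002", "Q003"]
declared_lower = ["q001", "q002", "q003"]

missed = [c for c in declared_upper if c not in dictionary_columns]
matched = [c for c in declared_lower if c in dictionary_columns]
print(missed)   # ['Q001', 'Q002', 'Q003']
print(matched)  # ['q001', 'q002', 'q003']
```

Normalizing the schema.yml list to lowercase makes the coverage test compare like with like.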


Labels

* `table-approve`: Triggers Table Approve on PR merge
* `test-dev-model`: Run DBT tests in the modified models using the basedosdados-dev BigQuery project
