66 commits
c76ee1d
add dag file UNflores
dipaolme Nov 3, 2022
6c5e0c1
extract function added
dipaolme Nov 4, 2022
18e67c8
Grupo B sprint1
BROC95 Nov 5, 2022
cc50cb9
GB csv data
BROC95 Nov 5, 2022
c9a25c4
Grupo B
BROC95 Nov 7, 2022
5885d29
Grupo B
BROC95 Nov 7, 2022
27203ae
upload sql files, update dags task extract
dipaolme Nov 7, 2022
18fdaa2
upload csv
dipaolme Nov 7, 2022
bdd48fb
hook conn
BROC95 Nov 7, 2022
5520e4c
logger by un
BROC95 Nov 8, 2022
46f35dd
Transform function pandas
BROC95 Nov 9, 2022
1ae6601
change name, queries in sql files
dipaolme Nov 9, 2022
82e0b0b
Load S3 amazon
BROC95 Nov 9, 2022
158783a
TaskLoad
BROC95 Nov 9, 2022
87a1bc3
update .env
BROC95 Nov 10, 2022
0b6d87e
Merge pull request #1 from BROC95/grupoB
BROC95 Nov 10, 2022
b3fc451
bucket
BROC95 Nov 10, 2022
59e216f
update .env
BROC95 Nov 10, 2022
ebde1c2
mod .env
BROC95 Nov 10, 2022
f6d9834
config .env
BROC95 Nov 10, 2022
ec2440f
Dag dynamic
BROC95 Nov 10, 2022
98e23b8
update .env
BROC95 Nov 10, 2022
57ff106
Dag dynamic .env
BROC95 Nov 10, 2022
ad675e0
update .env
BROC95 Nov 10, 2022
4ffedc6
update .env
BROC95 Nov 10, 2022
aad6e31
update .env
BROC95 Nov 10, 2022
62429ec
dag factory
BROC95 Nov 11, 2022
e245275
dag factory
BROC95 Nov 11, 2022
c2c1feb
dags files update
dipaolme Nov 11, 2022
f751e5f
upload logs
dipaolme Nov 11, 2022
1f23b82
upload processed txt
dipaolme Nov 11, 2022
5723a77
upload csv files
dipaolme Nov 11, 2022
a898453
Update README.md
dipaolme Nov 12, 2022
9d74ef7
notebook v1
BROC95 Nov 12, 2022
208f69c
Update README.md
dipaolme Nov 12, 2022
e799d43
update data
BROC95 Nov 12, 2022
dee6392
Update README.md
dipaolme Nov 12, 2022
d1e15c8
update notebook
BROC95 Nov 12, 2022
587bdd8
update notebook
BROC95 Nov 12, 2022
1622dba
add Grupo B GitHub link
BROC95 Nov 12, 2022
d90c893
update assets
BROC95 Nov 12, 2022
3e8e6b5
Update notebook
BROC95 Nov 12, 2022
8ca6603
ETL
BROC95 Nov 13, 2022
4734c9e
Merge pull request #3 from dipaolme/BROC95-patch-1
dipaolme Nov 13, 2022
ccaf884
fix minor errors
dipaolme Nov 14, 2022
2e24985
dynamic dags
dipaolme Nov 14, 2022
c0245bc
upload notebook
dipaolme Nov 14, 2022
64c0012
Merge pull request #5 from dipaolme/grupoA
BROC95 Nov 14, 2022
621f07e
update name GB
BROC95 Nov 14, 2022
404edfa
Merge branch 'master' of https://github.com/dipaolme/Skill-Up-DA-c-Py…
BROC95 Nov 14, 2022
96a1797
folders
BROC95 Nov 14, 2022
ec418bc
folders
BROC95 Nov 14, 2022
a71e410
del gA
BROC95 Nov 15, 2022
51bcbfa
delGA
BROC95 Nov 15, 2022
4e9ee4c
del gA
BROC95 Nov 15, 2022
3b01633
del GA
BROC95 Nov 15, 2022
859ce0f
del GA
BROC95 Nov 15, 2022
08fcc63
ga
BROC95 Nov 15, 2022
3df35ac
ga
BROC95 Nov 15, 2022
2c3a425
ga
BROC95 Nov 15, 2022
84b06a4
ga
BROC95 Nov 15, 2022
abd13dd
ga
BROC95 Nov 15, 2022
089ce80
ga
BROC95 Nov 15, 2022
0b6c125
ga
BROC95 Nov 15, 2022
b221c17
ga
BROC95 Nov 15, 2022
7bf491e
update
BROC95 Nov 15, 2022
7 changes: 7 additions & 0 deletions .env
@@ -0,0 +1,7 @@
POSTGRES_CONN_ID=alkemy_db
AWS_S3_CONN_ID=aws_s3_bucket
ACCESS_KEY=
SECRET_ACCESS_KEY=
BUCKET=
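
As a hedged sketch (assuming the python-dotenv package, which this diff does not add), these variables can be read in Python like so:

```python
# Hypothetical usage of the .env values above; requires python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

POSTGRES_CONN_ID = os.getenv("POSTGRES_CONN_ID")  # "alkemy_db"
AWS_S3_CONN_ID = os.getenv("AWS_S3_CONN_ID")      # "aws_s3_bucket"
BUCKET = os.getenv("BUCKET")  # left blank in the repo; fill in locally
```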


44 changes: 44 additions & 0 deletions .gitignore
@@ -0,0 +1,44 @@
.git
.vscode
.env
.ipynb_checkpoints
.code
.astro
Cleandata.ipynb
airflow_settings.yaml
astro
main.py
config.yaml
__pycache__
airflow_lab.py
airtask.py
example_dag_advanced.py
example_dag_basic.py
hookfunction.py
hookexample.py
/tests
Dockerfile
.dockerignore
packages.txt
docker-compose.yml
docker-compose.override.yml
logging.conf
dac.cfg
/logs
webserver_config.py
airflow.cfg
S3update_dag.py
config_bre.yaml
config_etl.yaml
template_dag.jinja2
Skill-Up-DA-c-PythonG1/.env

/dags_dynamic/
dags/dags_factory/
.po.py
nuevo.py
coco.py
po.py
GAUNVillaMaria_dag_etl.py


157 changes: 5 additions & 152 deletions README.md
@@ -1,4 +1,5 @@
# Project #1: Execution Flows


## Description
Client: Ministerio de Educación de la Nación
Initial situation
@@ -44,157 +45,9 @@ consultas SQL.
postal codes according to the normalized requirements specified for each
group of universities, using Pandas.

### Assets 🎨


The database with the information gathered by the Ministerio de Educación will be provided over the course of the project.


The auxiliary postal-code file is located in the assets folder.


## Requirements
### Airflow using Docker
https://docs.astronomer.io/software/install-cli?tab=windows#install-the-astro-cli

## Python modules used
- pathlib
- logging
- pandas
- datetime
- os
- sqlalchemy

## Structure and execution flow
".sql" files are generated with the queries for each educational institution, normalizing the relevant columns.

Using operators available in Apache Airflow (Python operators and Postgres operators), the ".sql" queries are executed to pull the data from the provided database.

The data is then transformed with the pandas library and stored locally as ".txt" files.

Finally, using the tools provided by AWS (S3 operators and hooks), the stored ".txt" files are read as strings and uploaded to the S3 service.
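
A minimal sketch of that flow as a single Airflow DAG (file paths, the GFUNRioCuarto prefix, and the bucket name are illustrative; the connection IDs follow the conventions below):

```python
# Illustrative sketch of the extract -> transform -> load flow described above.
# Paths, the GFUNRioCuarto prefix, and the bucket name are hypothetical.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract():
    # Run the institution's .sql query against the provided database
    hook = PostgresHook(postgres_conn_id="alkemy_db")
    with open("include/GFUNRioCuarto.sql") as f:
        df = hook.get_pandas_df(f.read())
    df.to_csv("files/GFUNRioCuarto_select.csv", index=False)


def transform():
    # Normalize column names with pandas and store the result as .txt
    df = pd.read_csv("files/GFUNRioCuarto_select.csv")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df.to_csv("datasets/GFUNRioCuarto_process.txt", index=False)


def load():
    # Upload the processed .txt file to the configured S3 bucket
    s3 = S3Hook(aws_conn_id="aws_s3_bucket")
    s3.load_file(
        filename="datasets/GFUNRioCuarto_process.txt",
        key="GFUNRioCuarto_process.txt",
        bucket_name="my-bucket",  # placeholder; comes from the BUCKET env var
        replace=True,
    )


with DAG(
    "GFUNRioCuarto_dag_etl",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    task_extract = PythonOperator(task_id="extract", python_callable=extract)
    task_transform = PythonOperator(task_id="transform", python_callable=transform)
    task_load = PythonOperator(task_id="load", python_callable=load)

    task_extract >> task_transform >> task_load
```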

# Creating a project wiki
We recommend creating a project wiki on GitHub for notes, lessons learned, or any additional conventions.

# **Folder naming convention**

OT000-python

-airflow

-assets: required supplementary files.

-dags: for the workflows as they are created.

-datasets: for the file resulting from the Pandas transformation step.

-files: for storing the database extraction.

-include: for storing the SQL files.


# **File naming convention**
### ETL DAG
Use the group letter plus the university's acronym and location, followed by "_dag_etl.py", to distinguish it from the other files.

E.g.: GFUNRioCuarto_dag_etl.py


# **Naming conventions for connections and generated files**

### Database connection
It will be named 'alkemy_db'.

### S3 connection
It will be named 'aws_s3_bucket'.

### Generated csv files
Use the group letter plus the university's acronym and location, followed by "_select.csv", to identify the DAG that produced them.

E.g.: GFUNRioCuarto_select.csv

### Generated txt files
Use the group letter plus the university's acronym and location, followed by "_process.txt", to identify the DAG that produced them.

E.g.: GFUNRioCuarto_process.txt

# SUPPLEMENTARY MATERIAL

# AIRFLOW

https://airflow.apache.org/

# Important information on getting started with Airflow
https://www.astronomer.io/guides/airflow-sql-tutorial/

# Alkemy's Airflow course
https://academy.alkemy.org/curso/python/contenidos/clase-1-introduccion-a-flujos-de-trabajo

# The definitive Airflow guide
[Guide](https://www.astronomer.io/ebooks/dags-definitive-guide.pdf)

# Airflow Hooks Explained 101
https://hevodata.com/learn/airflow-hooks/

# Create an S3 bucket in AWS.
https://docs.aws.amazon.com/es_es/elastictranscoder/latest/developerguide/gs-2-create-s3-buckets.html

Create an S3 bucket and name it whatever you like.

![image](https://user-images.githubusercontent.com/2921066/194301926-a98e757b-d618-432c-b103-98a2e91a563c.png)

## Structure of S3 bucket
This part matters, as the Python scripts assume a specific folder structure. Define the structure as follows:

![image](https://user-images.githubusercontent.com/2921066/194302089-19e765a9-ef40-4245-9bbc-a53b2f0080e3.png)

After creating the S3 bucket, upload the "talks_info.csv" file from the repository root into the "preprocess/" folder.

## S3 IAM user
In order to interact with the S3 bucket, we have to create a user (or use an existing one).

![image](https://user-images.githubusercontent.com/2921066/194302165-2ce84708-2f99-4669-a013-d1ff17558f0f.png)

## Permissions for the user
Since we have many services, each with specific permissions, we have to assign the S3 permission to the new user.

![image](https://user-images.githubusercontent.com/2921066/194302244-d96f0220-34f6-4eb2-97cc-05db9fc0d7f2.png)

## Credentials
This is a very important step. Make sure to copy and save the credentials, because we will use them later.

![image](https://user-images.githubusercontent.com/2921066/194302285-47cbb07e-4128-40f3-aabd-0c7c3a276831.png)
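
As a hedged sketch (not part of this repo), the saved credentials can be handed to boto3 through the ACCESS_KEY and SECRET_ACCESS_KEY variables defined in this PR's .env file:

```python
# Hypothetical: build a boto3 session from the credentials saved above.
# ACCESS_KEY / SECRET_ACCESS_KEY are the variable names from this repo's .env.
import os

import boto3

session = boto3.Session(
    aws_access_key_id=os.getenv("ACCESS_KEY"),
    aws_secret_access_key=os.getenv("SECRET_ACCESS_KEY"),
)
s3 = session.client("s3")
```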

# Install the Amazon providers
https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html
# Amazon S3 documentation in Airflow
https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/s3/index.html


# From Local Filesystem to Amazon S3
https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/transfer/local_to_s3.html
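
A minimal sketch of the transfer operator documented at the link above (the task id, file path, and bucket name are placeholders):

```python
# Hypothetical task using the documented local-to-S3 transfer operator.
from airflow.providers.amazon.aws.transfers.local_to_s3 import (
    LocalFilesystemToS3Operator,
)

upload_to_s3 = LocalFilesystemToS3Operator(
    task_id="upload_to_s3",
    filename="datasets/GFUNRioCuarto_process.txt",  # placeholder path
    dest_key="GFUNRioCuarto_process.txt",
    dest_bucket="my-bucket",  # placeholder; taken from the BUCKET env var
    aws_conn_id="aws_s3_bucket",
    replace=True,
)
```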

## Quickstart AWS SDK for Python.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html


## How to use Boto3 to upload files to an S3 Bucket?
https://www.learnaws.org/2022/07/13/boto3-upload-files-s3/
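
Following the boto3 guide above, a hedged upload example (the bucket and key names are placeholders):

```python
# Hypothetical upload following the boto3 guide; names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "datasets/GFUNRioCuarto_process.txt",  # local file
    "my-bucket",                           # bucket name
    "process/GFUNRioCuarto_process.txt",   # object key
)
```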


# Airflow Dynamic DAGs: The powerful way with Jinja and YAML
https://www.youtube.com/watch?v=HuMgMTHrkn4&ab_channel=DatawithMarc

# Dynamically Generating DAGs in Airflow
https://www.astronomer.io/guides/dynamically-generating-dags/
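
A minimal sketch of the pattern both links describe, registering one DAG per university via module-level globals (the university list and the create_dag helper are illustrative):

```python
# Illustrative dynamic-DAG pattern: one DAG object per config entry.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

UNIVERSITIES = ["GBUNComahue", "GBUNSalvador"]  # hypothetical config source


def create_dag(prefix: str) -> DAG:
    with DAG(
        f"{prefix}_dag_etl",
        start_date=datetime(2022, 11, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract",
            python_callable=lambda: print(f"extract for {prefix}"),
        )
    return dag


# Airflow's DAG processor discovers DAGs through module-level globals
for prefix in UNIVERSITIES:
    globals()[f"{prefix}_dag_etl"] = create_dag(prefix)
```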

# Loggers

# Configuring the logger .cfg file
https://docs.python.org/3/library/logging.config.html#logging-config-fileformat

# Example configuration file
https://realpython.com/lessons/logger-config-file/
## Members and assigned groups

- [Di Paola, Matias](https://github.com/dipaolme) - Grupo A
- [Breyner Ocampo Cardenas](https://github.com/BROC95) - Grupo B

41 changes: 41 additions & 0 deletions assets/GBUNComahue_dag_elt.cfg
@@ -0,0 +1,41 @@
[loggers]
keys=root,GBUNComahue_dag_elt

[handlers]
keys=consoleHandler

[formatters]
keys=detailedFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[logger_GBUNComahue_dag_elt]
level=DEBUG
handlers=consoleHandler
qualname=GBUNComahue_dag_elt
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=detailedFormatter
args=(sys.stdout,)

[handler_simpleHandler]
formatter=detailedFormatter
class=handlers.RotatingFileHandler
level=DEBUG
# fileConfig only reads constructor arguments from "args", so maxBytes
# must be passed there: (filename, mode, maxBytes)
args=('/tmp/test.log', 'a', 31457280)


[formatter_detailedFormatter]
# format=%(asctime)s - %(name)s - %(levelname)s / %(message)s
format=%(asctime)s - %(name)s - %(message)s
# configparser keeps quotes literally, so the date format must be unquoted
datefmt=%Y-%m-%d




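A hedged sketch of how a DAG might consume this configuration (the asset path matches the file above; the usage itself is illustrative):

```python
# Hypothetical usage of the logger configuration shipped in assets/.
import logging.config

logging.config.fileConfig(
    "assets/GBUNComahue_dag_elt.cfg",
    disable_existing_loggers=False,
)
logger = logging.getLogger("GBUNComahue_dag_elt")
logger.info("ETL run started")
```
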
41 changes: 41 additions & 0 deletions assets/GBUNSalvador_dag_elt.cfg
@@ -0,0 +1,41 @@
[loggers]
keys=root,GBUNSalvador_dag_elt

[handlers]
keys=consoleHandler

[formatters]
keys=detailedFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[logger_GBUNSalvador_dag_elt]
level=DEBUG
handlers=consoleHandler
qualname=GBUNSalvador_dag_elt
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=detailedFormatter
args=(sys.stdout,)

[handler_simpleHandler]
formatter=detailedFormatter
class=handlers.RotatingFileHandler
level=DEBUG
# fileConfig only reads constructor arguments from "args", so maxBytes
# must be passed there: (filename, mode, maxBytes)
args=('/tmp/test.log', 'a', 31457280)


[formatter_detailedFormatter]
# format=%(asctime)s - %(name)s - %(levelname)s / %(message)s
format=%(asctime)s - %(name)s - %(message)s
# configparser keeps quotes literally, so the date format must be unquoted
datefmt=%Y-%m-%d




Binary file added assets/dag-factory.png