diff --git a/slides/images/inner-join.gif b/slides/images/inner-join.gif new file mode 100644 index 0000000..e44e9be Binary files /dev/null and b/slides/images/inner-join.gif differ diff --git a/slides/images/left-join.gif b/slides/images/left-join.gif new file mode 100644 index 0000000..b8c5ca7 Binary files /dev/null and b/slides/images/left-join.gif differ diff --git a/slides/images/omop1.png b/slides/images/omop1.png new file mode 100644 index 0000000..6664548 Binary files /dev/null and b/slides/images/omop1.png differ diff --git a/slides/images/original-dfs.png b/slides/images/original-dfs.png new file mode 100644 index 0000000..3d37419 Binary files /dev/null and b/slides/images/original-dfs.png differ diff --git a/slides/lesson1_slides.html b/slides/lesson1_slides.html index b0e4765..8f33fc5 100644 --- a/slides/lesson1_slides.html +++ b/slides/lesson1_slides.html @@ -1218,6 +1218,7 @@

W1: Database Concepts, DESCRIBE, SELECT, WHERE

Welcome!

Please sign up for an account at Posit Cloud and accept our classroom invitation here: https://posit.cloud/spaces/689711/join?access_code=8kse5IYlL4kHIqZvKaQ6mXp8IMibFayMa10I8Izn

+

Our course website: https://intro-sql-fh.netlify.app/

Introductions

@@ -1249,12 +1250,12 @@

Introductions

Goals of the course

@@ -1264,7 +1265,7 @@

Content of the course

  • Database Concepts, DESCRIBE, SELECT, WHERE

  • JOINing tables

  • [No class week]

  • -
  • Calculating new fields, GROUP BY, CASE WHEN, HAVING

  • +
  • Grouping and Aggregating variables

  • Subqueries, Views, Pizza

  • @@ -1479,6 +1480,10 @@

    Our underlying data model

    +
    +

    A short survey on your interest and background

    +

    https://forms.gle/YADmDmukRKmGk2KFA

    +

    Let’s get started: connecting to the database

    @@ -2320,34 +2325,34 @@

    COUNT DISTINCT for unique entries

    -4058899 +4151422 -4295880 +4125906 -4216130 +44783196 -4024289 +4187458 -4202451 +4163872 -4330583 +4198190 -4238715 +4326177 -4186930 +4163951 -4242997 +40492359 -4047491 +4058899 diff --git a/slides/lesson1_slides.qmd b/slides/lesson1_slides.qmd index 2ff6c3e..a98cac5 100644 --- a/slides/lesson1_slides.qmd +++ b/slides/lesson1_slides.qmd @@ -15,6 +15,8 @@ output-location: fragment Please [sign-up for an account at Posit Cloud](https://login.posit.cloud/register "https://login.posit.cloud/register") and accept our classroom invitation here: +Our course website: + ## Introductions - Who am I? @@ -41,11 +43,11 @@ Please [sign-up for an account at Posit Cloud](https://login.posit.cloud/regist . . . -- +- Fundamentals of SQL query writing: filtering, joining, grouping. . . . -- +- Not so much about building your own database and optimizing it. ## Content of the course @@ -55,7 +57,7 @@ Please [sign-up for an account at Posit Cloud](https://login.posit.cloud/regist 3. \[No class week\] -4. Calculating new fields, `GROUP BY`, `CASE WHEN`, `HAVING` +4. Grouping and Aggregating variables 5. Subqueries, Views, **Pizza** @@ -187,6 +189,10 @@ Procedure Occurrence table ![](../img/omop0.png){width="550"} +## A short survey on your interest and background + + + ## Let's get started: connecting to the database ```{r, warning=FALSE} @@ -278,6 +284,20 @@ SELECT person_id, gender_source_value, race_source_value WHERE year_of_birth < 2000 ``` +## SQL Comparison Operators + +- Equal: `=` + +- Greater than: `>` + +- Less than: `<` + +- Greater than or equal to: `>=` + +- Less than or equal to: `<=` + +- Not equal to: `<>` + ## Single quotes and `WHERE` Single quotes ('M') refer to values, and double quotes refer to columns ("person_id"). 
diff --git a/slides/lesson2_slides.html b/slides/lesson2_slides.html new file mode 100644 index 0000000..a4ac7c1 --- /dev/null +++ b/slides/lesson2_slides.html @@ -0,0 +1,3247 @@ + + + + + + + + + + + + + Week 2: JOINs, More WHERE, Boolean Logic, ORDER BY + + + + + + + + + + + + + + + +
    +
    + +
    +

    Week 2: JOINs, More WHERE, Boolean Logic, ORDER BY

    + +
    +
    + +
    +
    +

    Table references

    +

    In single table queries, it is usually unambiguous to the query engine which column and which table you need to query.

    +

    However, when you involve multiple tables, it is important to know how to refer to a column in a specific table.

    +
    +

    For example:

    +
    +
    library(DBI)
    +
    +con <- DBI::dbConnect(duckdb::duckdb(), 
    +                      "../data/GiBleed_5.3_1.1.duckdb")
    +
    +
    + +
    +
    +
    SELECT person.person_id, person.year_of_birth
    +  FROM person
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Displaying records 1 - 10
    person_id  year_of_birth
            6           1963
          123           1950
          129           1974
           16           1971
           65           1967
           74           1972
           42           1909
          187           1945
           18           1965
          111           1975
    +
    +
    +
    +
    +

    Your turn to use table references:

    +
    +
    SELECT *
    +  FROM procedure_occurrence
    +  WHERE person_id = 1
    +
    +
    +
    + + ++++++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    5 records
    procedure_occurrence_idperson_idprocedure_concept_idprocedure_dateprocedure_datetimeprocedure_type_concept_idmodifier_concept_idquantityprovider_idvisit_occurrence_idvisit_detail_idprocedure_source_valueprocedure_source_concept_idmodifier_source_value
    11447831961981-08-171981-08-17380002750NANA85069925300344783196NA
    2141259061982-09-111982-09-11380002750NANA8302880860094125906NA
    3142524191981-08-101981-08-10380002750NANA820740160014252419NA
    4141709471958-03-111958-03-11380002750NANA7902744740014170947NA
    5140474911958-03-111958-03-11380002750NANA79012250024047491NA
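As a small illustration of table references, here is a minimal sketch that qualifies every column with its table name, including in the `WHERE` clause. The table `demo_person` and its rows are made up for this example and are not part of the course database; the sketch assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical demo table; run in a scratch DuckDB session,
-- not against the course database.
CREATE OR REPLACE TABLE demo_person (
  person_id INTEGER,
  year_of_birth INTEGER
);
INSERT INTO demo_person VALUES (1, 1963), (2, 1950), (3, 1974);

-- Every column is written as table.column, so there is no doubt
-- which table each one comes from.
SELECT demo_person.person_id, demo_person.year_of_birth
  FROM demo_person
  WHERE demo_person.year_of_birth < 1970
  ORDER BY demo_person.person_id;
```

With a single table the prefix is optional, but the same habit removes ambiguity once two tables share a column name such as `person_id`.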
    +
    +
    +
    +
    +
    +

    Entity Relationship Diagrams

    + +
      +
    • For each person_id in the person table, there may be duplicated person_ids in the procedure_occurrence table, as a patient can have multiple procedures. This is a one-to-many relationship.

    • +
    • Multiple elements of procedure_concept_id in the procedure_occurrence table may correspond to a single element of concept_id in the “concept” table. This is a many-to-one relationship.

    • +
    • You can also have a one-to-one relationship.

    • +
    + +
    +
    +

    Joins

    +

    To set the stage, let’s show two tables, x and y. We want to join them by the keys, which are represented by colored boxes in both of the tables.

    +

    +
    +

    In an INNER JOIN, we only retain rows that have elements that exist in both the x and y tables.
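A minimal sketch of an `INNER JOIN` on two tiny tables named `x` and `y`. The column names (`id`, `val_x`, `val_y`) and the values are assumptions made up for this example, and it is meant to be run in a scratch in-memory DuckDB session.

```sql
-- Hypothetical tables; the key column is called id here.
CREATE OR REPLACE TABLE x (id INTEGER, val_x VARCHAR);
CREATE OR REPLACE TABLE y (id INTEGER, val_y VARCHAR);
INSERT INTO x VALUES (1, 'x1'), (2, 'x2'), (3, 'x3');
INSERT INTO y VALUES (1, 'y1'), (2, 'y2'), (4, 'y4');

-- Only keys 1 and 2 exist in both tables, so the result has 2 rows.
-- Key 3 (only in x) and key 4 (only in y) are dropped.
SELECT x.id, x.val_x, y.val_y
  FROM x
  INNER JOIN y
  ON x.id = y.id
  ORDER BY x.id;
```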

    +

    +
    +
    +
    +

    INNER JOIN syntax

    +
    +
    SELECT person.person_id, procedure_occurrence.procedure_occurrence_id 
    +    FROM person
    +    INNER JOIN procedure_occurrence
    +    ON person.person_id = procedure_occurrence.person_id
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Displaying records 1 - 10
    person_idprocedure_occurrence_id
    3433554
    3573741
    3993928
    4064115
    4114302
    4304489
    4424676
    4534863
    4695050
    4885237
    +
    +
    +
    +
      +
    • FROM person and INNER JOIN procedure_occurrence specify the tables to be joined.

    • +
    • ON person.person_id = procedure_occurrence.person_id specifies the key column in each table whose values must match.

    • +
    +
    +
    +
    +

    Table References

    +

    We can abbreviate the table names with the AS keyword:

    +
    +
    SELECT p.person_id, po.procedure_occurrence_id 
    +    FROM person as p
    +    INNER JOIN procedure_occurrence as po
    +    ON p.person_id = po.person_id
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Displaying records 1 - 10
    person_idprocedure_occurrence_id
    3433554
    3573741
    3993928
    4064115
    4114302
    4304489
    4424676
    4534863
    4695050
    4885237
    +
    +
    +
    +
    +

    LEFT JOIN

    +

    A LEFT JOIN keeps every row of the left table. If a row exists in the left table but has no match in the right table, it still appears in the joined table, with NULL in the columns that come from the right table.

    + +
    +

    We can see the difference between an INNER JOIN and a LEFT JOIN by counting the number of rows kept after joining:

    +
    +
    SELECT COUNT (*)
    +    FROM person as p
    +    INNER JOIN procedure_occurrence as po
    +    ON p.person_id = po.person_id
    +
    +
    +
    + + + + + + + + + + + + +
    1 records
    count_star()
    37409
    +
    +
    +
    +
    +
    +
    SELECT COUNT (*)
    +    FROM person as p
    +    LEFT JOIN procedure_occurrence as po
    +    ON p.person_id = po.person_id
    +
    +
    +
    + + + + + + + + + + + + +
    1 records
    count_star()
    37510
    +
    +
    +

    The LEFT JOIN returns 101 more rows than the INNER JOIN (37510 versus 37409). This means 101 rows of the person table have a person_id that is not found in the procedure_occurrence table.
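To see where the extra rows come from, here is a minimal sketch on made-up data. The tables `demo_person` and `demo_procedure` are hypothetical stand-ins for `person` and `procedure_occurrence`, and the `IS NULL` filter is one common way, not the only way, to list the unmatched rows. It assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical demo tables: person 3 has no procedures.
CREATE OR REPLACE TABLE demo_person (person_id INTEGER);
CREATE OR REPLACE TABLE demo_procedure (procedure_id INTEGER, person_id INTEGER);
INSERT INTO demo_person VALUES (1), (2), (3);
INSERT INTO demo_procedure VALUES (10, 1), (11, 1), (12, 2);

-- LEFT JOIN: 4 rows. Person 3 is kept, with NULL for procedure_id.
SELECT p.person_id, po.procedure_id
  FROM demo_person AS p
  LEFT JOIN demo_procedure AS po
  ON p.person_id = po.person_id
  ORDER BY p.person_id, po.procedure_id;

-- Keep only the left rows that found no match: person 3.
SELECT p.person_id
  FROM demo_person AS p
  LEFT JOIN demo_procedure AS po
  ON p.person_id = po.person_id
  WHERE po.person_id IS NULL;
```

An `INNER JOIN` on the same data returns 3 rows, so the difference of one row is exactly the one person without a procedure.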

    +
    +
    +
    +

    Other kinds of JOINs

    +
      +
    • The RIGHT JOIN is identical to LEFT JOIN, except that the rows preserved are from the right table.
    • +
    • The FULL JOIN retains all rows in both tables, regardless of whether there is a key match.
    • +
    • An ANTI JOIN is helpful to find all of the keys that are in the left table, but not the right table.
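The three join types above can be sketched on the same kind of tiny `x` and `y` tables. The tables and values are made up for this example. The anti join is written with `NOT EXISTS`, which is a portable way to express it, chosen here as an assumption about what works across database systems rather than as the course's own syntax. It assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical tables: key 3 is only in x, key 4 is only in y.
CREATE OR REPLACE TABLE x (id INTEGER, val_x VARCHAR);
CREATE OR REPLACE TABLE y (id INTEGER, val_y VARCHAR);
INSERT INTO x VALUES (1, 'x1'), (2, 'x2'), (3, 'x3');
INSERT INTO y VALUES (1, 'y1'), (2, 'y2'), (4, 'y4');

-- RIGHT JOIN: every row of y is kept (3 rows); x_id is NULL for key 4.
SELECT x.id AS x_id, y.id AS y_id
  FROM x
  RIGHT JOIN y
  ON x.id = y.id;

-- FULL JOIN: every row of both tables is kept (4 rows).
SELECT x.id AS x_id, y.id AS y_id
  FROM x
  FULL JOIN y
  ON x.id = y.id;

-- Anti join: keys in x that have no match in y (only key 3).
SELECT x.id
  FROM x
  WHERE NOT EXISTS (SELECT 1 FROM y WHERE y.id = x.id);
```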
    • +
    +
    +
    +

    Multiple JOINs

    +

    Can we do a triple join?

    + +

    Suppose that we want a table with person.person_id, procedure_occurrence.procedure_occurrence_id, and concept.concept_name.

    +
    +

    Some suggested steps:

    +
      +
    1. We first INNER JOIN person and procedure_occurrence, to produce an output table
    2. +
    3. We take this output table and INNER JOIN it with concept.
    4. +
    +
    +
    +
    +

    Using JOIN with WHERE

    +

    Let’s add a WHERE clause so that we keep only the rows whose concept_name is 'Subcutaneous immunotherapy':

    +
    +
    SELECT p.person_id, po.procedure_occurrence_id, c.concept_name
    +  FROM person AS p
    +  INNER JOIN procedure_occurrence AS po
    +  ON p.person_id = po.person_id
    +  INNER JOIN concept AS c
    +  ON po.procedure_concept_id = c.concept_id
    +  WHERE c.concept_name = 'Subcutaneous immunotherapy';
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Displaying records 1 - 10
    person_idprocedure_occurrence_idconcept_name
    16289Subcutaneous immunotherapy
    1801958Subcutaneous immunotherapy
    9187Subcutaneous immunotherapy
    5119Subcutaneous immunotherapy
    36559Subcutaneous immunotherapy
    1241226Subcutaneous immunotherapy
    2252244Subcutaneous immunotherapy
    4094243Subcutaneous immunotherapy
    2362392Subcutaneous immunotherapy
    2602556Subcutaneous immunotherapy
    +
    +
    +
    +
    +

    Revisiting WHERE: AND versus OR

    +

    Revisiting WHERE, we can combine conditions with AND or OR.

    +

    AND is at least as restrictive as OR, because our rows must meet both conditions.

    +
    +
    SELECT COUNT(*)
    +  FROM person
    +  WHERE year_of_birth < 1980 
    +  AND gender_source_value = 'M'
    +
    +
    +
    + + + + + + + + + + + + +
    1 records
    count_star()
    1261
    +
    +
    +
    +

    On the other hand, OR is more permissive than AND, because our rows need to meet only one of the conditions.

    +
    +
    SELECT COUNT(*)
    +  FROM person
    +  WHERE year_of_birth < 1980 
    +  OR gender_source_value = 'M'
    +
    +
    +
    + + + + + + + + + + + + +
    1 records
    count_star()
    2629
    +
    +
    +
    +
    +

    There is also NOT, which negates a condition. Here, the first condition must be true and the second must be false.

    +
    +
    SELECT COUNT(*)
    +  FROM person
    +  WHERE year_of_birth < 1980 
    +  AND NOT gender_source_value = 'M'
    +
    +
    +
    + + + + + + + + + + + + +
    1 records
    count_star()
    1308
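The three operators can be compared side by side on a table small enough to check by hand. This is a minimal sketch: `demo_people` and its four rows are made up for this example, and it assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical demo table with four people.
CREATE OR REPLACE TABLE demo_people (
  person_id INTEGER,
  year_of_birth INTEGER,
  gender_source_value VARCHAR
);
INSERT INTO demo_people VALUES
  (1, 1970, 'M'),
  (2, 1975, 'F'),
  (3, 1990, 'M'),
  (4, 1995, 'F');

-- AND: both conditions must hold. Only person 1 qualifies.
SELECT COUNT(*)
  FROM demo_people
  WHERE year_of_birth < 1980
  AND gender_source_value = 'M';

-- OR: at least one condition must hold. Persons 1, 2 and 3 qualify.
SELECT COUNT(*)
  FROM demo_people
  WHERE year_of_birth < 1980
  OR gender_source_value = 'M';

-- AND NOT: the first condition holds and the second does not. Only person 2.
SELECT COUNT(*)
  FROM demo_people
  WHERE year_of_birth < 1980
  AND NOT gender_source_value = 'M';
```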
    +
    +
    +
    +
    +
    +

    ORDER BY

    +

    ORDER BY lets us sort tables by one or more columns:

    +
    +
    SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date
    +    FROM person as p
    +    INNER JOIN procedure_occurrence as po
    +    ON p.person_id = po.person_id
    +    ORDER BY p.person_id;
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Displaying records 1 - 10
    person_id  procedure_occurrence_id  procedure_date
            1                        1  1981-08-17
            1                        2  1982-09-11
            1                        3  1981-08-10
            1                        4  1958-03-11
            1                        5  1958-03-11
            2                        6  1955-10-22
            2                        7  1977-04-08
            2                        8  1931-09-03
            2                        9  2007-09-04
            2                       10  1924-01-12
    +
    +
    +

    Once we sort by person_id, we see that for every unique person_id, there can be multiple procedures! This suggests that there is a one-to-many relationship between the person and procedure_occurrence tables.
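`ORDER BY` also accepts several columns, each with its own direction. A minimal sketch on made-up data follows: `demo_procedures` is a hypothetical table, and `DESC` (newest date first) is used only to show that directions can differ per column. It assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical demo table, inserted in no particular order.
CREATE OR REPLACE TABLE demo_procedures (
  person_id INTEGER,
  procedure_id INTEGER,
  procedure_date DATE
);
INSERT INTO demo_procedures VALUES
  (2, 6, DATE '1955-10-22'),
  (1, 1, DATE '1981-08-17'),
  (1, 4, DATE '1958-03-11'),
  (2, 9, DATE '2007-09-04');

-- Sort by person first, then by date within each person, newest first.
-- Rows come back in this order of procedure_id: 1, 4, 9, 6.
SELECT person_id, procedure_id, procedure_date
  FROM demo_procedures
  ORDER BY person_id, procedure_date DESC;
```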

    +
    +
    +

    Constraints and rules for Databases

    +

    Some constraints we can require on columns of a table:

    +
      +
    • Typed: such as INTEGER, VARCHAR
    • +
    • NOT NULL - no value in the column can be NULL.
    • +
    • UNIQUE - all values must be unique.
    • +
    • PRIMARY KEY - NOT NULL and UNIQUE.
    • +
    • FOREIGN KEY - value must exist as a primary key in another table’s field. The referenced table’s field must be specified.
    • +
    • CHECK - every value must satisfy a stated condition. One example would be that our dates shouldn’t be before 1900.
    • +
    • DEFAULT - default values are given if not provided.
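The constraints in the list above can be sketched in one table definition. Everything in this example is an assumption made for illustration: `demo_visit`, its columns and its values are not part of the OMOP schema. It assumes a scratch in-memory DuckDB session.

```sql
-- Hypothetical table that uses each constraint from the list once.
CREATE OR REPLACE TABLE demo_visit (
  visit_id   INTEGER PRIMARY KEY,                           -- NOT NULL and UNIQUE
  person_id  INTEGER NOT NULL,                              -- must be provided
  visit_code VARCHAR UNIQUE,                                -- no repeated codes
  visit_date DATE CHECK (visit_date >= DATE '1900-01-01'),  -- no dates before 1900
  visit_type VARCHAR DEFAULT 'outpatient'                   -- used when omitted
);

-- A valid row; visit_type is not given, so the default is filled in.
INSERT INTO demo_visit (visit_id, person_id, visit_code, visit_date)
  VALUES (1, 42, 'A-001', DATE '2001-05-03');

SELECT * FROM demo_visit;

-- Each of the following would be rejected by a constraint:
-- INSERT INTO demo_visit (visit_id, person_id, visit_code, visit_date)
--   VALUES (1, 43, 'A-002', DATE '2001-05-04');  -- duplicate PRIMARY KEY
-- INSERT INTO demo_visit (visit_id, person_id, visit_code, visit_date)
--   VALUES (2, NULL, 'A-003', DATE '2001-05-05');  -- NOT NULL violated
-- INSERT INTO demo_visit (visit_id, person_id, visit_code, visit_date)
--   VALUES (3, 44, 'A-004', DATE '1899-12-31');  -- CHECK violated
```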
    • +
    +
    +
    +

    Primary keys

    +

    Tables are normally given a PRIMARY KEY, which cannot be NULL and must be unique. This gives a unique id for each entry of the table.

    +
    +

    When we create tables in our database, we need to specify which column is a PRIMARY KEY:

    +
    CREATE TABLE person (
    +  person_id INTEGER PRIMARY KEY
    +)
    +
    +
    +
    +

    Foreign keys

    +

    A FOREIGN KEY involves two or more tables. If a column is declared a FOREIGN KEY, then each of its values must exist as a primary key in the table named after REFERENCES.

    +
    +
    CREATE TABLE procedure_occurrence (
    +  procedure_occurrence_id INTEGER PRIMARY KEY,
    +  person_id INTEGER REFERENCES person(person_id),
    +  procedure_concept_id INTEGER REFERENCES concept(concept_id)
    +)
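A minimal sketch of a foreign key in action. The tables `fk_person` and `fk_procedure` are hypothetical, simplified stand-ins for `person` and `procedure_occurrence`. It assumes a scratch in-memory DuckDB session, where foreign keys are enforced on insert.

```sql
-- Drop the referencing table first so the example can be re-run.
DROP TABLE IF EXISTS fk_procedure;
DROP TABLE IF EXISTS fk_person;

CREATE TABLE fk_person (
  person_id INTEGER PRIMARY KEY
);

CREATE TABLE fk_procedure (
  procedure_id INTEGER PRIMARY KEY,
  person_id    INTEGER REFERENCES fk_person(person_id)
);

INSERT INTO fk_person VALUES (1), (2);

-- Allowed: person 1 exists in fk_person.
INSERT INTO fk_procedure VALUES (10, 1), (11, 1);

-- Would be rejected: there is no person 99 in fk_person.
-- INSERT INTO fk_procedure VALUES (12, 99);

SELECT * FROM fk_procedure ORDER BY procedure_id;
```

Because of this rule, every `person_id` stored in the referencing table is guaranteed to find a match when the two tables are joined.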
    +
    +
    +
    +

    Always close the connection

    +

    When we’re done, it’s best to close the connection with dbDisconnect().

    +
    +
    dbDisconnect(con)
    +
    +
    + +
    +
    + +
    +
    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/slides/lesson2_slides.qmd b/slides/lesson2_slides.qmd new file mode 100644 index 0000000..7ea5dc2 --- /dev/null +++ b/slides/lesson2_slides.qmd @@ -0,0 +1,268 @@ +--- +title: "Week 2: JOINs, More WHERE, Boolean Logic, ORDER BY" +format: + revealjs: + smaller: true + scrollable: true + echo: true + embed-resources: true +output-location: fragment +--- + +## Table references + +In single table queries, it is usually unambiguous to the query engine which column and which table you need to query. + +However, when you involve multiple tables, it is important to know how to refer to a column in a specific table. + +. . . + +For example: + +```{r} +library(DBI) + +con <- DBI::dbConnect(duckdb::duckdb(), + "../data/GiBleed_5.3_1.1.duckdb") +``` + +```{sql connection="con"} +SELECT person.person_id, person.year_of_birth + FROM person +``` + +. . . + +Your turn to use table references: + +```{sql connection="con"} +SELECT * + FROM procedure_occurrence + WHERE person_id = 1 +``` + +## Entity Relationship Diagrams + +![](images/omop1.png) + +- For each `person_id` in the `person` table, there may be duplicated `person_id`s in `procedure_occurrence` table, as a patient can have multiple procedures. This is a **one-to-many relationship**. + +- Multiple elements of `procedure_concept_id` in the `procedure_occurrence` table may correspond to a single element of `concept_id` in the "concept" table. This is a **many-to-one relationship**. + +- You can also have a **one-to-one relationship**. + +. . . + +[OMOP CDM (Common Data Model)](https://ohdsi.github.io/CommonDataModel/index.html). + +## Joins + +To set the stage, let's show two tables, `x` and `y`. We want to join them by the **keys**, which are represented by colored boxes in both of the tables. + +![](images/original-dfs.png) + +. . . 
+ +In an `INNER JOIN`, we only retain rows that have elements that exist in both the `x` and `y` tables. + +![](images/inner-join.gif) + +## `INNER JOIN` syntax + +```{sql connection="con"} +SELECT person.person_id, procedure_occurrence.procedure_occurrence_id + FROM person + INNER JOIN procedure_occurrence + ON person.person_id = procedure_occurrence.person_id +``` + +. . . + +1. `FROM person` and `INNER JOIN procedure_occurrence` specifies the tables to be joined. + +2. `ON person.person_id = procedure_occurrence.person_id` specifies the columns from each table for keys. + +3. Then, we `SELECT` for the columns we want to keep: `person.person_id, procedure_occurrence.procedure_occurrence_id` + +## Table References + +We can short-hand the table names via the `AS` statement: + +```{sql connection="con"} +SELECT p.person_id, po.procedure_occurrence_id + FROM person AS p + INNER JOIN procedure_occurrence AS po + ON p.person_id = po.person_id +``` + +## `LEFT JOIN` + +If a row exists in the left table, but not the right table, it will be replicated in the joined table, but have rows with `NULL` columns from the right table. + +![](images/left-join.gif) + +. . . + +We can see the difference between a `INNER JOIN` and `LEFT JOIN` by counting the number of rows kept after joining: + +```{sql} +#| connection: "con" +SELECT COUNT (*) + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id +``` + +. . . + +```{sql} +#| connection: "con" +SELECT COUNT (*) + FROM person as p + LEFT JOIN procedure_occurrence as po + ON p.person_id = po.person_id +``` + +This suggests that there are some unique `person_id`s in `person` table not found in the `person_id` of `procedure_occurrence` table. + +## Other kinds of `JOIN`s + +- The `RIGHT JOIN` is identical to `LEFT JOIN`, except that the rows preserved are from the *right* table. +- The `FULL JOIN` retains all rows in both tables, regardless if there is a key match. 
+- `ANTI JOIN` is helpful to find all of the keys that are in the *left* table, but not the *right* table + +## Multiple `JOIN`s + +Can we do a triple join? + +![](images/omop1.png) + +Suppose that we want a table with `person.person_id`, `procedure_occurrence.procedure_occurrence_id`, and `concept.concept_name`. + +. . . + +Some suggested steps: + +1. We first `INNER JOIN` `person` and `procedure_occurrence`, to produce an output table +2. We take this output table and `INNER JOIN` it with `concept`. + +## Using `JOIN` with `WHERE` + +Let's add an additional `WHERE` where we only want those rows that have the `concept_name` of 'Subcutaneous immunotherapy\`: + +```{sql connection="con"} +SELECT p.person_id, po.procedure_occurrence_id, c.concept_name + FROM person AS p + INNER JOIN procedure_occurrence AS po + ON p.person_id = po.person_id + INNER JOIN concept AS c + ON po.procedure_concept_id = c.concept_id + WHERE c.concept_name = 'Subcutaneous immunotherapy'; +``` + +## Revisiting `WHERE`: `AND` versus `OR` + +Revisiting `WHERE`, we can combine conditions with `AND` or `OR`. + +`AND` is always going to be more restrictive than `OR`, because our rows must meet two conditions. + +```{sql} +#| connection: "con" +SELECT COUNT(*) + FROM person + WHERE year_of_birth < 1980 + AND gender_source_value = 'M' +``` + +. . . + +On the other hand `OR` is more permissive than `AND`, because our rows must meet only one of the conditions. + +```{sql} +#| connection: "con" +SELECT COUNT(*) + FROM person + WHERE year_of_birth < 1980 + OR gender_source_value = 'M' +``` + +. . . + +There is also `NOT`, where one condition must be true, and the other must be false. 
+ +```{sql} +#| connection: "con" +SELECT COUNT(*) + FROM person + WHERE year_of_birth < 1980 + AND NOT gender_source_value = 'M' +``` + +## `ORDER BY` + +`ORDER BY` lets us sort tables by one or more columns: + +```{sql} +#| connection: "con" +SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id + ORDER BY p.person_id; +``` + +. . . + +Once we sort by `person_id`, we see that for every unique `person_id`, there can be multiple procedures! This suggests that there is a **one-to-many relationship** between the `person` and `procedure_occurrence` tables. + +. . . + +We can `ORDER BY` multiple columns at once. Try ordering by `p.person_id` and `po.procedure_date`... + +## Constraints and rules for Databases + +Some constraints we can require on columns of a table: + +- Typed: such as `INTEGER`, `VARCHAR` +- `NOT NULL` - no value in the column can be `NULL`. +- `UNIQUE` - all values must be unique. +- `PRIMARY KEY` - `NOT NULL` and `UNIQUE`. +- `FOREIGN KEY` - value must exist as a primary key in another table's field. The referenced table's field must be specified. +- `CHECK` - every value must satisfy a stated condition. One example would be that our dates shouldn't be before 1900. +- `DEFAULT` - default values are given if not provided. + +## Primary keys + +Tables are normally given a `PRIMARY KEY`, which cannot be `NULL` and must be unique. This gives a unique id for each entry of the table. + +. . . + +When we create tables in our database, we need to specify which column is a `PRIMARY KEY`: + +``` +CREATE TABLE person ( + person_id INTEGER PRIMARY KEY +) +``` + +## Foreign keys + +A `FOREIGN KEY` involves two or more tables. If a column is declared a `FOREIGN KEY`, then each of its values must *exist* as a primary key in the table named after `REFERENCES`. + +. . . 
+ +``` +CREATE TABLE procedure_occurrence ( + procedure_occurrence_id INTEGER PRIMARY KEY, + person_id INTEGER REFERENCES person(person_id), + procedure_concept_id INTEGER REFERENCES concept(concept_id) +) +``` + +## Always close the connection + +When we're done, it's best to close the connection with `dbDisconnect()`. + +```{r} +dbDisconnect(con) +``` diff --git a/week2.qmd b/week2.qmd index e882fab..fb0234b 100644 --- a/week2.qmd +++ b/week2.qmd @@ -47,38 +47,7 @@ SELECT * WHERE person_id = 1 ``` -## Aliases - -As your queries get more complex, and as you involve more and more tables, you will need to use aliases. I think of them like "nicknames" - they can save you a lot of typing. - -I tend to use the `AS` clause when I define them. I've used `AS` here to abbreviate `person`. I use it in two different places: in my `COUNT`, and in my `WHERE`: - -```{sql} -#| connection: "con" -SELECT COUNT(p.person_id) - FROM person AS p - WHERE p.year_of_birth < 2000; -``` - -Some people don't use `AS`, just putting the aliases next to the original name: - -```{sql} -#| connection: "con" -SELECT COUNT(p.person_id) - FROM person p - WHERE p.year_of_birth < 2000; -``` - -We can also rename variables using `AS`: - -```{sql} -#| connection: "con" -SELECT COUNT(person_id) AS person_count - FROM person - WHERE year_of_birth < 2000; -``` - -We will be using aliases and table references a lot when we start `JOIN`ing tables. +Let's get ready to work on queries involving multiple tables. ## Entity-relationship diagrams @@ -100,7 +69,7 @@ We should consider to what degree the values overlap: - You can also have a **one-to-one relationship**. -The database we\'ve been using has been rigorously modeled using a data model called [OMOP CDM (Common Data Model)](https://ohdsi.github.io/CommonDataModel/index.html). 
OMOP is short for Observational Medical Outcomes Partnership, and it is designed to be a database format that standardizes data from systems into a format that can be combined with other systems to compare health outcomes across organizations. The full OMOP entity relationship diagram can be [found here](https://ohdsi.github.io/CommonDataModel/cdm54erd.html). +The database we've been using has been rigorously modeled using a data model called [OMOP CDM (Common Data Model)](https://ohdsi.github.io/CommonDataModel/index.html). OMOP is short for Observational Medical Outcomes Partnership, and it is designed to be a database format that standardizes data from systems into a format that can be combined with other systems to compare health outcomes across organizations. The full OMOP entity relationship diagram can be [found here](https://ohdsi.github.io/CommonDataModel/cdm54erd.html). Now, let's join some tables. @@ -143,21 +112,52 @@ The last thing to note is the `ON` statement. These are the conditions by which ON person.person_id = procedure_occurrence.person_id ``` +## Aliases + +As your queries get more complex, and as you involve more and more tables, you will need to use **aliases**. I think of them like "nicknames" - they can save you a lot of typing. + Here is the same query using aliases. We use `p` as an alias for `person` and `po` as an alias for `procedure_occurrence`. You can see it is a little more compact. ```{sql} #| connection: "con" SELECT p.person_id, po.procedure_occurrence_id - FROM person as p - INNER JOIN procedure_occurrence as po + FROM person AS p + INNER JOIN procedure_occurrence AS po ON p.person_id = po.person_id ``` -## `LEFT JOIN` +### Another example -## Jargon alert +Here, I use table aliasing in two different places: in my `COUNT`, and in my `WHERE`: -The table to the **left** of the `JOIN` clause is called the **left table**, and the table to the **right** of the `JOIN` clause is known as the **right table**. 
This will become more important as we explore the different join types. +```{sql} +#| connection: "con" +SELECT COUNT(p.person_id) + FROM person AS p + WHERE p.year_of_birth < 2000; +``` + +Some people don't use `AS`, just putting the aliases next to the original name: + +```{sql} +#| connection: "con" +SELECT COUNT(p.person_id) + FROM person p + WHERE p.year_of_birth < 2000; +``` + +We can also rename variables using `AS`: + +```{sql} +#| connection: "con" +SELECT COUNT(person_id) AS person_count + FROM person + WHERE year_of_birth < 2000; +``` + +## `LEFT JOIN` + +Jargon alert: The table to the **left** of the `JOIN` clause is called the **left table**, and the table to the **right** of the `JOIN` clause is known as the **right table**. This will become more important as we explore the different join types. ``` FROM procedure_occurrence INNER JOIN concept @@ -200,32 +200,46 @@ This suggests that there are some unique `person_id`s in `person` table not foun ## Multiple `JOIN`s with Multiple Tables -We can have multiple joins by thinking them as a sequential operation of one join after another. In the below query we first `INNER JOIN` `person` and `procedure_occurrence`, and then use the output of that `JOIN` to `INNER JOIN` with `concept`: +![](img/omop1.png) -```{sql} -#| connection: "con" -SELECT p.person_id, po.procedure_occurrence_id, c.concept_name - FROM person AS p - INNER JOIN procedure_occurrence AS po - ON p.person_id = po.person_id - INNER JOIN concept AS c - ON po.procedure_concept_id = c.concept_id - LIMIT 10; -``` +Suppose that we want a table with `person.person_id`, `procedure_occurrence.procedure_occurrence_id`, and `concept.concept_name`. Looks like we need a triple join! The way I think of these multi-table joins is to decompose them into two joins: 1. We first `INNER JOIN` `person` and `procedure_occurrence`, to produce an output table 2. We take this output table and `INNER JOIN` it with `concept`. 
-Notice that both of these `JOIN`s have separate `ON` statements. For the first join, we have: +Give a try yourself: + +```{sql connection="con"} + +SELECT person.person_id, procedure_occurrence.procedure_occurrence_id + FROM + INNER JOIN + ON + +``` + +Then, add the third join: + +```{sql connection="con"} +SELECT person.person_id, procedure_occurrence.procedure_occurrence_id, concept.concept_name + FROM + INNER JOIN + ON + INNER JOIN + ON + +``` + +Some tips: Notice that both of these `JOIN`s have separate `ON` statements. For the first join, we could have: ``` INNER JOIN procedure_occurrence AS po ON p.person_id = po.person_id ``` -For the second `JOIN`, we have: +For the second `JOIN`, we could have: ``` INNER JOIN concept AS c @@ -240,6 +254,17 @@ For combining `INNER JOIN`s, we are looking for the subset of keys that exist in It's really important to check intermediate output and make sure that you are retaining the rows that you need in the final output. For example, I'd try the first join first and see that it contains the rows that I need before adding the second join. +Here is the solution: + +```{sql connection="con"} +SELECT p.person_id, po.procedure_occurrence_id, c.concept_name + FROM person AS p + INNER JOIN procedure_occurrence AS po + ON p.person_id = po.person_id + INNER JOIN concept AS c + ON po.procedure_concept_id = c.concept_id +``` + ## Using `JOIN` with `WHERE` Where we really start to cook with gas is when we combine `JOIN` with `WHERE`. Let's add an additional `WHERE` where we only want those rows that have the `concept_name` of 'Subcutaneous immunotherapy\`: @@ -284,7 +309,7 @@ SELECT po.person_id, c.concept_name I'm not the biggest fan of this, because it is often not clear what is a filtering clause and what is a joining clause, so I prefer to use `JOIN`/`ON` with a `WHERE`. -## Boolean Logic: `AND` versus `OR` +## Revisiting `WHERE`: `AND` versus `OR` Revisiting `WHERE`, we can combine conditions with `AND` or `OR`. 
@@ -346,13 +371,15 @@ SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date ORDER BY ----, ---- ``` -## Transactions and Inserting Data +## Constraints and rules for Databases So far, we've only queried data, but not added data to databases. -As we've stated before, DuckDB is an Analytical database, not a transactional one. That means it prioritizes reading from data tables rather than inserting into them. Transactional databases, on the other hand, can handle multiple inserts from multiple users at once. They are made for *concurrent* transactions. +As we've stated before, DuckDB is an Analytical database, not a Transactional one. That means it prioritizes reading from data tables rather than inserting into them. Transactional databases, on the other hand, can handle multiple inserts from multiple users at once. They are made for *concurrent* transactions. + +We are not going to look at how to add to a database in this course, but we are going to examine what the *constraints* can be placed on a database, because this gives rules on what is allowed in our database to be queried. -Here is an example of what is called the *Data Definition Language* for our tables: +When one sets up a database, we also set up the constraints via a *Data Definition Language* for our tables: ``` sql CREATE TABLE @cdmDatabaseSchema.PERSON ( @@ -376,20 +403,18 @@ CREATE TABLE @cdmDatabaseSchema.PERSON ( ethnicity_source_concept_id integer NULL ); ``` -When we add rows into a database, we need to be aware of the *constraints* of the database. They exist to maintain the *integrity* of a database. +We've encountered one constraint: database fields (columns) need to be *typed*. For example, id keys are usually `INTEGER`. Names are often `VARCHAR`. -We've encountered one constraint: database fields need to be *typed*. For example, id keys are usually `INTEGER`. Names are often `VARCHAR`. 
+Here are some other constraints that can be applied to a field (column): -One contraint is the requirement for *unique keys* for each row. We cannot add a new row with a previous key value. - -- `NOT NULL` -- `UNIQUE` -- `PRIMARY KEY` - `NOT NULL` + `UNIQUE` -- `FOREIGN KEY` - value must exist as a key in another table +- `NOT NULL` - no value in the column can be `NULL`. +- `UNIQUE` - all values must be unique. +- `PRIMARY KEY` - `NOT NULL` and `UNIQUE`. +- `FOREIGN KEY` - value must exist as a primary key in another table's field. The referenced table's field must be specified. - `CHECK` - check the data type and conditions. One example would be our data shouldn't be before 1900. -- `DEFAULT` - default values. +- `DEFAULT` - default values are given if not provided. -The most important ones to know about are `PRIMARY KEY` and `FOREIGN KEY`. `PRIMARY KEY` forces the database to create new rows with an automatically incremented id. +The most important constraints to know about are `PRIMARY KEY` and `FOREIGN KEY`. Tables are normally given a `PRIMARY KEY`, which cannot be `NULL` and must be unique. This gives a unique id for each entry of the table. When we create tables in our database, we need to specify which column is a `PRIMARY KEY`: @@ -399,7 +424,7 @@ CREATE TABLE person ( ) ``` -`FOREIGN KEY` involves two or more tables. If a column is declared a `FOREIGN KEY`, then that key value must *exist* in a REFERENCE table. Here our two reference tables are `person` and `procedure_occurrence`. +A `FOREIGN KEY` involves two or more tables. If a column is declared a `FOREIGN KEY`, then each of its values must *exist* as a primary key in the table named after `REFERENCES`. Here, when we create `procedure_occurrence`, the `person_id` column `REFERENCES` the `person_id` primary key column of the table `person`, and the `procedure_concept_id` column `REFERENCES` the `concept_id` primary key column of the table `concept`. 
``` sql CREATE TABLE procedure_occurrence { @@ -411,8 +436,6 @@ CREATE TABLE procedure_occurrence { Thus, we can use constraints to make sure that our database retains its integrity when we add rows to it. -There are more constraints we can add to our tables, and the correct use of these constraints will ensure that our data is correct. - You can see an example of constraints for our database here: . ## Always close the connection