written by riya karumanchi & isabel sieh, cs124 staff team, winter '25/'26
First, form a group of 3 students to work together! Introduce yourselves to one another.
In logistic regression, we compute a real-valued score

$z = w \cdot x + b$

and then turn it into a probability using the sigmoid (logistic) function

$\hat{y} = s(z)$

where

$s(z) = \frac{1}{1+\exp(-z)}$
With your group, work through this part. This is meant to be straightforward—it's a sanity check to make sure you can interpret the sigmoid curve correctly. Look at the sigmoid function plotted below:
- We call $z$, the input to the sigmoid, the "logit".
Below are three different scores: $z = 0.2$, $z = 2$, and $z = 6$.
Question 1. Which one has the highest probability of the positive class, $\hat{y} = s(z)$?
# Your answer here
Question 2. Which pair is closer together as probabilities: $s(0.2)$ and $s(2)$, or $s(2)$ and $s(6)$?
# Your answer here
Use the following small reference table (you may treat these as given; no calculator needed):
$s(0)=0.5$, $s(2)\approx 0.88$, $s(-2)\approx 0.12$, $s(4)\approx 0.98$, $s(-4)\approx 0.02$
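If you want to double-check these reference values later (not needed now), here is a small Python sketch of the sigmoid formula from above:

```python
import math

def sigmoid(z):
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the reference table, rounded to two decimals.
for z in [0, 2, -2, 4, -4]:
    print(f"s({z:+d}) = {sigmoid(z):.2f}")
```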
Question 3. Suppose two different feature vectors produce scores $z = 2$ and $z = 4$. Using the reference values above, how far apart are their predicted probabilities?
# Your answer here
Question 3b (preview: temperature). Later in the course, when we work with language models, we will use a temperature parameter $T$ to control how sharp or flat the output probabilities are. In this simplified setting, we apply temperature by dividing the logit by $T$, so the prediction becomes $\hat{y} = s(z/T)$.
Question: Suppose the model produces logit $z = 4$. Using the reference values above, compare the predicted probability with $T = 1$ to the predicted probability with $T = 2$.
# Your answer here
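As a quick sanity check after you answer, here is a sketch of temperature scaling in Python. The logit $z = 4$ and temperature $T = 2$ are illustrative values chosen so the results line up with the reference table:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_with_temperature(z, T):
    """Divide the logit by the temperature T before applying the sigmoid."""
    return sigmoid(z / T)

z = 4  # an illustrative logit
print(sigmoid_with_temperature(z, T=1))  # s(4)   ~ 0.98
print(sigmoid_with_temperature(z, T=2))  # s(4/2) = s(2) ~ 0.88
```

Note that a higher temperature pulls the probability back toward 0.5, i.e., it makes the model less confident.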
Question 4. Now look at the negative logits in the table, $z = -2$ and $z = -4$. What do probabilities below 0.5 tell you about the predicted class?
# Your answer here
We will now go back to the whole class and discuss group answers for Part 1 in a plenary session.
For the following problem, please choose a group facilitator/representative who will also take notes on your discussion.
Each document (sentence/comment) is converted into a feature vector. The model then computes:
- score (logit): $z = w \cdot x + b$
  - $x$ is the feature for the document
  - $w$ is the weight for the feature
  - $b$ is a bias term (a constant offset)
- predicted probability: $\hat{y} = s(z)$, where $s(z)=\frac{1}{1+\exp(-z)}$. $\hat{y}$ is the model's predicted probability that $y=1$ for this document.
- true label: $y \in \{0,1\}$ is the correct label for the document
  - $y=1$: positive (or "class 1")
  - $y=0$: negative (or "class 0")
The loss we use in lecture is the cross-entropy loss ($L_{CE}$):

$L_{CE} = -\left[y \log s(w \cdot x + b) + (1-y)\log\left(1 - s(w \cdot x + b)\right)\right]$
Here is what each term means, in plain language:
- $L_{CE}$: the loss (how "wrong" the model is on this example)
- $w$: the weight for the feature
- $x$: the value of the feature for this document
- $w \cdot x + b$: the score ($z$), the total evidence before applying the sigmoid
- $s(w \cdot x + b)$: the predicted probability ($\hat{y}$)
- $\hat{y}-y$: the "error term" (positive if we predicted too high; negative if we predicted too low)
Finally, stochastic gradient descent (SGD) updates parameters by moving against the gradient:

$q \leftarrow q - h\,g$

where:

- $q$ is the parameter vector (it contains $w$ and $b$)
- $h$ is the learning rate
- $g$ is the gradient
We will classify a movie review comment using a single feature:
- $x$ = number of positive words in the comment (from a small positive lexicon)
Consider the comment:
"The acting was great and the soundtrack was incredible, and the cinematography was amazing."
Assume our positive lexicon contains: {great, incredible, amazing}
So the feature is:
- $x = 3$ (great, incredible, amazing)
We will start with:

- $w = 0$, $b = 0$

So initially $z = w \cdot x + b = 0$, and $\hat{y} = s(0) = 0.5$. The true label is $y = 1$ (the comment is positive). Let the learning rate be a small positive constant $h$.
1. Compute $\hat{y}-y$ at initialization. Is it positive or negative?
# Your answer here
2. Using $\frac{\partial L_{CE}}{\partial w} = (\hat{y}-y)x$, determine the sign (positive or negative) of $\frac{\partial L_{CE}}{\partial w}$.
# Your answer here
3. Gradient descent updates: $w \leftarrow w - h\frac{\partial L_{CE}}{\partial w}$. Will $w$ increase or decrease?
# Your answer here
3b. The bias also updates: $b \leftarrow b - h\frac{\partial L_{CE}}{\partial b}$, where $\frac{\partial L_{CE}}{\partial b} = (\hat{y}-y)$. Will $b$ increase or decrease?
# Your answer here
4. After this update, will the new score $z = w\cdot x + b$ be larger or smaller than before? Therefore, will $\hat{y}=s(z)$ move toward 1 or toward 0?
# Your answer here
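Once you have answered Questions 1–4, you can check your reasoning numerically with this Python sketch of a single SGD step. The learning rate $h = 0.1$ is an illustrative choice (only its positivity matters for the sign questions above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 3.0, 1.0   # feature count and true label (the review is positive)
w, b = 0.0, 0.0   # initialization from the handout
h = 0.1           # learning rate: an illustrative value

y_hat = sigmoid(w * x + b)   # s(0) = 0.5
error = y_hat - y            # 0.5 - 1 = -0.5 (negative: we predicted too low)
grad_w = error * x           # dL/dw = (y_hat - y) * x = -1.5
grad_b = error               # dL/db = (y_hat - y)     = -0.5

w -= h * grad_w              # w increases: 0 -> 0.15
b -= h * grad_b              # b increases: 0 -> 0.05
z_new = w * x + b            # 0.45 + 0.05 = 0.5 > 0
print(sigmoid(z_new))        # ~0.62: y_hat moved toward 1
```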
We will now reconvene as a class to discuss Case 1 before moving on.
Repeat Questions 1–4, but with a true label of $y = 0$ (suppose the same comment is actually negative, despite the positive words).
5. What is the sign of $\hat{y}-y$ now?
# Your answer here
6. Will $w$ increase or decrease?
# Your answer here
6b. Will $b$ increase or decrease?
# Your answer here
7. Will $z$ increase or decrease? Will $\hat{y}$ move toward 1 or toward 0?
# Your answer here
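As with Case 1, you can verify your answers with a one-step sketch in Python. Only the label changes; the learning rate $h = 0.1$ is again an illustrative value:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 3.0, 0.0          # same feature count, but now the true label is 0
w, b, h = 0.0, 0.0, 0.1  # h is an illustrative learning rate

y_hat = sigmoid(w * x + b)   # s(0) = 0.5
error = y_hat - y            # +0.5 (positive: we predicted too high)
w -= h * error * x           # w decreases: 0 -> -0.15
b -= h * error               # b decreases: 0 -> -0.05
print(sigmoid(w * x + b))    # ~0.38: y_hat moves toward 0
```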
8. In one sentence: explain why $(\hat{y}-y)$ makes sense as an "error signal."
# Your answer here
9. In one sentence: explain why multiplying by $x$ makes sense (why a feature that appears more should change its weight more).
# Your answer here
We will now go back to the whole class and discuss group answers for Part 2 in a plenary session.
In Parts 1 and 2, you explored how logistic regression computes a probability $\hat{y}$ and how gradient descent updates its weights. Now we consider how those probabilities get used.
Imagine you are using your model to screen resumes for a software engineering role. The model outputs the probability $\hat{y}$ that each candidate is a good fit.
You must decide where to set the threshold for which candidates move on. If you set a high threshold (e.g., 0.9), you only interview "sure bets" according to the model. If you set it lower (e.g., 0.4), you interview more people, including those the model is unsure about.
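To make the thresholding concrete, here is a tiny Python sketch using five hypothetical candidate scores (made up for illustration; the thresholds match the examples above):

```python
# Hypothetical model probabilities for five candidates.
scores = [0.95, 0.85, 0.60, 0.45, 0.30]

def interviewed(scores, threshold):
    """Return the candidates whose score clears the threshold."""
    return [p for p in scores if p >= threshold]

print(len(interviewed(scores, 0.9)))  # 1 candidate moves on
print(len(interviewed(scores, 0.4)))  # 4 candidates move on
```

The same model, with the same scores, produces very different candidate pools depending on this one number.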
In Lab 1, we discussed how the origins of a dataset (like the New York Times Annotated Corpus) can introduce narrow perspectives or historical biases into a model.
- If you set a very high threshold (e.g., 0.9), how does this choice interact with existing biases in your training data?
# Your answer here
- If you lower the threshold to 0.3 to be more inclusive, what is the "cost" to the hiring team? How do you narrow in on a fair threshold?
# Your answer here
- Beyond accuracy, what are some considerations or metrics a team should look at when deciding on a threshold for a hiring tool? How could you evaluate the fairness of your chosen threshold?
# Your answer here
