Written by Riya Karumanchi & Isabel Sieh, CS124 staff team, Winter '25/'26
First, form a group of 3 students to work together! Introduce yourselves to one another.
In logistic regression, we compute a real-valued score $z$ and then turn it into a probability using the sigmoid (logistic) function $s(z)$, where $s(z)=\frac{1}{1+\exp(-z)}$.
With your group, work through this part. This is meant to be straightforward—it's a sanity check to make sure you can interpret the sigmoid curve correctly. Look at the sigmoid function plotted below:
- We call $z$, the input to the sigmoid, the "logit".
Below are three different scores (logits): $z = 0.2$, $z = 2$, $z = 6$.
Question 1. Which one has the highest probability of $y=1$? Which has the lowest?
# Answer below
Rank the probabilities:

- Highest probability of $y=1$: $z=6$
- Lowest probability of $y=1$: $z=0.2$

Justification: the sigmoid is monotonically increasing, so a larger logit $z$ always gives a larger probability $s(z)$.
Question 2. Which pair is closer together as probabilities: $z=0.2$ and $z=2$, or $z=2$ and $z=6$?
# Answer below
The pair $z=2$ and $z=6$ is closer together as probabilities, even though their logits are farther apart: $s(2)\approx 0.88$ and $s(6)\approx 0.998$ differ by only about $0.12$, while $s(0.2)\approx 0.55$ and $s(2)\approx 0.88$ differ by about $0.33$.

Justification: the sigmoid curve is flatter (it saturates) for large positive $z$, so equal-sized steps in the logit produce smaller and smaller changes in probability.
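If you want to check these readings off the curve numerically, here is a quick sketch in plain Python (this is not course-provided code):

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The three logits from Question 1.
probs = {z: sigmoid(z) for z in (0.2, 2, 6)}   # roughly 0.55, 0.88, 0.998

# Question 2: compare the probability gaps between neighboring logits.
gap_low = probs[2] - probs[0.2]   # s(2) - s(0.2), roughly 0.33
gap_high = probs[6] - probs[2]    # s(6) - s(2), roughly 0.12
```

Even though the logit gap on the right ($6 - 2 = 4$) is larger than the one on the left ($2 - 0.2 = 1.8$), the probability gap is smaller, because the curve has flattened out.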
Use the following small reference table (you may treat these as given; no calculator needed):
$s(0)=0.5$, $s(2)\approx 0.88$, $s(-2)\approx 0.12$, $s(4)\approx 0.98$, $s(-4)\approx 0.02$
Question 3. Suppose two different feature vectors produce scores $z=2$ and $z=4$. How do their predicted probabilities compare?
# Answer below
Compare $s(2)$ and $s(4)$. From the table:

$s(2)\approx 0.88$, $s(4)\approx 0.98$

So: the probability rises by only about $0.10$.

Key point: Even though the score doubled (from 2 to 4), the probability changed only slightly; the sigmoid saturates for large logits.
Question 3b (preview: temperature). Later in the course, when we work with language models, we will use a temperature parameter $T$ to control how "confident" a model's outputs are.

In this simplified setting, we apply temperature by dividing the logit by $T$ before the sigmoid, so the output becomes $s(z/T)$.

Question: Suppose the model produces logit $z=2$ and we use temperature $T=0.5$. Is the resulting probability higher or lower than $s(2)$?
# Answer below
Higher.

With $T=0.5$, the effective logit is $z/T = 2/0.5 = 4$.

So the probability becomes $s(4)\approx 0.98$, which is higher than $s(2)\approx 0.88$.

Intuition: A low temperature "sharpens" the sigmoid: it makes the output more extreme (closer to 0 or 1), because dividing by a small $T$ magnifies the logit.
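The sharpening (and the opposite, softening with $T > 1$) is easy to see numerically. A minimal sketch; `sigmoid_with_temperature` is our own name for this worksheet's $s(z/T)$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_with_temperature(z, temperature):
    """Divide the logit by the temperature before applying the sigmoid."""
    return sigmoid(z / temperature)

base = sigmoid(2.0)                              # about 0.88
sharpened = sigmoid_with_temperature(2.0, 0.5)   # s(4): more extreme
softened = sigmoid_with_temperature(2.0, 4.0)    # s(0.5): closer to 0.5
```

With $T=0.5$ the output moves toward 1; with $T=4$ the same logit produces a probability much closer to 0.5.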
Question 4. Now look at the negative logits in the table. What do you notice about $s(z)$ and $s(-z)$?
# Answer below
They sum to 1! For example:

- $s(2) + s(-2) \approx 0.88 + 0.12 = 1$
- $s(4) + s(-4) \approx 0.98 + 0.02 = 1$
This is the symmetry property of the sigmoid: $s(-z) = 1 - s(z)$.

Logits of equal magnitude but opposite sign produce probabilities that "mirror" around 0.5.
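The symmetry property holds exactly, not just for the table's rounded values; a quick numerical check (again, not course code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Symmetry check: s(z) + s(-z) = 1, i.e. s(-z) = 1 - s(z), for any z.
for z in (0.5, 2.0, 4.0, 10.0):
    assert abs(sigmoid(z) + sigmoid(-z) - 1.0) < 1e-12
```

Algebraically, multiplying $s(-z)=\frac{1}{1+e^{z}}$ through by $e^{-z}$ gives $\frac{e^{-z}}{1+e^{-z}} = 1 - s(z)$.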
We will now go back to the whole class and discuss group answers for Part 1 in a plenary session.
For the following problem, please choose a group facilitator/representative who will also take notes on your discussion.
Each document (sentence/comment) is converted into a feature vector $x$.

- score (logit): $z = w \cdot x + b$, where
  - $x$ is the feature for the document
  - $w$ is the weight for the feature
  - $b$ is a bias term (a constant offset)
- predicted probability: $\hat{y} = s(z)$, where $s(z)=\frac{1}{1+\exp(-z)}$; $\hat{y}$ is the model's predicted probability that $y=1$ for this document.
- true label: $y \in \{0,1\}$ is the correct label for the document
  - $y=1$: positive (or "class 1")
  - $y=0$: negative (or "class 0")
The loss we use in lecture is the cross-entropy loss ($L_{CE}$):

$L_{CE} = -[y \log \hat{y} + (1-y)\log(1-\hat{y})]$

and its gradient with respect to the weight is $\frac{\partial L_{CE}}{\partial w} = (\hat{y}-y)x$.
Here is what each term means, in plain language:

- $L_{CE}$: the loss (how "wrong" the model is on this example)
- $w$: the weight for the feature
- $x$: the value of the feature for this document
- $w \cdot x + b$: the score ($z$), the total evidence before applying the sigmoid
- $s(w \cdot x + b)$: the predicted probability ($\hat{y}$)
- $\hat{y}-y$: the "error term" (positive if we predicted too high; negative if we predicted too low)
Finally, stochastic gradient descent (SGD) updates parameters by moving against the gradient:

$q \leftarrow q - h\,g$

where:

- $q$ is the parameter vector (it contains $w$ and $b$)
- $h$ is the learning rate
- $g$ is the gradient
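These pieces fit together in a few lines of code. A minimal sketch for the single-feature case used below; `cross_entropy`, `sgd_step`, and `lr` are our own names (`lr` plays the role of $h$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_hat, y):
    """L_CE = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sgd_step(w, b, x, y, lr):
    """One SGD update for single-feature logistic regression.

    Gradients: dL/dw = (y_hat - y) * x  and  dL/db = (y_hat - y).
    """
    y_hat = sigmoid(w * x + b)
    error = y_hat - y                        # the error term
    return w - lr * error * x, b - lr * error
```

For example, `sgd_step(0.0, 0.0, 3.0, 1.0, 0.1)` returns approximately `(0.15, 0.05)`, matching the worked example below.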
We will classify a movie review comment using a single feature:
- $x$ = number of positive words in the comment (from a small positive lexicon)
Consider the comment:
"The acting was great and the soundtrack was incredible, and the cinematography was amazing."
Assume our positive lexicon contains: {great, incredible, amazing}
So the feature is:
- $x = 3$ (great, incredible, amazing)
We will start with:
- $w = 0$, $b = 0$
So initially: $z = 0 \cdot 3 + 0 = 0$, and $\hat{y} = s(0) = 0.5$.

Let the learning rate be $h > 0$.
- Compute $\hat{y}-y$ at initialization, assuming the true label is $y = 1$ (a positive review). Is it positive or negative?
# Answer below
Compute: $\hat{y}-y = 0.5 - 1 = -0.5$.

Negative.
- Using $\frac{\partial L_{CE}}{\partial w} = (\hat{y}-y)x$, determine the sign (positive or negative) of $\frac{\partial L_{CE}}{\partial w}$.
# Answer below
Sign of gradient: $(\hat{y}-y)x = (-0.5)(3) = -1.5 < 0$.

The gradient is negative.
- Gradient descent updates: $w \leftarrow w - h\frac{\partial L_{CE}}{\partial w}$. Will $w$ increase or decrease?
# Answer below
The update subtracts the gradient, and subtracting a negative value increases the weight:

- $w$ increases
(Optional numeric update: $w \leftarrow 0 - h \cdot (-1.5) = 1.5h$.)
3b. The bias also updates: $b \leftarrow b - h\frac{\partial L_{CE}}{\partial b}$, with $\frac{\partial L_{CE}}{\partial b} = (\hat{y}-y)$. Will $b$ increase or decrease?
# Answer below
Since $\frac{\partial L_{CE}}{\partial b} = \hat{y}-y = -0.5 < 0$, subtracting a negative value increases the bias:

- $b$ increases
- After this update, will the new score $z = w\cdot x + b$ be larger or smaller than before? Therefore, will $\hat{y}=s(z)$ move toward 1 or toward 0?
# Answer below
Since $w$ and $b$ both increased and $x = 3 > 0$, the new score $z$ is larger than before, so $\hat{y}=s(z)$ moves toward 1 (the correct label).

(Optional numeric check: $z = 1.5h \cdot 3 + 0.5h = 5h > 0$, so $\hat{y} = s(5h) > 0.5$.)
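Case 1 can be checked end to end in a few lines. A sketch; the learning-rate value 0.1 is an arbitrary illustrative choice (the worksheet keeps it symbolic as $h$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Case 1: x = 3 positive words, true label y = 1, starting from w = b = 0.
w, b, x, y = 0.0, 0.0, 3.0, 1.0
lr = 0.1   # arbitrary illustrative value standing in for h

y_hat = sigmoid(w * x + b)   # s(0) = 0.5
error = y_hat - y            # -0.5: we predicted too low
w -= lr * error * x          # w: 0 -> 0.15 (increases)
b -= lr * error              # b: 0 -> 0.05 (increases)

new_y_hat = sigmoid(w * x + b)   # s(0.5): now above 0.5, moving toward 1
```

Both parameters move up, the score becomes positive, and the prediction drifts toward the correct label $y=1$.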
We will now reconvene as a class to discuss Case 1 before moving on.
Repeat Questions 1–4, but with the true label $y = 0$ (the same features, $x = 3$, but the review is actually negative).
- What is the sign of $\hat{y}-y$ now?
# Answer below
Compute: $\hat{y}-y = 0.5 - 0 = 0.5$.

Positive.
- Will $w$ increase or decrease?
# Answer below
Sign of gradient: $(\hat{y}-y)x = (0.5)(3) = 1.5 > 0$.

Subtracting a positive gradient decreases the weight:

- $w$ decreases
(Optional numeric update: $w \leftarrow 0 - h \cdot (1.5) = -1.5h$.)
6b. Will $b$ increase or decrease?
# Answer below
Since $\frac{\partial L_{CE}}{\partial b} = \hat{y}-y = 0.5 > 0$, subtracting a positive value decreases the bias:

- $b$ decreases
- Will $z$ increase or decrease? Will $\hat{y}$ move toward 1 or toward 0?
# Answer below
With a positive feature and a smaller (more negative) weight and bias, $z$ decreases, so $\hat{y}$ moves toward 0 (the correct label).

(Optional numeric check: $z = -1.5h \cdot 3 - 0.5h = -5h < 0$, so $\hat{y} = s(-5h) < 0.5$.)
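Case 2 can be checked the same way; again the learning-rate value 0.1 is just an illustrative stand-in for $h$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Case 2: same features (x = 3), but the true label is y = 0.
w, b, x, y = 0.0, 0.0, 3.0, 0.0
lr = 0.1   # arbitrary illustrative value standing in for h

y_hat = sigmoid(w * x + b)   # s(0) = 0.5
error = y_hat - y            # +0.5: we predicted too high
w -= lr * error * x          # w: 0 -> -0.15 (decreases)
b -= lr * error              # b: 0 -> -0.05 (decreases)

new_y_hat = sigmoid(w * x + b)   # s(-0.5): now below 0.5, moving toward 0
```

The same update rule, with the error's sign flipped, pushes the prediction the other way.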
- In one sentence: explain why $(\hat{y}-y)$ makes sense as an "error signal."
# Answer below
- if $\hat{y}>y$, then $\hat{y}-y>0$ and the update pushes $z$ down
- if $\hat{y}<y$, then $\hat{y}-y<0$ and the update pushes $z$ up
- In one sentence: explain why multiplying by $x$ makes sense (why a feature that appears more should change its weight more).
# Answer below
Multiplying by $x$ scales the correction by how much the feature contributed to the score: a feature that appears more (larger $x$) had more influence on the prediction, so its weight should be adjusted more.
We will now go back to the whole class and discuss group answers for Part 2 in a plenary session.
In Parts 1 and 2, you explored how logistic regression computes a probability and how training nudges its weights toward the correct label. Now we consider what happens when a model's probabilities are used to make real decisions.
Imagine you are using your model to screen resumes for a software engineering role. The model outputs the probability that each candidate is a good fit for the position.
You must decide where to set the threshold for which candidates move on. If you set a high threshold (e.g., 0.9), you only interview "sure bets" according to the model. If you set it lower (e.g., 0.4), you interview more people, including those the model is unsure about.
In Lab 1, we discussed how the origins of a dataset (like the New York Times Annotated Corpus) can introduce narrow perspectives or historical biases into a model.
- If you set a very high threshold (e.g., 0.9), how does this choice interact with existing biases in your training data?
# Answer below
If the training data is historically biased (e.g., favoring specific universities or demographics), the model learns that these features are high-confidence indicators of success. Setting a high threshold means you are only selecting the safest bets according to the model, which effectively locks in and automates those historical inequities. As a result, candidates with non-traditional backgrounds often have feature vectors that the model hasn't seen frequently, so the model assigns them lower confidence scores. A high threshold acts as a rigid barrier that these candidates may find difficult to cross, even if they are highly qualified.
- If you lower the threshold to 0.3 to be more inclusive, what is the "cost" to the hiring team? How do you narrow in on a fair threshold?
# Answer below
Lowering the threshold to 0.3 makes the process more inclusive, but it shifts the burden from the algorithm to the humans. The "cost" here is time and efficiency: the hiring team will have to manually review many more resumes, many of which might not be a good fit. Narrowing in on a fair threshold is less about finding a magic number and more about avoiding overreliance on automatic classifications. A fair approach might involve a system in which the model handles the obvious rejections, but humans take over for any candidate in the middle range. This system can also be calibrated to ensure the threshold yields a representative interview pool.
- Beyond accuracy, what are some considerations or metrics a team should look at when deciding on a threshold for a hiring tool? How could you evaluate the fairness of your chosen threshold?
# Answer below
Selection Rate: Does the threshold result in a selection rate for one group that is significantly lower than the highest-performing group?
False Negative Rate: How many highly qualified people are we rejecting? In hiring, a False Negative (missing a great hire) is often a bigger loss for diversity than a False Positive (interviewing someone who isn't a fit).
Feedback Loops: Is the threshold narrowing our future training data? If we only hire candidates the current model scores highly, the next model will be trained on an even narrower pool, reinforcing the same pattern.
Evaluation Strategy: Test the threshold on a test set of diverse resumes where you already know the outcome to see if the model/threshold combination would have excluded them.
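To make selection rate and false-negative rate concrete, here is a small sketch. The candidate data is entirely made up for illustration, and `metrics_at_threshold` is our own helper, not part of any course library:

```python
# Toy, fully made-up data: each tuple is (group, true_label, model_probability).
candidates = [
    ("A", 1, 0.95), ("A", 1, 0.80), ("A", 0, 0.60), ("A", 0, 0.20),
    ("B", 1, 0.85), ("B", 1, 0.55), ("B", 0, 0.40), ("B", 0, 0.10),
]

def metrics_at_threshold(data, group, threshold):
    """Selection rate and false-negative rate for one group at a threshold."""
    rows = [(y, p >= threshold) for g, y, p in data if g == group]
    selected = sum(1 for _, sel in rows if sel)
    qualified = [sel for y, sel in rows if y == 1]
    missed = sum(1 for sel in qualified if not sel)  # qualified but rejected
    selection_rate = selected / len(rows)
    fnr = missed / len(qualified) if qualified else 0.0
    return selection_rate, fnr

# The same threshold can treat groups very differently:
rates = {(g, t): metrics_at_threshold(candidates, g, t)
         for t in (0.5, 0.9) for g in ("A", "B")}
```

In this toy data, a 0.9 threshold selects nobody from group B and rejects every qualified B candidate, while a 0.5 threshold gives both groups a nonzero selection rate: exactly the kind of gap the metrics above are meant to surface.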
