Add docs on person processing #14512
base: master
pauldambra
left a comment
amazing
Co-authored-by: Paul D'Ambra <paul@posthog.com>
vdekrijger
left a comment
Amazing, great read and also helped me better understand the PoE thing you want to look into 🙌 !
| A `distinct_id` is an identifier attached to every event. It's how we know which person an event belongs to. A person can have multiple distinct IDs (e.g., an anonymous session ID and a logged-in user ID). | ||
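A minimal sketch of the relationship described above, where several distinct IDs resolve to one person. The `Person` and `PersonIndex` names and the in-memory dictionary are hypothetical and only illustrate the idea; they are not taken from the PostHog codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Person:
    person_id: str
    # Every distinct ID known to belong to this person.
    distinct_ids: set = field(default_factory=set)


class PersonIndex:
    """Hypothetical in-memory lookup from distinct_id to Person."""

    def __init__(self) -> None:
        self._by_distinct_id: Dict[str, Person] = {}

    def attach(self, person: Person, distinct_id: str) -> None:
        # Record that this distinct ID belongs to the given person.
        person.distinct_ids.add(distinct_id)
        self._by_distinct_id[distinct_id] = person

    def person_for_event(self, event: dict) -> Optional[Person]:
        # The event's distinct_id is the only key used for the lookup.
        return self._by_distinct_id.get(event["distinct_id"])


index = PersonIndex()
person = Person("person-1")
index.attach(person, "anon-session-uuid")   # anonymous session ID
index.attach(person, "user@example.com")    # logged-in user ID
```

Both `{"distinct_id": "anon-session-uuid"}` and `{"distinct_id": "user@example.com"}` resolve to the same `Person` here, while an unknown distinct ID returns `None`.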
| Some example Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID. |
| Some example Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID. | |
| Some commonly used Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID. |
gesh
left a comment
Awesome! Thank you for putting all the knowledge in one place!
| (Cookieless events use a placeholder distinct ID, which is replaced later with a privacy-preserving hash. The placeholder is not suitable as a partitioning key because it has the same value for every cookieless event, so the IP address is used instead.) | ||
| **Implications**: | ||
| - Events with the **same** distinct_id go to the **same** Kafka partition → ordering preserved |
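The implication above follows from key-based partitioning, which can be sketched roughly as below. This is an illustration under assumptions, not the actual capture code: the partition count, the placeholder value, the `ip` field name, and the use of SHA-256 (as a stand-in for whatever hash the real producer uses) are all hypothetical, and the real key may include more than the distinct ID.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count, for illustration only
COOKIELESS_PLACEHOLDER = "<cookieless-placeholder>"  # hypothetical placeholder value


def partition_key(event: dict) -> str:
    """Pick the partitioning key for an event."""
    distinct_id = event["distinct_id"]
    if distinct_id == COOKIELESS_PLACEHOLDER:
        # The placeholder is identical for every cookieless event,
        # so fall back to the IP address as the key.
        return event["ip"]
    return distinct_id


def partition_for(event: dict, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map the key to a partition with a stable hash."""
    digest = hashlib.sha256(partition_key(event).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


first = {"distinct_id": "user-1", "ip": "203.0.113.1"}
second = {"distinct_id": "user-1", "ip": "203.0.113.2"}
# Same distinct_id, so both events land on the same partition.
same_partition = partition_for(first) == partition_for(second)
```

Because the hash is deterministic, two events with the same key always map to the same partition, which is what lets Kafka preserve their relative order once they have been appended.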
Can the order of events change before they are inserted into Kafka?
For example:
- We have Event A and Event B (in this order).
- They are sent in two separate calls to the `/capture` endpoint
- Event A is slowly processed by one Rust process
- In parallel, Event B is processed faster in another Rust process
- Event B is ingested into the Kafka topic
- Event A is ingested into the Kafka topic
If that's true, and we have $identify -> customEvent but customEvent is processed first, will we set the correct person_id on it? customEvent carries the UUID from the $identify call, which differs from the anonymous user's UUID, and we haven't created a person for it yet.
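A minimal sketch of the scenario in the list above, assuming each request is handled by its own process and is appended to the partition when that process finishes. The `InFlight` type and the timings are hypothetical. It only shows how append order can differ from send order; it does not model how ingestion reacts afterwards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InFlight:
    name: str
    sent_at_ms: int        # when the client sent the request
    processing_ms: int     # time spent in the capture process


def partition_log(requests: List[InFlight]) -> List[str]:
    """Return event names in the order they reach the Kafka partition."""
    # An event is appended when its own process finishes, regardless of
    # when other in-flight requests were sent.
    by_finish_time = sorted(requests, key=lambda r: r.sent_at_ms + r.processing_ms)
    return [r.name for r in by_finish_time]


requests = [
    InFlight("Event A", sent_at_ms=0, processing_ms=50),  # sent first, slow
    InFlight("Event B", sent_at_ms=5, processing_ms=10),  # sent second, fast
]
print(partition_log(requests))  # ['Event B', 'Event A']
```

Kafka preserves the order in which records are appended to a partition, so in this sketch a consumer would see Event B before Event A even though both share a partition.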
in a previous job we processed data with a sliding window to re-order it
but it was very expensive
i think the reason we have the confusing squash/override/etc is to keep ingestion cheap
(mostly commenting so i get a notification when the correct answer appears)
It makes sense to me that this could happen, but I would probably defer to @PostHog/team-ingestion
Do we detect that this happened when event A is processed, and spit out an override?
| --- | ||
| ## System overview |
❤️
* edits * progress * processing
Changes
Adds a section to the engineering handbook about person processing.
As this is an area that touches capture, ingestion, clickhouse, hogql and queries, the whole system is not owned by any one team. To that end, I thought it would be useful to provide a high-level picture of how pieces fit together.
Triggered by https://posthog.slack.com/archives/C08JQTX5RRP/p1767878236120559
Checklist
vercel.jsonArticle checklist