Member

@robbie-c robbie-c commented Jan 22, 2026

Changes

Adds a section to the engineering handbook about person processing.

As this is an area that touches capture, ingestion, ClickHouse, HogQL, and queries, the whole system is not owned by any one team. To that end, I thought it would be useful to provide a high-level picture of how the pieces fit together.

Triggered by https://posthog.slack.com/archives/C08JQTX5RRP/p1767878236120559

Checklist

  • Words are spelled using American English
  • PostHog product names are in title case. It's "Product Analytics" not "Product analytics". If talking about a category of product, use sentence case e.g. "There are a lot of product analytics tools, but PostHog's Product Analytics is the best"
  • Titles are in sentence case
  • Feature names are in sentence case. It's "Click here to create a trend insight" not "... create a Trend Insight" and so on.
  • Use relative URLs for internal links
  • If I moved a page, I added a redirect in vercel.json
  • Remove this template if you're not going to fill it out!

Article checklist

  • I've added (at least) 3-5 internal links to this new article
  • I've added keywords for this page to the rank tracker in Ahrefs
  • I've checked the preview build of the article
  • The date on the article is today's date
  • I've added this to the relevant "Tutorials and guides" docs page (if applicable)


vercel bot commented Jan 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
posthog Error Error Jan 23, 2026 11:38am

Request Review

@robbie-c robbie-c requested review from a team January 22, 2026 13:43
Member

@pauldambra pauldambra left a comment

amazing

Contributor

@vdekrijger vdekrijger left a comment

Amazing, great read and also helped me better understand the PoE thing you want to look into 🙌 !


A `distinct_id` is an identifier attached to every event. It's how we know which person an event belongs to. A person can have multiple distinct IDs (e.g., an anonymous session ID and a logged-in user ID).

Some example Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID.
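
To make the "one person, many distinct IDs" relationship concrete, here is a minimal in-memory sketch. The `PersonStore` class and its method names are hypothetical and are not PostHog's schema or API; it only illustrates that several distinct IDs can resolve to the same person.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PersonStore:
    """Toy mapping from distinct IDs to person IDs (hypothetical, for illustration only)."""

    distinct_id_to_person: dict = field(default_factory=dict)

    def person_for(self, distinct_id: str) -> str:
        """Return the person ID for a distinct ID, creating a new person if it is unseen."""
        if distinct_id not in self.distinct_id_to_person:
            self.distinct_id_to_person[distinct_id] = str(uuid.uuid4())
        return self.distinct_id_to_person[distinct_id]

    def link(self, known_id: str, new_id: str) -> str:
        """Attach a second distinct ID to the person that already owns known_id."""
        person_id = self.person_for(known_id)
        self.distinct_id_to_person[new_id] = person_id
        return person_id


store = PersonStore()
anonymous_person = store.person_for("anon-session-uuid")
logged_in_person = store.link("anon-session-uuid", "user@example.com")
```

After the `link` call, both the anonymous session ID and the email address resolve to the same person ID.
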
Contributor

Suggested change
- Some example Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID.
+ Some commonly used Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's `User` table in their database, a Stripe `cus_xxx` ID.

Contributor

@gesh gesh left a comment

Awesome! Thank you for putting all the knowledge in one place!

(Cookieless events use a placeholder distinct ID, which is replaced later with a privacy-preserving hash. The placeholder is not suitable as a partitioning key, because it is the same value for every cookieless event, so the IP address is used instead.)

**Implications**:
- Events with the **same** distinct_id go to the **same** Kafka partition → ordering preserved
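
The partitioning rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the capture service's code: the placeholder value, the partition count, and the use of SHA-256 are all hypothetical stand-ins (a real Kafka producer uses its own partitioner), and the real key may include more than the distinct ID.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count
COOKIELESS_PLACEHOLDER = "cookieless-placeholder"  # hypothetical sentinel value


def partition_key(distinct_id: str, ip_address: str) -> str:
    """Pick the partitioning key: the distinct ID, or the IP for cookieless events."""
    if distinct_id == COOKIELESS_PLACEHOLDER:
        # Every cookieless event shares the placeholder, so it would send all of
        # them to a single partition; fall back to the IP address instead.
        return ip_address
    return distinct_id


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (SHA-256 as a stand-in)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


first = partition_for(partition_key("user-123", "203.0.113.7"))
second = partition_for(partition_key("user-123", "198.51.100.9"))
```

Because the hash is deterministic, `first` and `second` are the same partition even though the two events arrived from different IP addresses, which is what keeps events for one distinct ID in order.
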
Contributor

Can the order of events change before they are inserted into Kafka?
For example:

  1. We have Event A and Event B (in this order).
  2. They are sent in two separate calls to the /capture endpoint.
  3. Event A is processed slowly by one Rust process.
  4. In parallel, Event B is processed faster by another Rust process.
  5. Event B is ingested into the Kafka topic.
  6. Event A is ingested into the Kafka topic.

If that's true, and we have $identify -> customEvent but the customEvent is processed first, will we set the correct person_id on it? The customEvent carries the identified UUID, which differs from the anonymous user's UUID, and we haven't created a person for it yet.
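
The scenario in the steps above can be sketched as a toy simulation. It only illustrates the question: a partition keeps events in the order they were appended, which need not be the order the client sent them. It says nothing about how ingestion actually handles this case. The `Event` type, the `sent_at` field, and the assumption that both events carry the same distinct ID are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    distinct_id: str
    sent_at: int  # order in which the client sent the event


# A single partition modeled as an append-only list.
partition_log = []


def append(event: Event) -> None:
    """Append an event to the partition; consumers read in append order."""
    partition_log.append(event)


event_a = Event("$identify", "user-123", sent_at=1)
event_b = Event("customEvent", "user-123", sent_at=2)

# Event B's capture request finishes first, so it is appended before Event A.
append(event_b)
append(event_a)

consumed_order = [event.name for event in partition_log]
sent_order = [event.name for event in sorted(partition_log, key=lambda e: e.sent_at)]
```

Here `consumed_order` starts with `customEvent` while `sent_order` starts with `$identify`, so a consumer would see the custom event before the identify call that introduced its distinct ID.
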

Member

in a previous job we processed data with a sliding window to re-order it
but it was very expensive

i think the reason we have the confusing squash/override/etc is to keep ingestion cheap

Member

(mostly commenting so i get a notification when the correct answer appears)

Member Author

@robbie-c robbie-c Jan 23, 2026

It makes sense to me that this could happen, but I would probably defer to @PostHog/team-ingestion

Do we detect that this happened when event A is processed, and spit out an override?


---

## System overview
Contributor

❤️

* edits

* progress

* processing