Problem
We (GoJek) currently use Raccoon to source clickstream events from the gojek app. The concrete product proto contains an event_timestamp field, which downstream systems such as the DWH can use to partition the data. However, we see some data arriving in partitions for future dates, while other data with the same event timestamp date arrives on different days. Two scenarios cause this issue:
- The user resets the time/clock in the mobile app to a future date
- The app was inactive, and the mobile SDK sent those events at a later point in time
Is there any workaround?
The DWH can partition on a field that records the time of ingestion into the warehouse. However, this requires backfilling and repartitioning existing data, and the upstream applications may need to change the way they query.
What is the impact?
Queries from upstream applications and services return erroneous results.
Which version was this found?
NA
Solution
Raccoon needs to provide an ingestion time for each event, defined as the time the event was ingested into Raccoon. This enables the DWH to partition data on ingestion time as an alternative to event_timestamp.
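To make the proposal concrete, here is a minimal sketch of the idea, written in Python for illustration only and not taken from Raccoon's codebase. The Event class, the ingestion_time field name, the helper functions, and the injectable clock parameter are all assumptions. Only event_timestamp comes from the existing proto.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Event:
    """Hypothetical stand-in for the product proto."""
    name: str
    event_timestamp: datetime                   # set by the client; may be skewed or stale
    ingestion_time: Optional[datetime] = None   # assumed new field, set by the server


def utc_now() -> datetime:
    """Server-side clock, independent of the device clock."""
    return datetime.now(timezone.utc)


def stamp_ingestion_time(event: Event, clock: Callable[[], datetime] = utc_now) -> Event:
    """Record the time the server received the event."""
    event.ingestion_time = clock()
    return event


def partition_date(event: Event, by: str = "ingestion_time") -> str:
    """Return the UTC date (YYYY-MM-DD) of the chosen timestamp field."""
    ts = getattr(event, by)
    if ts is None:
        raise ValueError(f"{by} is not set on the event")
    return ts.astimezone(timezone.utc).date().isoformat()


if __name__ == "__main__":
    received = datetime(2021, 3, 1, 10, 0, tzinfo=timezone.utc)
    # Device clock set ten years ahead, and an event buffered for over a week.
    skewed = Event("click", datetime(2031, 3, 1, 10, 0, tzinfo=timezone.utc))
    late = Event("click", datetime(2021, 2, 20, 23, 0, tzinfo=timezone.utc))
    for e in (skewed, late):
        stamp_ingestion_time(e, clock=lambda: received)
        print(partition_date(e, by="event_timestamp"), partition_date(e))
```

With the timestamp stamped on the server, both events land in the partition for the day they were received, regardless of what the device clock reported. The event_timestamp field is left untouched, so consumers that still want client time can keep using it.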