diff --git a/community/governance.md b/community/governance.md index 087b3599dbf..b13c07fbeba 100644 --- a/community/governance.md +++ b/community/governance.md @@ -217,8 +217,9 @@ The "RFC" (request for comments) process is intended to provide a consistent and 2. Users, Contributors, and Maintainers discuss and upvote the draft 3. If confident on its success, contributor completes the RFC with more in-detail technical specifications 4. Maintainers approve RFC when it is ready -5. Maintainers meet every quarter and choose three or five items based on popularity and alignment with project vision and goals -6. Those selected items become part of the Mid-term goals +5. Once finalized, the RFC should be added as an [Architecture Decision Record (ADR)](../docs/adr/README.md) in the repository +6. Maintainers meet every quarter and choose three or five items based on popularity and alignment with project vision and goals +7. Those selected items become part of the Mid-term goals ### When to Use RFCs diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index a76f593b643..befa359e5c8 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -181,3 +181,15 @@ * [Versioning policy](project/versioning-policy.md) * [Release process](project/release-process.md) * [Feast 0.9 vs Feast 0.10+](project/feast-0.9-vs-feast-0.10+.md) +* [Architecture Decision Records](adr/README.md) + * [ADR-0001: Feature Services](adr/ADR-0001-feature-services.md) + * [ADR-0002: Component Refactor](adr/ADR-0002-component-refactor.md) + * [ADR-0003: On-Demand Transformations](adr/ADR-0003-on-demand-transformations.md) + * [ADR-0004: Entity Join Key Mapping](adr/ADR-0004-entity-join-key-mapping.md) + * [ADR-0005: Stream Transformations](adr/ADR-0005-stream-transformations.md) + * [ADR-0006: Kubernetes Operator](adr/ADR-0006-kubernetes-operator.md) + * [ADR-0007: Unified Feature Transformations](adr/ADR-0007-unified-feature-transformations.md) + * [ADR-0008: Feature View 
Versioning](adr/ADR-0008-feature-view-versioning.md) + * [ADR-0009: Contribution and Extensibility](adr/ADR-0009-contribution-extensibility.md) + * [ADR-0010: Vector Database Integration](adr/ADR-0010-vector-database-integration.md) + * [ADR-0011: Data Quality Monitoring](adr/ADR-0011-data-quality-monitoring.md) diff --git a/docs/adr/ADR-0001-feature-services.md b/docs/adr/ADR-0001-feature-services.md new file mode 100644 index 00000000000..c234e67198f --- /dev/null +++ b/docs/adr/ADR-0001-feature-services.md @@ -0,0 +1,87 @@ +# ADR-0001: Feature Services + +## Status + +Accepted + +## Context + +Feast's Feature Views allowed for storage-level grouping of features based on how they are produced. However, there was no concept of a retrieval-level grouping of features that maps to models. Without this: + +- There was no way to track which features were used to train a model or serve a specific model. +- Retrieving features during training required a complete list of features to be provided and persisted manually, which was error-prone. +- There was no way to ensure consumers wouldn't face breaking changes when feature views changed. + +## Decision + +Introduce a `FeatureService` object that allows users to define which features to use for a specific ML use case. A feature service groups features from one or more feature views for model training and online serving. 
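Conceptually, a feature service resolves to a flat list of `view:feature` references at retrieval time. A minimal sketch of that resolution rule, using illustrative stand-in classes rather than Feast's actual API:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for one feature-view selection inside a service
# (hypothetical class, not part of Feast).
@dataclass
class Selection:
    view: str
    features: list
    aliases: dict = field(default_factory=dict)

def resolve_refs(selections):
    """Flatten selections into 'view:feature' refs, applying any aliases."""
    return [
        f"{s.view}:{s.aliases.get(f, f)}"
        for s in selections
        for f in s.features
    ]

refs = resolve_refs([
    Selection(
        "customer_sales",
        ["average_order_value", "max_order_value"],
        aliases={"average_order_value": "avg_o_val"},
    ),
])
# refs: ['customer_sales:avg_o_val', 'customer_sales:max_order_value']
```

Online and offline retrieval by service name then reduce to looking up this list in the registry and fetching each referenced feature.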
+ +### API Design + +Feature services use a Pandas-like API where feature views can be referenced directly: + +```python +from feast import FeatureService + +feature_service = FeatureService( + name="my_model_v1", + features=[ + shop_raw, # select all features + customer_sales[["average_order_value", "max_order_value"]], # select specific features + ], +) +``` + +Feature selection with aliasing: + +```python +feature_service = FeatureService( + name="my_model_v1", + features=[ + shop_raw, + customer_sales[["average_order_value", "max_order_value"]] + .alias({"average_order_value": "avg_o_val"}), + ], +) +``` + +### Retrieval + +```python +# Online inference +row = store.get_online_features( + feature_service="my_model_v1", + entity_rows=[{"customer_id": 123, "shop_id": 456}], +).to_dict() + +# Training +historical_df = store.get_historical_features( + feature_service="my_model_v1", + entity_df=entity_df, +) +``` + +### Key Decisions + +- **Name**: `FeatureService` was chosen over `FeatureSet` because it conveys the concept of a serving layer bridging models and data. `FeatureService` is analogous to model services in model serving systems. +- **Mutability**: Feature services are mutable. Immutability may be considered in the future. +- **Versioning**: Not included in the first version; users manage versions through naming conventions. + +## Consequences + +### Positive + +- Users can track which features are used for training and serving specific models. +- Provides a consistent interface for both online and offline feature retrieval. +- Reduces error-prone manual feature list management. +- Enables future functionality like logging, monitoring, and endpoint provisioning. + +### Negative + +- Adds another abstraction layer to the Feast data model. +- Feature services are mutable, which may lead to inconsistencies if not carefully managed. 
+ +## References + +- Original RFC: [Feast RFC-015: Feature Services](https://docs.google.com/document/d/1jC0RJbyYLilXTOrLVBeR22PYLK5fe2JmQK1mKdZ-eno/edit) +- Implementation: `sdk/python/feast/feature_service.py` diff --git a/docs/adr/ADR-0002-component-refactor.md b/docs/adr/ADR-0002-component-refactor.md new file mode 100644 index 00000000000..854a1c3765b --- /dev/null +++ b/docs/adr/ADR-0002-component-refactor.md @@ -0,0 +1,71 @@ +# ADR-0002: Component Refactor + +## Status + +Accepted + +## Context + +The Feast project originally existed as a single monolithic repository containing many tightly coupled components: Core Registry, Serving Service, Job Service, Client Libraries, Spark ingestion code, Helm charts, and Terraform configurations. + +Two distinct user groups were identified: + +- **Platform teams**: Capable of running a complete feature store on Kubernetes with Spark, managing large-scale infrastructure. +- **Solution teams**: Small data science or data engineering teams wanting to solve ML business problems without deploying and managing Kubernetes or Spark clusters. + +Delivering a viable minimal product to solution teams required a lighter-weight approach. However, the monolithic codebase made this difficult due to tight coupling between components. + +## Decision + +Adopt a staged approach to decouple the Feast codebase into modular, composable components: + +### Stage 1: Move Out Non-Core Components + +Split the monorepo into focused repositories: + +- **feast** (main repo): Feast Python SDK, Documentation, and Protos (starting at v0.10.0). +- **feast-java**: Core Registry, Serving, and Java Client. +- **feast-spark**: Spark Ingestion, Spark Python SDK, and Job Service. +- **feast-helm-charts**: Helm charts for Kubernetes deployments. + +### Stage 2: Document Contracts + +Document all component-level contracts (I/O), API specifications (Protobuf), data contracts, and architecture diagrams. 
+ +### Stage 3: Remove Coupling + +Remove unnecessary coupling between components, keeping only service contracts (Protobuf), data contracts, and integration tests as shared dependencies. + +### Stage 4: Converge + +Reverse the relationship so the main Feast SDK can use Spark-related code as a specific compute provider, rather than requiring it. + +### Key Principles + +- The main Feast repository provides a fully functional Python-based feature store that works without infrastructure dependencies. +- Spark and Kubernetes-based components remain available for platform teams. +- All existing functionality is maintained with no breaking changes during the transition. + +## Consequences + +### Positive + +- Enabled a super lightweight core framework for Feast that teams can start with in seconds. +- Made it possible for teams to pick and choose components they want to adopt. +- Teams with existing internal implementations (ingestion, registry, serving) can integrate more easily. +- The Python SDK became the primary entry point, significantly lowering the barrier to getting started. + +### Negative + +- Temporary divergence between Feast and Feast-Spark codebases during the transition. +- Multiple repositories added coordination overhead during the migration period. + +### Neutral + +- Components have since been reconverged into the main repository with a cleaner separation of concerns. +- The Go, Java, and Python SDKs coexist in the main repository under separate directories. 
+ +## References + +- Original RFC: [Feast RFC-020: Component Refactor](https://docs.google.com/document/d/1CjR3Ph3l65hF5bRuchR9u9WSoirnIuEb7ILY9Ioh1Sk/edit) +- GitHub Discussion: [#1353](https://github.com/feast-dev/feast/discussions/1353) diff --git a/docs/adr/ADR-0003-on-demand-transformations.md b/docs/adr/ADR-0003-on-demand-transformations.md new file mode 100644 index 00000000000..66365d0358c --- /dev/null +++ b/docs/adr/ADR-0003-on-demand-transformations.md @@ -0,0 +1,97 @@ +# ADR-0003: On-Demand Transformations + +## Status + +Accepted + +## Context + +For many ML use cases, it is not possible or feasible to precompute and persist feature values for serving: + +- **Transactional use cases**: Inputs are part of the transaction/booking/order event. +- **Clickstream use cases**: User event data contains raw data used for feature engineering. +- **Location-based use cases**: Distance calculations between feature views (e.g., customer and driver locations). +- **Time-dependent features**: e.g., `user_account_age = current_time - account_creation_time`. +- **Crossed features**: e.g., user-user, user-tweet based features where the keyspace is too large to precompute. + +Additionally, Feast did not provide a means for post-processing features, forcing all feature development to upstream systems. + +## Decision + +Introduce **On-Demand Feature Views** as a feature transformation layer with the following properties: + +- Transformations execute at retrieval time (post-processing step after reading from the store). +- The calling client can input data as part of the retrieval request via a `RequestSource`. +- Users define arbitrary transformations on both stored features and request-time input data. +- Transformations are row-level operations only (no aggregations). 
+ +### Definition API + +Uses the `@on_demand_feature_view` decorator (Option 3 from the RFC was chosen): + +```python +import pandas as pd + +from feast import on_demand_feature_view, Field, RequestSource +from feast.types import Float64 + +input_request = RequestSource( + name="transaction", + schema=[Field(name="input_lat", dtype=Float64), Field(name="input_lon", dtype=Float64)], +) + +@on_demand_feature_view( + sources=[driver_fv, input_request], + schema=[Field(name="distance", dtype=Float64)], +) +def driver_distance(inputs: pd.DataFrame) -> pd.DataFrame: + from haversine import haversine + df = pd.DataFrame() + df["distance"] = inputs.apply( + lambda r: haversine((r["lat"], r["lon"]), (r["input_lat"], r["input_lon"])), + axis=1, + ) + return df +``` + +### Retrieval + +```python +# Online - request data passed as entity rows +features = store.get_online_features( + features=["driver_distance:distance"], + entity_rows=[{"driver_id": 1001, "input_lat": 1.234, "input_lon": 5.678}], +).to_dict() + +# Offline - request data columns included in entity_df +df = store.get_historical_features( + entity_df=entity_df_with_request_columns, + features=["driver_distance:distance"], +).to_df() +``` + +### Key Decisions + +- **Decorator approach** chosen over adding transforms to FeatureService or FeatureView directly. This avoids changing existing APIs and keeps transformations self-contained. +- **Pandas DataFrames** as the input/output type to support vectorized operations. +- **All imports must be self-contained** within the function block for serialization. +- **Offline transformations** initially execute client-side using Dask for scalability. +- **Feature Transformation Server (FTS)** handles online transformations via HTTP/REST, deployed at `apply` time. + +## Consequences + +### Positive + +- Enables real-time feature engineering that depends on request-time data. +- Keeps feature logic co-located with feature definitions in the repository. 
+- Provides a consistent interface for both online and offline feature retrieval. +- The FTS allows horizontal scaling independent of feature serving. + +### Negative + +- Adds computational overhead to the serving path since transformations run at read time. +- On-demand feature views are limited to row-level transformations (no aggregations). +- Python function serialization requires self-contained imports within function blocks. + +## References + +- Original RFC: [Feast RFC-021: On-Demand Transformations](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit) +- Implementation: `sdk/python/feast/on_demand_feature_view.py` diff --git a/docs/adr/ADR-0004-entity-join-key-mapping.md b/docs/adr/ADR-0004-entity-join-key-mapping.md new file mode 100644 index 00000000000..e6d09df4f34 --- /dev/null +++ b/docs/adr/ADR-0004-entity-join-key-mapping.md @@ -0,0 +1,78 @@ +# ADR-0004: Entity Join Key Mapping + +## Status + +Accepted + +## Context + +Multiple different entity keys in the source data may need to map onto the same entity from the feature data table during a join. For example, `spammer_id` and `reporter_id` may both need the `years_on_platform` feature from a table keyed by `user_id`. + +Without entity join key mapping: + +- Users had to rename columns in their entity dataframe to match the feature view's join key before retrieval. +- It was impossible to join a feature view twice on two different columns in the entity data (e.g., getting user features for both `spammer_id` and `reporter_id` in the same query). + +### Example + +Entity source data: + +| spammer_id | reporter_id | timestamp | +|------------|-------------|------------| +| 2 | 8 | 1629909366 | +| 1 | 2 | 1629909323 | + +Desired joined data should include `spammer_feature_a` and `reporter_feature_a`, both sourced from the same `user` feature view but joined on different keys. 
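Absent the mapping, users had to emulate this double join by hand. The desired result can be sketched in plain pandas with hypothetical values; Feast's join key mapping performs the equivalent renaming internally:

```python
import pandas as pd

# Feature table keyed by user_id (hypothetical values).
user_features = pd.DataFrame(
    {"user_id": [1, 2, 8], "years_on_platform": [4, 7, 2]}
)

# Entity rows where two different columns both refer to users.
entity_df = pd.DataFrame(
    {"spammer_id": [2, 1], "reporter_id": [8, 2]}
)

def mapped(prefix):
    # Rename the join key and feature column for one role of the user entity.
    return user_features.rename(columns={
        "user_id": f"{prefix}_id",
        "years_on_platform": f"{prefix}_years_on_platform",
    })

# Join the same feature table twice, once per mapped join key.
joined = (
    entity_df
    .merge(mapped("spammer"), on="spammer_id")
    .merge(mapped("reporter"), on="reporter_id")
)
```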
+ +## Decision + +Implement join key overrides using a `with_join_key_map()` method on feature views, combined with `with_name()` for disambiguation. This was **Option 8b** from the RFC. + +### API + +```python +abuse_feature_service = FeatureService( + name="my_abuse_model_v1", + features=[ + user_features + .with_name("reporter_features") + .with_join_key_map({"user_id": "reporter_id"}), + user_features + .with_name("spammer_features") + .with_join_key_map({"user_id": "spammer_id"}), + ], +) +``` + +### Key Decisions + +- **Query-time mapping** rather than registration-time. This provides flexibility since the same feature view can be used with different mappings in different contexts. +- **Join key level mapping** rather than entity-level mapping. While entity-level mapping (Option 10) better preserves abstraction boundaries, join key mapping is more flexible and doesn't require registering additional entities. +- **`with_name()` required** when using the same feature view multiple times to avoid output column name collisions. If omitted, a name collision error is raised. +- **Mapping overwrites wholly**: specifying a mapping replaces the default join behavior entirely. If you want the original join key included, it must be explicitly listed. + +### Implementation + +- **Offline (historical) retrieval**: After feature subtable cleaning and dedup, entity columns are renamed based on the mapping before the join. +- **Online retrieval**: Shadow entity keys are translated to the original join key for the online store lookup, then results are remapped to the shadow entity names. +- The `join_key_map` is stored on `FeatureViewProjection` and flows through both online and offline retrieval paths. + +## Consequences + +### Positive + +- Users can join the same feature view on different entity columns in a single query. +- No need to register additional entities or manually rename columns before retrieval. +- Works consistently across both online and offline retrieval. 
+- Feature view definitions remain clean and reusable. + +### Negative + +- Adds complexity to the retrieval path with column renaming logic. +- Users must remember to use `with_name()` to avoid collisions when joining the same feature view multiple times. + +## References + +- Original RFC: [Feast RFC-023: Shadow Entities Mapping](https://docs.google.com/document/d/1TsCwKf3nVXTAfL0f8i26jnCgHA3bRd4dKQ8QdM87vIA/edit) +- GitHub Issue: [#1762](https://github.com/feast-dev/feast/issues/1762) +- Implementation: `sdk/python/feast/feature_view.py` (`with_join_key_map` method), `sdk/python/feast/feature_view_projection.py` (`join_key_map` field) diff --git a/docs/adr/ADR-0005-stream-transformations.md b/docs/adr/ADR-0005-stream-transformations.md new file mode 100644 index 00000000000..8d2e97f61c4 --- /dev/null +++ b/docs/adr/ADR-0005-stream-transformations.md @@ -0,0 +1,93 @@ +# ADR-0005: Stream Transformations + +## Status + +Accepted + +## Context + +Feast supported batch features well but lacked in-house support for pull-based stream ingestion or registered stream transformations. While Kafka and Kinesis data sources could be registered, users had to either: + +- Write a custom Provider to launch ingestion jobs outside the Feast environment. +- Manually push stream data into the online store via the Push API. + +The stream transformation pipeline existed entirely outside of Feast, making it harder to track, version, and manage streaming features. + +## Decision + +Introduce a `StreamFeatureView` and a `StreamProcessor` interface to provide a standardized pipeline for ingesting and transforming stream data. 
+ +### Stream Feature View + +```python +from datetime import timedelta + +from feast import Aggregation, Field, stream_feature_view +from feast.types import Float32 + +@stream_feature_view( + entities=[entity], + ttl=timedelta(days=30), + owner="test@example.com", + online=True, + schema=[Field(name="dummy_field", dtype=Float32)], + description="Stream feature view with aggregations", + aggregations=[ + Aggregation(column="dummy_field", function="max", time_window=timedelta(days=1)), + Aggregation(column="dummy_field2", function="count", time_window=timedelta(days=24)), + ], + timestamp_field="event_timestamp", + mode="spark", + source=stream_source, +) +def pandas_view(pandas_df): + df = pandas_df.transform(lambda x: x + 10, axis=1) + return df +``` + +### Stream Processor + +The `StreamProcessor` is a pluggable interface for stream engines (Spark, Flink, etc.): + +```python +class StreamProcessor(ABC): + sfv: StreamFeatureView + data_source: DataSource + + def ingest_stream_feature_view(self) -> None: ... + def _ingest_stream_data(self) -> StreamTable: ... + def _construct_transformation_plan(self, table: StreamTable) -> StreamTable: ... + def _write_to_online_store(self, table: StreamTable) -> None: ... +``` + +### Unified Push API + +A unified push API was introduced to allow pushing features to both online and offline stores, supporting the Kappa architectural approach to streaming. + +### Aggregations + +Built-in aggregation functions: `sum`, `count`, `mean`, `max`, `min`. Aggregations use full aggregation with RocksDB for the initial implementation, keeping it simple while reducing request-time latency. + +### Key Decisions + +- **Full aggregations** chosen over partial aggregations for simplicity and lower request-time latency, using RocksDB to handle memory pressure. +- **Single time window restriction** for initial release; aggregations across different time windows (stream joins) add significant complexity. 
+- **User-managed ingestion**: Users handle their own ingestion jobs using the StreamProcessor interface with their preferred streaming engine. + +## Consequences + +### Positive + +- Streaming features can be registered and tracked in the Feast registry alongside batch features. +- UDFs for stream transformations are versioned with the feature view definition. +- The pluggable StreamProcessor interface supports multiple streaming engines. +- Unified Push API enables backfilling streaming features to the offline store. + +### Negative + +- Users must implement their own StreamProcessor for their streaming engine. +- Aggregation support is limited to basic functions in the initial release. +- Stream joins across different time windows are not supported. + +## References + +- Original RFC: [Feast RFC-036: Stream Transformations](https://docs.google.com/document/d/1Onjy-kiRlHt0USw5ggHu40hpezw1AV8KsgJAh46LoNY/edit) +- Implementation: `sdk/python/feast/stream_feature_view.py` diff --git a/docs/adr/ADR-0006-kubernetes-operator.md b/docs/adr/ADR-0006-kubernetes-operator.md new file mode 100644 index 00000000000..6969673d98d --- /dev/null +++ b/docs/adr/ADR-0006-kubernetes-operator.md @@ -0,0 +1,92 @@ +# ADR-0006: Kubernetes Operator + +## Status + +Accepted + +## Context + +As the Feast project grew, deploying a fully functional Feature Store in a production-like manner became increasingly difficult. Existing installers required many manual operations that led to configuration errors. Users needed a simpler way to install and maintain Feature Store environments, especially with features like RBAC. + +The existing Helm-based operator had limitations in handling complex installation requirements. A more capable operator was needed to manage the full lifecycle of Feast deployments on Kubernetes. + +## Decision + +Build a **Go Operator** using the `operator-sdk` framework with a cluster-scoped controller and a namespaced `FeatureStore` Custom Resource Definition (CRD). 
+ +### FeatureStore CRD + +The operator manages Feast through a single CRD that defines the entire feature store deployment: + +```yaml +apiVersion: feast.dev/v1alpha1 +kind: FeatureStore +metadata: + name: example + namespace: feast +spec: + feastProject: my-project + auth: + kubernetes: + roles: [reader, writer] + services: + registry: + replicas: 1 + persistence: + file: + pvc: + capacity: 5Gi + onlineStore: + replicas: 2 + persistence: + postgresql: + secretRef: online-store-creds + offlineStore: + replicas: 3 + feastApplyJob: + configMapRef: feast-definitions +``` + +### Architecture + +- **Operator deploys Feast services** (Registry, Online Store, Offline Store) as defined in the CR. +- **Operator generates `feature_store.yaml`** from the CR spec, including only relevant sections for each server type. +- **Client ConfigMap** is created automatically for remote connectivity. +- **`feast apply` Job** can be triggered from a ConfigMap or Git repo to initialize the registry. +- **CR status.applied** is the single source of truth for the deployed state. + +### Key Decisions + +- **Go over Python**: Go is better suited for Kubernetes operators. Python is great for ML work but not for cloud-native infrastructure management. +- **Single CRD** (`FeatureStore`) instead of separate CRDs per service type. All services are part of a functioning Feature Store and should be managed together. +- **Operator manages `feature_store.yaml`** entirely to ensure consistency and validation (e.g., `remote` types are only used where appropriate). +- **Data store deployments are out of scope**: The operator assumes data stores are pre-provisioned and accessible. +- **Deprecation of Helm-based operator**: The existing Helm-based operator is deprecated in favor of the Go operator. + +### Persistence Options + +- **Default**: Ephemeral file-based stores. +- **File with PVC**: For clusters supporting persistent volumes. +- **PostgreSQL**: Via referenced Kubernetes secrets for credentials. 
+ +## Consequences + +### Positive + +- Simplified, standardized deployment of Feast on Kubernetes. +- Full lifecycle management including RBAC, metrics, and feature store initialization. +- Supports multiple Feature Store deployments in a single cluster without conflict. +- Proper validation and consistency enforcement through the operator reconciliation loop. +- Deployable with kustomize; compatible with OLM and OperatorHub. + +### Negative + +- Requires Kubernetes as the deployment platform. +- Data store management is left to users (intentionally out of scope). +- Initial release supports limited persistence backends; additional stores added incrementally. + +## References + +- Original RFC: [Feast RFC-042: Operator](https://docs.google.com/document/d/1vGKMizf3_14IyiF_W_Ik7CR03joFkQfzbKT0jH4PZJM/edit) +- GitHub Issue: [#4561](https://github.com/feast-dev/feast/issues/4561) +- Implementation: `infra/feast-operator/` diff --git a/docs/adr/ADR-0007-unified-feature-transformations.md b/docs/adr/ADR-0007-unified-feature-transformations.md new file mode 100644 index 00000000000..86e77c0a73c --- /dev/null +++ b/docs/adr/ADR-0007-unified-feature-transformations.md @@ -0,0 +1,91 @@ +# ADR-0007: Unified Feature Transformations and Feature Views + +## Status + +Accepted + +## Context + +In Feast, the `OnDemandFeatureView` name did not clearly convey that transformations execute at read time. The term "On Demand" was ambiguous about when the transformation occurs. Additionally, there were multiple feature view types (`FeatureView`, `BatchFeatureView`, `StreamFeatureView`, `OnDemandFeatureView`) with: + +- **Excessive logic** handling each type throughout the codebase (e.g., `FeatureStore.apply()`, `get_online_features`, `get_historical_features`). +- **Redundant fields** across the different feature view classes. +- **Unclear transformation timing**: when transformations occur, where they execute, and how materialization works varied by type. 
| Type | When Transformation Occurs | Where | Materialization | +|------|---------------------------|-------|-----------------| +| FeatureView | Undefined | Outside Feast | Feature Server or Batch | +| BatchFeatureView | Batch process | Offline Store (external) | Feature Server or Batch | +| StreamFeatureView | Streaming process | Stream Processor (external) | Stream Processor | +| OnDemandFeatureView | On Read | Feature Server | Feature Server | +| OnDemandFeatureView (writes) | On Write | Feature Server | Feature Server | + +## Decision + +Unify Batch, Streaming, and On-Demand feature views into a single `FeatureView` class with a `@transform` decorator that makes execution timing explicit. + +### Transformation Types + +```python +from enum import Enum + +class FeastTransformation(Enum): + NONE = 0 # No transformations (default) + ON_READ = 1 # Current On Demand Feature View behavior + ON_WRITE = 2 # Transformations at write time + BATCH = 3 # Batch processing transformations + STREAMING = 4 # Stream transformations +``` + +### Proposed API + +```python +@transform( + type=FeastTransformation.ON_WRITE, + schema=[...], + entity=[...], + sources=[...], + mode="python", # pandas, substrait, etc. + engine="feature_server", # Spark, Snowflake, etc. + orchestrator=None, # Airflow, KFP, etc. +) +def my_feature_view(inputs): + outputs = { + "my_feature": [v * 1.0 for v in inputs["input_feature_1"]], + } + return outputs +``` + +### Key Decisions + +- **Single class** rather than defining a V2 class. A breaking change in stages is preferred to avoid rework for Java and Go servers. +- **Explicit transformation timing** via enum rather than implicit behavior based on class type. +- **Staggered release**: Ship a version supporting both old and new APIs with deprecation logging, then ship a breaking version. +- **Five clear primitives**: Entities, DataSources, Fields, FeatureViews, and FeatureServices. 
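Explicit timing makes routing straightforward. A toy dispatcher over the proposed enum (illustrative only, not Feast internals; the enum is repeated so the sketch is self-contained):

```python
from enum import Enum

class FeastTransformation(Enum):  # as proposed above
    NONE = 0
    ON_READ = 1
    ON_WRITE = 2
    BATCH = 3
    STREAMING = 4

def apply_transform(timing, udf, rows):
    """Run the UDF inline for read/write-time timings; pass stored values through otherwise."""
    if timing is FeastTransformation.NONE:
        return rows  # stored values served untouched
    if timing in (FeastTransformation.ON_READ, FeastTransformation.ON_WRITE):
        return udf(rows)  # feature server executes the UDF
    # BATCH / STREAMING would be handed off to an external engine.
    raise NotImplementedError(f"{timing.name} runs on an external engine")

doubled = apply_transform(
    FeastTransformation.ON_READ,
    lambda rows: {k: [v * 2 for v in vs] for k, vs in rows.items()},
    {"input_feature_1": [1, 2, 3]},
)
# doubled: {'input_feature_1': [2, 4, 6]}
```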
+ +### Current Implementation Status + +The codebase currently uses a `transformation()` decorator and `TransformationMode` enum (with modes like PANDAS, PYTHON, SPARK, RAY, SQL, SUBSTRAIT) in `sdk/python/feast/transformation/base.py`. The legacy `OnDemandFeatureView`, `StreamFeatureView`, and `BatchFeatureView` classes still exist during the migration period. + +## Consequences + +### Positive + +- Clearer, more explicit API that makes transformation timing obvious. +- Removes excessive handling of each feature view type throughout the codebase. +- Eliminates redundant field definitions across multiple classes. +- Establishes five clear primitives for the Feast data model. +- FeatureViews can declare other FeatureViews as data sources, enabling computational graphs. + +### Negative + +- Requires a migration period with both old and new APIs supported. +- Breaking change that needs careful coordination across Python, Java, and Go components. +- Users must update existing feature view definitions during migration. + +## References + +- Original RFC: [Feast RFC-043: Unify Feature Transformations and Feature Views](https://docs.google.com/document/d/1KXCXcsXq1bUvbSpfhnUjDSsu4HpuUZ5XiZoQyltCkvo/edit) +- GitHub Issue: [#4584](https://github.com/feast-dev/feast/issues/4584) +- Related RFCs: RFC-021 (On-Demand Transformations), RFC-036 (Stream Transformations) +- Implementation: `sdk/python/feast/transformation/base.py` diff --git a/docs/adr/ADR-0008-feature-view-versioning.md b/docs/adr/ADR-0008-feature-view-versioning.md new file mode 100644 index 00000000000..f6f63922077 --- /dev/null +++ b/docs/adr/ADR-0008-feature-view-versioning.md @@ -0,0 +1,129 @@ +# ADR-0008: Feature View Versioning + +## Status + +Accepted + +## Context + +When a feature view's schema changed in Feast, the old definition was silently overwritten. This created several problems: + +1. **No audit trail**: Teams couldn't answer "what did this feature view look like last week?" 
or "who changed the schema and when?" +2. **No safe rollback**: If a schema change broke a downstream model, there was no way to revert without manually reconstructing the previous definition. +3. **No multi-version serving**: During migrations, teams often need to serve both old and new schemas simultaneously (e.g., model A uses v1 features, model B uses v2 features). This required creating entirely separate feature views. + +## Decision + +Add automatic version tracking to Feast feature views. Every time `feast apply` detects a schema or UDF change, a versioned snapshot is saved to the registry. + +### Core Concepts + +- **Version number**: Auto-incrementing integer (v0, v1, v2, ...) for each schema-significant change. +- **Version snapshot**: Serialized copy of the full feature view proto stored in the registry. +- **Version pin**: Setting `version="v2"` on a feature view replaces the active definition with the v2 snapshot. +- **Version-qualified ref**: The `@v` syntax (e.g., `driver_stats@v2:trips_today`) for reading from a specific version. + +### What Triggers a New Version + +Only schema and UDF changes create new versions: + +- Adding, removing, or retyping feature columns. +- Changing entities or entity columns. +- Changing UDF code (StreamFeatureView, OnDemandFeatureView). + +Metadata-only changes (description, tags, owner, TTL) update the active definition in place without creating a version. + +### Version History Is Always-On + +Version history tracking is lightweight registry metadata (serialized proto snapshots). There is no performance cost to the online path. Every `feast apply` that changes a feature view records a version snapshot, and the history is queryable via: + +- `feast feature-views list-versions <name>` on the CLI. +- `registry.list_feature_view_versions(name, project)` programmatically. +- `registry.get_feature_view_by_version(name, project, version_number)` for a specific snapshot. 
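The trigger rule above fits in a few lines. A toy sketch of the versioning decision (illustrative data structures, not Feast's registry code):

```python
def apply_view(registry, name, schema, udf_hash, metadata):
    """Append a snapshot only when schema or UDF changed; metadata-only changes update in place."""
    versions = registry.setdefault(name, [])
    if versions and (schema, udf_hash) == (versions[-1]["schema"], versions[-1]["udf_hash"]):
        versions[-1]["metadata"] = metadata  # metadata-only change: no new version
    else:
        versions.append({
            "version": f"v{len(versions)}",
            "schema": schema,
            "udf_hash": udf_hash,
            "metadata": metadata,
        })
    return versions[-1]["version"]

registry = {}
apply_view(registry, "driver_stats", ("trips_today",), "h1", {"owner": "a"})  # creates v0
apply_view(registry, "driver_stats", ("trips_today",), "h1", {"owner": "b"})  # metadata only, still v0
latest = apply_view(
    registry, "driver_stats", ("trips_today", "rating"), "h1", {"owner": "b"}
)  # schema change creates v1
```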
+ +### Online Versioning Is Opt-In + +Version-qualified reads from separate online store tables are gated behind a config flag: + +```yaml +registry: + path: data/registry.db + enable_online_feature_view_versioning: true +``` + +When enabled, `driver_stats@v2:trips_today` reads from a version-specific table (`project_driver_stats_v2`). When disabled (default), using `@v` refs raises a clear error. + +### Version Pinning + +```python +driver_stats = FeatureView( + name="driver_stats", + entities=[driver], + schema=[...], + source=my_source, + version="v2", # revert to v2's definition +) +``` + +Safety: The user's definition (minus the version field) must match the currently active definition. If both schema and version pin are changed, `feast apply` raises `FeatureViewPinConflict`. + +### Staged Publishing (`--no-promote`) + +The `--no-promote` flag saves a version snapshot without updating the active definition, enabling phased rollouts: + +```bash +# Stage the new version +feast apply --no-promote + +# Populate the v2 online table +feast materialize --views driver_stats --version v2 ... + +# Migrate consumers one at a time (using @v2 refs) + +# Promote v2 as the default +feast apply +``` + +### Materialization + +Each version's data lives in its own online store table. By default, `feast materialize` targets the active version. A `--version` flag targets specific versions: + +```bash +feast materialize --views driver_stats --version v1 2024-01-01T00:00:00 2024-01-15T00:00:00 +``` + +### Concurrency + +- **SQL registry**: Unique constraint on `(feature_view_name, project_id, version_number)` with retry logic for auto-increment races. +- **File registry**: Last-write-wins (pre-existing limitation). + +### Limitations + +- Online store coverage: Version-qualified reads are only on SQLite initially. +- Offline store versioning is out of scope. +- No mechanism to prune old versions. +- Cross-version joins in `get_historical_features` are not supported. 
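The version-qualified ref and table naming used throughout this design are simple to pin down; an illustrative parser (formats assumed from the examples above, not Feast's implementation):

```python
def parse_feature_ref(ref):
    """Split 'view[@version]:feature'; version is None when the ref is unpinned."""
    view_part, feature = ref.split(":", 1)
    view, _, version = view_part.partition("@")
    return view, (version or None), feature

def online_table(project, view, version):
    """Version-specific online table name, e.g. project_driver_stats_v2."""
    return f"{project}_{view}_{version}"

print(parse_feature_ref("driver_stats@v2:trips_today"))  # ('driver_stats', 'v2', 'trips_today')
```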
+ +## Consequences + +### Positive + +- Full audit trail of schema changes for every feature view. +- Safe rollback capability through version pinning. +- Multi-version serving enables gradual migrations without creating duplicate feature views. +- Always-on history tracking with zero performance cost to the online path. +- Staged publishing supports safe, phased rollouts of breaking changes. + +### Negative + +- Version-qualified online reads are initially limited to SQLite. +- Offline versioning is not supported, creating a gap for reproducing historical training data. +- No version pruning mechanism may lead to unbounded growth in long-lived feature views. +- Concurrency handling differs between SQL and file registries. + +## References + +- Original RFC: [Feature View Versioning](https://docs.google.com/document/d/1OE-S-10kdBwxWHG4TI_zdg_VAQNST38IkSVmQkCfjeQ/edit) +- Pull Request: [#6101](https://github.com/feast-dev/feast/pull/6101) +- Implementation: `sdk/python/feast/feature_view.py` (version fields), `docs/adr/feature-view-versioning.md` diff --git a/docs/adr/ADR-0009-contribution-extensibility.md b/docs/adr/ADR-0009-contribution-extensibility.md new file mode 100644 index 00000000000..4e5c94a01de --- /dev/null +++ b/docs/adr/ADR-0009-contribution-extensibility.md @@ -0,0 +1,87 @@ +# ADR-0009: Contribution and Extensibility Architecture + +## Status + +Accepted + +## Context + +A design goal for Feast is that it should be extensible and easy to use with different technologies (storage, compute, deployment environments). After the launch of Feast 0.10, community interest grew in adding support for new online stores (Dynamo, Redis, Cassandra, HBase) and custom compute engines (Dataflow, Flink). + +However, several problems existed: + +1. **No decoupled interfaces**: Online stores were not decoupled from providers, so new online store contributions required building entire new providers. +2. 
**No contrib path**: Contributors had no way to extend the core codebase with experimental code while benefiting from the test suite. +3. **No plugin system**: No clearly defined plugin points for Providers, Offline Stores, Online Stores, and Compute, where code could live outside the Feast codebase. + +## Decision + +Introduce a three-tier extensibility architecture: **Interfaces**, **Contrib**, and **Plugins**. + +### Interfaces + +Create abstract base classes for `OnlineStore`, `OfflineStore`, and `Provider` so that different providers can reuse functionality without reimplementing it: + +``` +Provider (top-level orchestrator) +├── OnlineStore (abstract) +├── OfflineStore (abstract) +└── Compute (future) +``` + +### Contrib Module + +Add a `contrib` module to the Feast SDK for community-contributed implementations: + +``` +feast/ +└── contrib/ + ├── compute/ + │ └── spark.py + ├── offline_stores/ + │ └── postgres.py + ├── online_stores/ + │ ├── cassandra.py + │ └── hbase.py + └── providers/ + └── azure.py +``` + +Contrib implementations are referenced by classpath in `feature_store.yaml`: + +```yaml +online_store: + type: feast.contrib.online_stores.hbase.HbaseOnlineStore +``` + +Each contrib module follows a convention: a `*Config` class for configuration and a `*Test` class for setup/teardown of test infrastructure (e.g., Docker containers). Contrib code is covered by CI but failures produce warnings only. + +### Plugins + +External Python packages can be imported and used from within Feast without merging code upstream: + +```yaml +provider: + type: my_company_feast.MyCompanyFeastProvider +``` + +The key difference: contrib code is covered by Feast's test suite; external plugins are not. + +## Consequences + +### Positive + +- Enabled a large ecosystem of community-contributed stores (Cassandra, HBase, Postgres, Spark, Trino, etc.). +- Teams can extend Feast without forking or merging code upstream. 
+- Clear separation between core, community-contributed, and external plugin code. +- Consistent testing patterns across all contrib implementations. + +### Negative + +- Contrib code may become unmaintained if original contributors disengage. +- Plugin interface requires careful versioning to avoid breaking external implementations. + +## References + +- Original RFC: [Feast RFC-014: Contribution Plan](https://docs.google.com/document/d/1MD0aS2_hGzd1tJ7DNjE3NgEtcuekh3O06OeQ9aavylY/edit) +- Implementation: `sdk/python/feast/infra/online_stores/`, `sdk/python/feast/infra/offline_stores/`, `sdk/python/feast/infra/contrib/` diff --git a/docs/adr/ADR-0010-vector-database-integration.md b/docs/adr/ADR-0010-vector-database-integration.md new file mode 100644 index 00000000000..e5a40e22c8e --- /dev/null +++ b/docs/adr/ADR-0010-vector-database-integration.md @@ -0,0 +1,89 @@ +# ADR-0010: Vector Database Integration for LLM/RAG Support + +## Status + +Accepted + +## Context + +Feast is an abstraction layer for ML infrastructure that integrates with diverse online and offline data sources. With the rise of Large Language Model (LLM) applications, particularly Retrieval Augmented Generation (RAG), there was a need to support: + +- Transforming document data into embeddings (features). +- Loading embeddings into vector-capable databases (online stores). +- Retrieving the most similar documents given a query embedding at serving time. + +These capabilities align naturally with Feast's existing concepts of feature views, materialization, and online serving, but required a new retrieval interface for similarity search. + +## Decision + +Extend Feast's online store interface with a `retrieve_online_documents` method that performs approximate nearest neighbor (ANN) search. + +### Core Design + +Treat embeddings/vectors as features within existing feature views. 
Add a new retrieval interface to online stores: + +```python +class OnlineStore: + def retrieve_online_documents( + self, + config: RepoConfig, + table: FeatureView, + requested_feature: str, + embedding: List[float], + top_k: int, + distance_metric: Optional[str] = None, + ) -> List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]]: + ... +``` + +### Supported Stores + +Online stores that implement vector search: + +- **PostgreSQL with pgvector**: ANN search using HNSW and IVFFlat indexes. +- **Elasticsearch**: Vector similarity search with hybrid search capabilities. +- **Milvus**: Dedicated vector database with large-scale ANN support. +- **Qdrant**: Vector similarity search engine. +- **SQLite with sqlite-vec**: Lightweight local vector search. + +### Usage + +```python +from feast import FeatureStore + +store = FeatureStore(".") + +# Retrieve top-k similar documents +results = store.retrieve_online_documents( + feature="document_embeddings:embedding", + query=query_embedding, + top_k=5, +) +``` + +### Key Decisions + +- **Embeddings as features**: Rather than introducing a new primitive, embeddings are stored as features in existing feature views. This reuses Feast's materialization, versioning, and serving infrastructure. +- **Interface on OnlineStore**: The `retrieve_online_documents` method is added to the `OnlineStore` interface, allowing each store implementation to use its native vector search capabilities. +- **Incremental store support**: Not all online stores support vector search. Stores that don't implement the method raise a clear error. New stores are added based on community demand and contributions. + +## Consequences + +### Positive + +- Feast naturally extends from MLops to LLMops/RAG use cases. +- Reuses existing Feast concepts (feature views, materialization, online stores) without introducing new primitives. +- Multiple vector database backends supported, giving users flexibility. 
+- RAG applications can use Feast as a unified feature and document store. + +### Negative + +- Vector search capabilities vary significantly across stores (e.g., hybrid search in Elasticsearch vs. pure ANN in others). Feast's interface targets the lowest common denominator. +- Embedding pipeline (encoding documents into vectors) is not fully managed by Feast; users handle this externally. + +## References + +- Original RFC: [Feast RFC-040: Document Store / LLM Extension](https://docs.google.com/document/d/18IWzLEA9i2lDWnbfbwXnMCg3StlqaLVI-uRpQjr_Vos/edit) +- GitHub Issue: [#3965](https://github.com/feast-dev/feast/issues/3965) +- Implementation: Online store implementations in `sdk/python/feast/infra/online_stores/` +- Examples: `examples/rag/`, `examples/online_store/pgvector_tutorial/`, `examples/online_store/milvus_tutorial/` diff --git a/docs/adr/ADR-0011-data-quality-monitoring.md b/docs/adr/ADR-0011-data-quality-monitoring.md new file mode 100644 index 00000000000..18318f9964a --- /dev/null +++ b/docs/adr/ADR-0011-data-quality-monitoring.md @@ -0,0 +1,90 @@ +# ADR-0011: Data Quality Monitoring + +## Status + +Accepted + +## Context + +Data quality issues can significantly impact ML model performance. Several complex data problems needed to be addressed: + +- **Data consistency**: New training datasets can differ significantly from previous datasets, potentially requiring changes in model architecture. +- **Upstream pipeline bugs**: Bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store. +- **Training/serving skew**: Distribution shift between training and serving data can decrease model performance. + +Feast needed a mechanism to validate data at retrieval time to catch these issues before they affect model training or serving. 
+ +## Decision + +Introduce a Data Quality Monitoring (DQM) module that validates datasets against user-curated rules, initially targeting historical retrieval (training dataset generation). + +### Design + +The validation process uses a **reference dataset** and a **profiler** pattern: + +1. User prepares a reference dataset (saved from a known-good historical retrieval). +2. User defines a profiler function that produces a profile (set of expectations) from a dataset. +3. Validation is performed by comparing the tested dataset against the reference profile. + +### Integration with Great Expectations + +The initial implementation uses [Great Expectations](https://greatexpectations.io/) as the validation engine: + +```python +from feast.dqm.profilers.ge_profiler import ge_profiler +from great_expectations.dataset import Dataset +from great_expectations.core.expectation_suite import ExpectationSuite + +@ge_profiler +def my_profiler(dataset: Dataset) -> ExpectationSuite: + dataset.expect_column_max_to_be_between("column", 1, 2) + dataset.expect_column_values_to_not_be_null("important_feature") + return dataset.get_expectation_suite() +``` + +### Usage + +Validation is triggered during historical feature retrieval via a `validation_reference` parameter: + +```python +from feast import FeatureStore + +store = FeatureStore(".") + +job = store.get_historical_features(...) +df = job.to_df( + validation_reference=store + .get_saved_dataset("my_reference_dataset") + .as_reference(profiler=my_profiler) +) +``` + +If validation fails, a `ValidationFailed` exception is raised with details for all expectations that didn't pass. If validation succeeds, the materialized dataset is returned normally. + +### Key Decisions + +- **Profiler-based approach**: Users define their own validation rules via profiler functions rather than Feast prescribing fixed validation rules. 
+- **Great Expectations integration**: Leverages an established data validation framework rather than building custom validation logic. +- **Validation at retrieval time**: Validation is performed when datasets are materialized (`.to_df()` or `.to_arrow()`), not during ingestion. +- **ValidationReference as a registry object**: Saved datasets and their validation references are stored in the Feast registry for reuse. + +## Consequences + +### Positive + +- Users can detect data quality issues before they affect model training. +- Flexible profiler pattern allows custom validation rules per use case. +- Integration with Great Expectations provides a rich set of built-in expectations. +- Reference datasets provide a baseline for detecting data drift. + +### Negative + +- Currently limited to historical retrieval; online store write/read validation is planned but not yet implemented. +- Dependency on Great Expectations adds to the install footprint (optional via `feast[ge]`). +- Automatic profiling capabilities are limited; manual expectation crafting is recommended. + +## References + +- Original RFC: Feast RFC-027: Data Quality Monitoring (Google Drive shortcut no longer accessible) +- Implementation: `sdk/python/feast/dqm/`, `sdk/python/feast/saved_dataset.py` +- Documentation: [Data Quality Monitoring](../reference/dqm.md) diff --git a/docs/adr/ADR-TEMPLATE.md b/docs/adr/ADR-TEMPLATE.md new file mode 100644 index 00000000000..084dc3d20f9 --- /dev/null +++ b/docs/adr/ADR-TEMPLATE.md @@ -0,0 +1,31 @@ +# ADR-XXXX: Title + +## Status + +Proposed | Accepted | Deprecated | Superseded + +## Context + +Describe the context and problem statement. What is the issue that motivated this decision? + +## Decision + +Describe the decision that was made. Include any relevant design details, API examples, or architecture diagrams. + +## Consequences + +### Positive + +- List positive outcomes of this decision. + +### Negative + +- List any trade-offs or negative outcomes. 
+ +### Neutral + +- List any neutral observations. + +## References + +- Link to the original RFC, GitHub issues, or pull requests. diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 00000000000..49e20f81885 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,37 @@ +# Architecture Decision Records (ADR) + +This directory contains Architecture Decision Records (ADRs) for the Feast project. ADRs document significant architectural decisions made during the development of Feast, providing context, rationale, and consequences for each decision. + +## What is an ADR? + +An Architecture Decision Record captures a single architectural decision, including the context in which it was made, the decision itself, and the expected consequences. ADRs serve as a historical record for current and future contributors to understand why the project is structured the way it is. + +## ADR Index + +| ADR | Title | Status | Original RFC | +|-----|-------|--------|-------------| +| [ADR-0001](ADR-0001-feature-services.md) | Feature Services | Accepted | RFC-015 | +| [ADR-0002](ADR-0002-component-refactor.md) | Component Refactor | Accepted | RFC-020 | +| [ADR-0003](ADR-0003-on-demand-transformations.md) | On-Demand Transformations | Accepted | RFC-021 | +| [ADR-0004](ADR-0004-entity-join-key-mapping.md) | Entity Join Key Mapping | Accepted | RFC-023 | +| [ADR-0005](ADR-0005-stream-transformations.md) | Stream Transformations | Accepted | RFC-036 | +| [ADR-0006](ADR-0006-kubernetes-operator.md) | Kubernetes Operator | Accepted | RFC-042 | +| [ADR-0007](ADR-0007-unified-feature-transformations.md) | Unified Feature Transformations and Feature Views | Accepted | RFC-043 | +| [ADR-0008](ADR-0008-feature-view-versioning.md) | Feature View Versioning | Accepted | Feature View Versioning RFC | +| [ADR-0009](ADR-0009-contribution-extensibility.md) | Contribution and Extensibility Architecture | Accepted | RFC-014 | +| [ADR-0010](ADR-0010-vector-database-integration.md) | 
Vector Database Integration for LLM/RAG Support | Accepted | RFC-040 | +| [ADR-0011](ADR-0011-data-quality-monitoring.md) | Data Quality Monitoring | Accepted | RFC-027 | + +## Creating a New ADR + +1. Copy the [ADR template](ADR-TEMPLATE.md) to a new file with the next sequential number. +2. Fill in all sections of the template. +3. Submit a pull request with the new ADR. +4. Once the RFC is finalized and approved, update the ADR status to "Accepted". + +## ADR Statuses + +- **Proposed**: The decision is under discussion. +- **Accepted**: The decision has been accepted and is being (or has been) implemented. +- **Deprecated**: The decision is no longer relevant due to changes in the project. +- **Superseded**: The decision has been replaced by a newer ADR. diff --git a/docs/rfcs/feature-view-versioning.md b/docs/adr/feature-view-versioning.md similarity index 100% rename from docs/rfcs/feature-view-versioning.md rename to docs/adr/feature-view-versioning.md diff --git a/docs/project/contributing.md b/docs/project/contributing.md index cded378951d..d79291b9aa8 100644 --- a/docs/project/contributing.md +++ b/docs/project/contributing.md @@ -22,9 +22,22 @@ PRs that are submitted by the general public need to be identified as `ok-to-tes See also [Making a pull request](development-guide.md#making-a-pull-request) for other guidelines on making pull requests in Feast. +## RFCs and Architecture Decision Records + +For substantial changes (new features, architecture changes, removing features), we use an RFC process. See the [governance document](../../community/governance.md#rfcs-process) for details. + +Once an RFC is finalized and approved, it should be recorded as an Architecture Decision Record (ADR) in the [`docs/adr/`](../adr/README.md) directory. This ensures that architectural decisions are version-controlled alongside the codebase and easily accessible to all contributors. + +To add a finalized RFC as an ADR: + +1. 
Copy the [ADR template](../adr/ADR-TEMPLATE.md) to a new file with the next sequential number. +2. Summarize the RFC's context, decision, and consequences. +3. Submit a pull request with the new ADR. + ## Resources - [Community](../community.md) for other ways to get involved with the community - [Development guide](development-guide.md) for tips on how to contribute - [Feast GitHub issues](https://github.com/feast-dev/feast/issues) to see what others are working on -- [Feast RFCs](https://drive.google.com/drive/u/0/folders/1msUsgmDbVBaysmhBlg9lklYLLTMk4bC3) for a folder of previously written RFCs \ No newline at end of file +- [Feast RFCs](https://drive.google.com/drive/u/0/folders/1msUsgmDbVBaysmhBlg9lklYLLTMk4bC3) for a folder of previously written RFCs +- [Architecture Decision Records](../adr/README.md) for documented architectural decisions \ No newline at end of file diff --git a/docs/reference/alpha-feature-view-versioning.md b/docs/reference/alpha-feature-view-versioning.md index fbfc733afc6..c9a0a998915 100644 --- a/docs/reference/alpha-feature-view-versioning.md +++ b/docs/reference/alpha-feature-view-versioning.md @@ -206,7 +206,7 @@ Version history tracking in the registry (listing versions, pinning, `--no-promo ## Full Details -For the complete design, concurrency semantics, and feature service interactions, see the [Feature View Versioning RFC](../rfcs/feature-view-versioning.md). +For the complete design, concurrency semantics, and feature service interactions, see the [Feature View Versioning RFC](../adr/feature-view-versioning.md). ## Naming Restrictions