Cymbal Shops, a thriving online retailer, found itself at a crossroads. Its existing e-commerce platform, while functional, was struggling to keep up with rapid growth and increasing customer demands. Additionally, a recent mandate from the board to implement generative AI capabilities added a new layer of complexity. The company's CTO recognized the need for a significant transformation to stay competitive and relevant in the fast-paced industry. The CTO has looked to Google Cloud to provide a solution to the organization's business and technology strategy going forward.
Cymbal Shops chose Google Cloud as its strategic partner for this transformative journey. The company was particularly drawn to Google Cloud's expertise in Gen AI, its commitment to open source technologies, and its robust suite of managed services.
In this gHack we'll help Cymbal bring this plan forward as much as possible, primarily focusing on the migration of the legacy application MySQL database to Spanner. From that point we'll create BigQuery analytics datasets to handle both historical and live data using federation, use advanced generative AI to enhance the existing dataset with descriptions and images, and finally implement semantic search on this dataset.
During the process we'll learn about:
- Cloud Spanner and how to migrate from a legacy database
- BigQuery for analytics workloads on federated data from Spanner
- Application integration for automating certain tasks
- GenAI capabilities on BigQuery
- Challenge 1: Migration
- Challenge 2: Federation
- Challenge 3: Automation
- Challenge 4: Generating text and embeddings
- Challenge 5: Semantic search
- Challenge 6: Generating images
- Basic knowledge of GCP
- Basic knowledge of Python
- Basic knowledge of SQL
- Access to a GCP environment
- Murat Eken
- Gino Filicetti
This first challenge is all about migrating the legacy database containing the historical store data to Cloud Spanner, an always-on database with virtually unlimited scale.
The legacy MySQL database has already been migrated to Google Cloud as a MySQL instance in your project. Find it and explore its details & contents. The user name and password for the database should be provided to you by your coach.
Create a new Spanner Enterprise edition instance called onlineboutique with 100 Processing Units in the same region as the MySQL instance. Then create a new Spanner database with the name ecom in the new instance. Once the Spanner database is created, migrate the ecom schema and data from the MySQL instance to the new Spanner database using the Spanner Migration Tool. Make sure to choose POC Migration as the migration type.
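If you prefer the command line over the console, the instance and database creation can be sketched roughly as follows with `gcloud`. The regional config below is an assumption; use the region of your MySQL instance.

```shell
# Create the Spanner Enterprise edition instance with 100 Processing Units.
# regional-us-central1 is a placeholder -- match the MySQL instance's region.
gcloud spanner instances create onlineboutique \
  --config=regional-us-central1 \
  --description="Online Boutique" \
  --edition=ENTERPRISE \
  --processing-units=100

# Create the ecom database in the new instance.
gcloud spanner databases create ecom --instance=onlineboutique
```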
Note
Once the migration starts, it will take roughly 15 minutes to complete. While the migration is running, you can already start looking into the next challenge, but you'll need the migration to be completed in order to complete the next challenge.
- There is a new Cloud Spanner instance, `onlineboutique`, with 100 Processing Units and a new database, `ecom`.
- All the data from the MySQL database `ecom` is migrated to the new Spanner database.
- Verify that the following tables with the corresponding row counts exist in the new Spanner database:

  | Table                | # Rows  |
  |----------------------|---------|
  | distribution_centers | 10      |
  | events               | 2438862 |
  | inventory_items      | 494254  |
  | order_items          | 182905  |
  | orders               | 125905  |
  | products             | 29120   |
  | users                | 100000  |
- If there's nothing mentioned in the instructions about a parameter, stick to the defaults (this applies to all of the challenges).
- In principle you could install the Spanner Migration Tool on any VM, but the Cloud Shell is the easiest option.
- The Spanner Migration Tool will need network access to the MySQL database, so you'll need to add the IP address of the Cloud Shell (or any other VM where you're running the tool from) to the authorized networks of the MySQL instance.
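As a sketch, the authorization step could look like this from Cloud Shell. The instance name `mysql-instance` is a placeholder; check the actual name in your project.

```shell
# Find the outbound IP address of the current Cloud Shell session.
IP=$(curl -s ifconfig.me)

# Add it to the authorized networks of the MySQL instance.
# Note: --authorized-networks replaces the existing list, so include any
# networks that are already authorized.
gcloud sql instances patch mysql-instance --authorized-networks="${IP}/32"
```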
Although you could also run analytics queries on Spanner, it's primarily an OLTP database. We'll export the historical application data to BigQuery for advanced analytics and as a foundation for GenAI use cases. Tight integrations between BigQuery and the source databases can then be used for unified querying across analytical and transactional data, and AI/ML integrations can be leveraged to enhance customer insights.
Create a new BigQuery dataset cymbal_analytics in the multi-region US. Create a new external (federated) dataset spanner_external_dataset linked to the Spanner ecom database you've created in the previous challenge.
Copy each table from the Spanner database to the newly created BigQuery dataset.
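With the external dataset in place, copying a table can be done with a simple CTAS statement in BigQuery; a sketch for one table (repeat, or script, for all seven):

```sql
-- Copies the products table from the Spanner external dataset into the
-- analytics dataset; run the same statement for the other six tables.
CREATE OR REPLACE TABLE `cymbal_analytics.products` AS
SELECT *
FROM `spanner_external_dataset.products`;
```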
- All the data from the Spanner database `ecom` is exported to the new BigQuery dataset.
- Verify that the following tables with the corresponding row counts exist in the new BigQuery dataset:

  | Table                | # Rows  |
  |----------------------|---------|
  | distribution_centers | 10      |
  | events               | 2438862 |
  | inventory_items      | 494254  |
  | order_items          | 182905  |
  | orders               | 125905  |
  | products             | 29120   |
  | users                | 100000  |
- Creating BigQuery datasets
- Creating Spanner external datasets in BigQuery
- Creating BigQuery tables from a query result
As all of the historical data has now been copied over to BigQuery, we can optimize the Spanner environment by deleting data that is more than a year old from the Spanner database, since it is only needed for analytics. However, Spanner limits the number of mutations that can be performed in a single transaction, so we'll have to batch and script the removal of the data. There are plenty of tools that could do this, but we'll use Application Integration for that purpose.
Application Integration is an Integration-Platform-as-a-Service (iPaaS) solution in Google Cloud that offers a comprehensive set of core integration tools to connect and manage the multitude of applications (Google Cloud services and third-party SaaS) and data required to support various business operations.
We'll use the integrationcli tool to configure the pipeline for removing the historical data from Spanner. Go ahead and follow the instructions to install it.
We've already prepared the Application Integration pipeline for you here; download it to the VM where you installed `integrationcli` and publish it. Use the same region as the Spanner instance, make sure to set the environment name for the scaffolding to dev, and grant permissions.
Once the Application Integration is available in the Google Cloud Console, open it and run it by clicking on the Test button, and choosing the Delete Order Items task (keep the default parameters).
Note
The pipeline will delete the rows in multiple batches asynchronously, give it a few minutes before verifying that all the historical data is removed.
- The Application Integration pipeline has been successfully run.
- All the historical data (anything older than `2024-01-31`) has been removed from the Spanner database.
- In principle you could install the `integrationcli` tool on any VM, but the Cloud Shell is the easiest option.
- If you've installed the tool on Cloud Shell, you can use the `--default-token` option to authenticate.
In this challenge we'll create enhanced product descriptions and text embeddings for the products table in BigQuery to prepare for semantic search.
Add the following two columns, `product_description` (STRING) and `product_description_embeddings` (ARRAY&lt;FLOAT64&gt;), to the `products` table in BigQuery. Using an LLM from BigQuery, generate product descriptions based on the product name, brand, category, department and retail_price information for at least 100 products and store them in the new `product_description` column.
Note
We're only generating the descriptions for 100 products, as doing it for the complete dataset would take too long.
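A sketch of the generation step, assuming you have already created a remote text model named `cymbal_analytics.text_model` over a Gemini endpoint (the model name and prompt wording are assumptions, not part of the lab):

```sql
-- Generate descriptions for 100 products; the generated text appears in
-- the ml_generate_text_llm_result column of the query result.
SELECT
  id,
  ml_generate_text_llm_result AS product_description
FROM ML.GENERATE_TEXT(
  MODEL `cymbal_analytics.text_model`,
  (
    SELECT
      id,
      CONCAT(
        'Write an appealing product description for: ', name,
        ' (brand: ', brand, ', category: ', category,
        ', department: ', department,
        ', price: ', CAST(retail_price AS STRING), ')') AS prompt
    FROM `cymbal_analytics.products`
    LIMIT 100
  ),
  STRUCT(256 AS max_output_tokens, TRUE AS flatten_json_output));
```

You can wrap a query like this in a `MERGE` or `UPDATE` to write the results back into the new column.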
Then, using an embeddings model from BigQuery, generate embeddings for the `product_description` column (for the 100 product descriptions that have been generated) and store them in the new `product_description_embeddings` column.
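The embedding step follows the same pattern; a sketch, assuming a remote embeddings model named `cymbal_analytics.embeddings_model` (a hypothetical name):

```sql
-- Generate embeddings for the rows that have a description; the vector
-- appears in the ml_generate_embedding_result column (ARRAY<FLOAT64>).
SELECT
  id,
  ml_generate_embedding_result AS product_description_embeddings
FROM ML.GENERATE_EMBEDDING(
  MODEL `cymbal_analytics.embeddings_model`,
  (
    SELECT id, product_description AS content
    FROM `cymbal_analytics.products`
    WHERE product_description IS NOT NULL
  ));
```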
- There are two new columns in the BigQuery `products` table: `product_description` and `product_description_embeddings`.
- The column `product_description` contains the LLM-generated product descriptions for at least 100 products.
- The column `product_description_embeddings` contains the embeddings for those product descriptions.
- Modifying table schemas in BigQuery
- Generating text using an LLM from BigQuery
- Generating text embeddings from BigQuery
In this challenge we'll use reverse ETL to transfer the product descriptions and embeddings over to Spanner. Once the information is available in Spanner we'll implement semantic search through Spanner's embeddings capabilities.
Before we start copying the data from BigQuery to Spanner, let's complete the product descriptions for the whole table. We've prepared a Workflow called `prep-semantic-search`; look it up and execute it with the parameter `embeddings_model_name` set to the name of the embeddings model that was created in the previous challenge. Once this workflow successfully completes, all products should have descriptions and embeddings.
Now, add the following two columns, `product_description` of type STRING(MAX) and `product_description_embeddings` of type ARRAY&lt;FLOAT64&gt;, to the `products` table in Spanner. Export the data from BigQuery to Spanner through reverse ETL, and create the same embeddings model in Spanner.
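A sketch of the schema change and the reverse-ETL export. The project ID is a placeholder, the exact `spanner_options` depend on your setup, and depending on how the export writes rows you may need to include more columns than shown here:

```sql
-- In Spanner (DDL):
ALTER TABLE products ADD COLUMN product_description STRING(MAX);
ALTER TABLE products ADD COLUMN product_description_embeddings ARRAY<FLOAT64>;

-- In BigQuery (reverse ETL export to Spanner):
EXPORT DATA OPTIONS (
  uri = 'https://spanner.googleapis.com/projects/YOUR_PROJECT/instances/onlineboutique/databases/ecom',
  format = 'CLOUD_SPANNER',
  spanner_options = '{"table": "products"}')
AS SELECT id, product_description, product_description_embeddings
FROM `cymbal_analytics.products`;
```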
Using the embeddings model in Spanner do a search for the query Luxury items for women and return the top 5 items.
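The search itself can be sketched in Spanner SQL, assuming the embeddings model registered in Spanner is called `embeddings_model` (a hypothetical name):

```sql
-- Embed the query text, then rank products by cosine distance between the
-- query embedding and the stored description embeddings.
SELECT p.id, p.name, p.department
FROM products AS p
ORDER BY COSINE_DISTANCE(
  p.product_description_embeddings,
  (SELECT embeddings.values
   FROM ML.PREDICT(
     MODEL embeddings_model,
     (SELECT 'Luxury items for women' AS content))))
LIMIT 5;
```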
- There are two new columns in the Spanner `products` table: `product_description` and `product_description_embeddings`.
- The column `product_description` contains the LLM-generated product descriptions for all products.
- The column `product_description_embeddings` contains the embeddings for the product descriptions for all products.
- The semantic search for *Luxury items for women* on Spanner returns the following products:

  | id    | name | department |
  |-------|------|------------|
  | 13981 | Super Soft Black 3m Insulated Leather Women's Gloves | Women |
  | 13896 | Black Fleur Di Lis Rhinestone Western Bling Belt | Women |
  | 13869 | WARMEN Luxury Gift Women Genuine Nappa Soft Leather Lined Long Elbow Gloves | Women |
  | 13967 | Women's Italian Leather Gloves Lined in Cashmere in Many Colors. Simple By Solo Classe | Women |
  | 14085 | Luxury Lane Women's Cashmere Lined Lambskin Leather Long Gloves in Black Chocolate or Tobacco | Women |
- Executing a Workflow
- Making schema updates in Spanner
- Exporting data to Spanner (reverse ETL)
- Text embeddings in Spanner
Now that we have product descriptions, we can generate images based on them using GenAI. This involves running a Vertex AI Pipeline to orchestrate a few steps: iterating over the data, preparing a prompt based on each product description, running Imagen to generate the images, picking one of the results at random, storing the image in Cloud Storage and recording the URI in BigQuery.
Create a new Cloud Storage bucket and name it {YOUR PROJECT ID}-images. Update the BigQuery products table to include image_uri and image_url columns.
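Both steps can be done from the command line; a sketch, with the `US` multi-region location as an assumption (to match the BigQuery dataset):

```shell
# Create the images bucket; the location is an assumption.
gcloud storage buckets create "gs://${GOOGLE_CLOUD_PROJECT}-images" --location=US

# Add the two new columns to the BigQuery products table.
bq query --use_legacy_sql=false \
  'ALTER TABLE cymbal_analytics.products
     ADD COLUMN image_uri STRING,
     ADD COLUMN image_url STRING'
```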
We've already prepared the code for a Vertex AI Pipeline; download it from here and install its dependencies. Then run the code; it will create a YAML file. Navigate to Vertex AI Pipelines, use Create run with that YAML file, provide the required parameters and submit the job.
- There's a new Cloud Storage bucket called `{YOUR PROJECT ID}-images`.
- The BigQuery `products` table has the new columns `image_uri` and `image_url`.
- The provided Vertex AI Pipeline has been successfully run.
- There are product images generated for the products and the corresponding URLs are in the BigQuery table.
