Cloud Blog: A practical guide to synthetic data generation with Gretel and BigQuery DataFrames

Source URL: https://cloud.google.com/blog/products/data-analytics/synthetic-data-generation-with-gretel-and-bigquery-dataframes/
Source: Cloud Blog
Title: A practical guide to synthetic data generation with Gretel and BigQuery DataFrames

Feedly Summary: In our previous post, we explored how integrating Gretel with BigQuery DataFrames streamlines synthetic data generation while preserving data privacy. To recap, BigQuery DataFrames is a Python client for BigQuery, providing pandas-compatible APIs with computations pushed down to BigQuery. Gretel offers a comprehensive toolbox for synthetic data generation using cutting-edge machine learning techniques, including large language models (LLMs). Together they form a streamlined workflow, allowing users to easily transfer data from BigQuery to Gretel and save the generated results back to BigQuery.
In this guide, we dive into the technical aspects of generating synthetic data to drive AI/ML innovation, while helping to ensure high data quality, privacy protection, and compliance with privacy regulations. We begin by working with a BigQuery patient records table, de-identifying the data in Part 1, and then generating synthetic data to save back to BigQuery in Part 2.


Setting the stage: Installation and configuration
You can start by using BigQuery Studio as the notebook runtime, which comes with BigFrames pre-installed. We assume you have a Google Cloud project set up and are familiar with pandas.
Step 1: Install the Gretel Python client and BigQuery DataFrames:

%%capture
!pip install -Uqq "gretel-client>=0.22.0"

# Install bigframes if not already installed:
# %%capture
# !pip install bigframes

Step 2: Initialize the Gretel SDK and BigFrames. You'll need a Gretel API key to access Gretel services; you can obtain one from the Gretel console.

from gretel_client import Gretel
from gretel_client.bigquery import BigFrames

import bigframes
import bigframes.pandas as bpd

gretel = Gretel(api_key="prompt", validate=True, project_name="bigframes-demo")

# This is the core interface we will use moving forward!
gretel_bigframes = BigFrames(gretel)

BIGQUERY_PROJECT = "gretel-vertex-demo"

# Set BigFrames options
bpd.options.display.progress_bar = None
bpd.options.bigquery.project = BIGQUERY_PROJECT

Part 1: De-identifying and processing data with Gretel Transform v2
Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel’s Transform v2 (Tv2) provides a powerful and scalable framework for this and various other data processing tasks. Tv2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, Tv2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. Learn more about Gretel Transform v2.
Step 1: Create a BigFrames DataFrame from your BigQuery table:

# Define the source project and dataset
project_id = "gretel-public"
dataset_id = "public"
table_id = "sample-patient-events"

# Construct the fully qualified table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Read the table into a BigFrames DataFrame and drop rows with missing values
df = bpd.read_gbq_table(table_path)
df = df.dropna()

# Display a preview of the DataFrame
df.peek()

The table below is a subset of the DataFrame we will transform. We hash the `patient_id` column and create replacement first and last names based on the value of the `sex` column.

patient_id     first_name  last_name  sex     race
pmc-6545753-1  Antonio     Fernandez  Male    Hispanic
pmc-6192350-1  Ana         Silva      Female  Other
pmc-6332555-4  Lina        Chan       Female  Asian
pmc-6089485-1  Omar        Hassan     Male    Black or African American
pmc-6100673-1  Aisha       Khan       Female  Asian

Step 2: Transform the data with Gretel:

# De-identification configuration
transform_config = """
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            update:
              - name: patient_id
                value: this | hash | truncate(10, end='')
              - name: first_name
                value: >
                  fake.first_name_female() if row.sex == 'Female' else
                  fake.first_name_male() if row.sex == 'Male' else
                  fake.first_name()
              - name: last_name
                value: fake.last_name()
"""

# Submit a transform job against the BigFrames DataFrame
transform_results = gretel_bigframes.submit_transforms(transform_config, df)

# Note the model ID; we can re-use it later to restore results.
model_id = transform_results.model_id

print(f"Gretel Model ID: {model_id}\n")
print(f"Gretel Console URL: {transform_results.model_url}")

transform_results.wait_for_completion()
transform_results.refresh()
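The model ID printed above is what makes the job recoverable: if your notebook session is interrupted, you can re-attach to the completed job rather than re-running it. A minimal sketch, assuming your gretel-client version exposes a fetch_transforms_results method on the BigFrames interface (verify the exact name in the Gretel BigFrames integration docs):

# Re-attach to a previously submitted transform job by its model ID.
# fetch_transforms_results is an assumption based on the Gretel
# BigFrames integration docs; verify against your gretel-client version.
restored_results = gretel_bigframes.fetch_transforms_results(model_id)
restored_results.refresh()
restored_results.transformed_df.peek()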

Step 3: Explore the de-identified data:

# Take a look at the newly transformed BigFrames DataFrame
transformed_df = transform_results.transformed_df
transformed_df.peek()
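To spot-check the transform, you can preview the same columns in both DataFrames; a quick sketch:

# Preview the identifying columns before and after the transform
cols = ["patient_id", "first_name", "last_name", "sex", "race"]
df[cols].peek()              # original
transformed_df[cols].peek()  # de-identified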

Below is a comparison of the original vs de-identified data.
Original:

patient_id     first_name  last_name  sex     race
pmc-6545753-1  Antonio     Fernandez  Male    Hispanic
pmc-6192350-1  Ana         Silva      Female  Other
pmc-6332555-4  Lina        Chan       Female  Asian
pmc-6089485-1  Omar        Hassan     Male    Black or African American
pmc-6100673-1  Aisha       Khan       Female  Asian

De-identified:

patient_id  first_name  last_name  sex     race
389b63f369  John        Hampton    Male    Hispanic
eff31024e6  Christine   Carlson    Female  Other
8af37475b6  Sarah       Moore      Female  Asian
7bd5f08fb8  Russell     Zhang      Male    Black or African American
1628622e23  Stacy       Wilkinson  Female  Asian

Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)
Gretel Navigator Fine Tuning (NavFT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:

Handles multiple data modalities: numeric, categorical, free text, time series, and JSON

Maintains complex relationships across data types and rows

Can introduce meaningful new patterns, potentially improving ML/AI task performance

Balances data utility with privacy protection

NavFT builds on Gretel Navigator’s capabilities, enabling the creation of synthetic data that captures the nuances of your specific data, including the distributions and correlations for numeric, categorical, and other column types, while leveraging the strengths of domain-specific pre-trained models. Learn more about Navigator Fine Tuning.
In this example, we will fine-tune a Gretel model on the de-identified data from Part 1.
Step 1: Fine-tune a model:

# Prepare the training configuration
base_config = "navigator-ft"  # Base configuration for training

# Define the generation parameters
generate_params = {
    "num_records": len(df),  # Number of records to generate
    "temperature": 0.7,      # Sampling temperature for data generation
}

# Submit the training job to Gretel
train_results = gretel_bigframes.submit_train(
    base_config=base_config,
    dataframe=transformed_df,
    job_label="synthetic_patient_data",
    generate=generate_params,
    group_training_examples_by="patient_id",  # Group training examples by patient
    order_training_examples_by="event_date",  # Order each patient's events chronologically
)

train_results.wait_for_completion()
train_results.refresh()
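As with the transform job, a completed fine-tuning job can be restored later from its model ID instead of re-training. A minimal sketch, assuming a fetch_train_job_results method on the BigFrames interface (check the exact name against your gretel-client version):

# Re-attach to a completed fine-tuning job by its model ID.
# fetch_train_job_results is an assumption based on the Gretel
# BigFrames integration docs; verify against your client version.
restored_train = gretel_bigframes.fetch_train_job_results(train_results.model_id)
restored_train.refresh()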

Step 2: Fetch the Gretel Synthetic Data Quality Report:

# Display the full report within this notebook
train_results.report.display_in_notebook()

The image below shows the high-level metrics from the Gretel Synthetic Data Quality Report. Please see the Gretel documentation for more details about how to interpret this report.

Step 3: Generate synthetic data from the fine-tuned model, evaluate data quality and privacy, and write back to a BQ table.

# Fetch the synthetically generated data
df_synth = train_results.fetch_report_synthetic_data()
df_synth.peek()

# Write to the destination table in BigQuery
df_synth.to_gbq(table_path)
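Note that table_path above still points at the source table from Part 1; in practice you will usually want to write the synthetic records to a separate destination. A sketch with a hypothetical dataset and table name (substitute your own):

# Hypothetical destination table; replace with your own dataset/table.
dest_table = f"{BIGQUERY_PROJECT}.synthetic_demo.patient_events_synth"

# if_exists="replace" overwrites the table on re-runs;
# use "append" to accumulate generations instead.
df_synth.to_gbq(dest_table, if_exists="replace")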

Below is a sample of the final synthetic data:

patient_id  first_name  last_name  date_of_birth
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
a8e410d3ff  Jacqueline  Smith      2016-07-15

sex     race      weight  height
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Female  Asian     89.0    48.0

event_id  event_type      event_date  event_name
1         Admission       01/21/2023  <NA>
2         Treatment       01/22/2023  IV Immunosuppression
3         Diagnosis Test  01/22/2023  Follow-up Examination
4         Discharge       01/26/2023  <NA>
1         Admission       07/15/2023  <NA>

provider_name       reason                          result
Dr. Angela Clinic   Elective right lower lobectomy  Transplant successful
Oral Health Center  Postoperative care              Stable with minimal side effects
Orthopedic Inst.    Routine check after surgery     No signs of infection or relapse
City Hospital ER    End of hospital stay            Stabilized with normal vitals
Main Hospital       Initial Checkup                 <NA>

details
{}
{"dosage": "Standard", "frequency": "Twice daily"}
{}
{"referral": "Outpatient clinic"}
{}

A few things to note about the synthetic data:

The various modalities (JSON structures, free text) are preserved and fully synthetic while being semantically correct.

Because of the group-by/order-by hyperparameters used during fine-tuning, the generated records are clustered on a per-patient basis; a quick sanity check is sketched below.
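One way to verify that clustering locally is to count runs of consecutive identical patient IDs and compare against the number of distinct patients; a minimal sketch (to_pandas() materializes the data client-side, so only do this for modestly sized tables):

# Materialize the synthetic data locally for a quick check
pdf = df_synth.to_pandas()

# A "run" starts wherever patient_id differs from the previous row;
# if records are clustered per patient, runs == distinct patient count.
runs = (pdf["patient_id"] != pdf["patient_id"].shift()).sum()
print(f"Contiguous runs: {runs}, distinct patients: {pdf['patient_id'].nunique()}")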

How to use BigQuery with Gretel
This technical guide provides a foundation for leveraging Gretel AI and BigQuery DataFrames to generate and utilize synthetic data. By following these examples and exploring the Gretel documentation, you can unlock the power of synthetic data to enhance your data science, analytics, and AI development workflows while ensuring data privacy and compliance.
To learn more about generating synthetic data with BigQuery DataFrames and Gretel, explore the following resources:

Gretel documentation

BigQuery DataFrames documentation

Overview and Architecture blog

GitHub code examples

Gretel BigFrames integration documentation 

Start generating your own synthetic data today and unlock the full potential of your data!

Googlers Firat Tekiner, Jeff Ferguson, and Sandeep Karmarkar contributed to this blog post. Many Googlers contributed to make these features a reality.

AI Summary and Description: Yes

Summary: The text outlines the integration of Gretel with BigQuery DataFrames for synthetic data generation, focusing on data privacy and compliance through techniques like de-identification and the use of Large Language Models (LLMs). It highlights the technical details involved in the setup, transformation, and generation of quality synthetic datasets that adhere to privacy regulations.

Detailed Description:
The content discusses an advanced methodology for synthetic data generation, emphasizing privacy, compliance, and integration with notable tools. Here are the major points of relevance for professionals in AI, cloud, and infrastructure security:

– **Synthetic Data Generation**: The process involves using Gretel to automate the generation of synthetic data from an existing BigQuery dataset. This is crucial in contexts where sensitive data handling is necessary while still allowing for data-driven decisions.

– **Integration with BigQuery**:
– BigQuery DataFrames serves as a flexible, pandas-compatible interface for working with data stored in Google Cloud, pushing computation down to BigQuery.
– The integration helps build an end-to-end workflow, facilitating data transfer and processing with ease.

– **Data Privacy and Compliance**:
– The initial steps emphasize the importance of de-identifying personally identifiable information (PII) to maintain compliance with regulations such as GDPR.
– Gretel’s Transform v2 and Navigator Fine Tuning (NavFT) are key technologies in this process, which allow for effective handling of sensitive data while retaining its analytical utility.

– **Use of LLMs**: The document highlights how advanced machine learning techniques, including LLMs, are used within Gretel. This supports the generation of synthetic data that captures complex relationships between data points.

– **Technical Workflow**:
– **De-identifying Data**: The guide captures the essential first step of de-identification, leveraging Gretel’s tools to ensure data is anonymized before any synthetic generation occurs.
– **Transformations and Reporting**: The use of structured code outlines how users can execute transformations on data, track changes, and evaluate the quality and privacy of the synthetic data generated.

– **Domain-Specific Outputs**: The synthetic data produced maintains the integrity of various modalities (e.g., JSON structures, numeric and categorical data), offering a versatile solution adaptable to numerous scenarios in AI and data science.

– **Real-World Applications**: The insights also touch on applications in healthcare data, demonstrating the relevance of such tools in sectors where compliance and privacy are particularly critical.

In conclusion, this guide not only illustrates how to generate synthetic data using Gretel and BigQuery but also emphasizes the imperative need for privacy and compliance in data management practices, aligning well with current trends in data security and informational governance. This is particularly significant for professionals tasked with safeguarding sensitive information while leveraging data for innovation.