The CREATE MODEL statement for remote models over LLMs

This document describes the CREATE MODEL statement for creating remote models in BigQuery over Vertex AI models.

`CREATE MODEL` syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
`project_id.dataset.model_name`
REMOTE WITH CONNECTION `project_id.region.connection_id`
OPTIONS(
  ENDPOINT = 'vertex_ai_llm_endpoint'
  [, PROMPT_COL = 'prompt_col']
  [, INPUT_LABEL_COLS = input_label_cols]
  [, MAX_ITERATIONS = max_iterations]
  [, LEARNING_RATE_MULTIPLIER = learning_rate_multiplier]
  [, DATA_SPLIT_METHOD = 'data_split_method']
  [, DATA_SPLIT_EVAL_FRACTION = data_split_eval_fraction]
  [, DATA_SPLIT_COL = 'data_split_col']
  [, EVALUATION_TASK = 'evaluation_task'])
[AS SELECT prompt_column, label_column FROM `project_id.dataset.table_name`]

`CREATE MODEL`

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

`CREATE MODEL IF NOT EXISTS`

Creates and trains a new model only if the model doesn't exist in the specified dataset.

`CREATE OR REPLACE MODEL`

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

`model_name`

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

Contain up to 1,024 characters
Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

`REMOTE WITH CONNECTION`

Syntax

`[PROJECT_ID].[LOCATION].[CONNECTION_ID]`

BigQuery uses a Cloud resource connection to interact with the Vertex AI endpoint.

The connection elements are as follows:

PROJECT_ID: the project ID of the project that contains the connection.
LOCATION: the location used by the connection. The connection must be in the same location as the dataset that contains the model.
CONNECTION_ID: the connection ID—for example, myconnection.
To find your connection ID, view the connection details in the Google Cloud console. The connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

If you are creating a remote model over an LLM that uses supervised tuning, you need to grant the Vertex AI Service Agent role to the connection's service account in the project where you create the model. Otherwise, you need to grant the Vertex AI User role to the connection's service account in the project where you create the model.

Example

`myproject.us.my_connection`

`ENDPOINT`

Syntax

ENDPOINT = 'vertex_ai_llm_endpoint'

Description

The model name of the target Vertex AI LLM.

BigQuery ML identifies and uses the full endpoint of the hosted LLM based on the model name that you specify. For example, https://us-central1-aiplatform.googleapis.com/v1/projects/your_project_id/locations/us-central1/publishers/google/models/text-embedding:predict. The resource portion of the URI is saved as the remote endpoint value in the model metadata. For example, projects/your_project_id/locations/us-central1/publishers/google/models/text-embedding.

Arguments

A STRING value that contains the model name of the target Vertex AI LLM. The following LLMs are supported:

Gemini API multimodal models

The following Gemini API multimodal models are supported:

gemini-1.5-flash-001 and gemini-1.5-flash-002: you can process text, image, audio, video, and PDF input by using this model.
gemini-1.5-pro-001 and gemini-1.5-pro-002: you can process text, image, audio, video, and PDF input by using this model.
gemini-1.0-pro-vision (Preview): you can process image and video input by using this model. You can specify the model name or the alias gemini-pro-vision as the endpoint value.

To always use the latest version of the model, specify the model name without any version information. For example, gemini-1.5-flash or gemini-1.5-pro.

You can't specify a version of the gemini-1.0-pro-vision model. The latest stable version is always used.

For more information on Gemini API model versioning, see Gemini model versions and lifecycle.

After you create a remote model based on the gemini-1.5-flash-001 or gemini-1.5-pro-001 models, you can use the model with the ML.GENERATE_TEXT function to analyze content in a BigQuery object table using a prompt you provide as a function argument, or to generate text based on a prompt you provide in a query or in a column in a standard table.

After you create a remote model based on a gemini-1.0-pro-vision model, you can use the model with the ML.GENERATE_TEXT function to analyze content in a BigQuery object table using a prompt you provide as a function argument.

Gemini API text models

The gemini-1.0-pro Vertex AI Gemini API text model is supported.

To specify a version for the model, do the following:

To use the latest stable release of the model, specify either gemini-1.0-pro or the alias gemini-pro.
To use a specific stable release of the model, specify gemini-1.0-pro and append -version to the model name. For example, gemini-1.0-pro-001.
To use the preview release of the model, specify gemini-1.0-pro and append -preview-release date to the model name. For example, gemini-1.0-pro-preview-0409. Preview versions of the model aren't generally available (GA) and aren't recommended for production use.

For more information on Vertex AI Gemini API model versioning, see Gemini model versions and lifecycle.

After you create a remote model based on a gemini-1.0-pro model, you can use the model with the ML.GENERATE_TEXT function to generate text based on a prompt you provide in a query or from a column in a standard table.

Claude models

The following Anthropic Claude models are supported:

claude-3-5-sonnet@20240620
claude-3-sonnet@20240229
claude-3-haiku@20240307
claude-3-opus@20240229

Although Claude models are multimodal, you can only use text input with Claude models in BigQuery ML.

After you create a remote model based on a Claude model, you can use the model with the ML.GENERATE_TEXT function to generate text based on a prompt you provide in a query or from a column in a standard table.

`multimodalembedding` embedding models

The multimodalembedding embedding model is supported. You must specify the 001 version of the model, multimodalembedding@001.

After you create a remote model based on a multimodalembedding embedding model, you can use the model with the ML.GENERATE_EMBEDDING function to generate embeddings from text data in a BigQuery table or from visual content in a BigQuery object table.

text embedding models

The following embedding models are supported:

text-embedding or the alias textembedding-gecko
text-multilingual-embedding or the alias textembedding-gecko-multilingual

You must specify a supported model version.

After you create a remote model based on an embedding model, you can use the model with the ML.GENERATE_EMBEDDING function to generate embeddings from text data in a BigQuery table.

PaLM API text models

The following Vertex AI PaLM API text models are supported:

text-bison
text-bison-32k
text-unicorn

You can specify a particular version of a text model by appending @version to the model name. For example, text-bison@version. Set the model version to the stable or latest version that is most appropriate for your use case. We recommend using the most recent stable version of a model for best performance. For information about text model version defaults, see Model versions.

After you create a remote model based on a PaLM API text model, you can use the ML.GENERATE_TEXT function to generate text based on a prompt you provide in a query or from a column in a standard table.

For information that can help you choose between the supported models, see Model information.

`PROMPT_COL`

Syntax

PROMPT_COL = 'prompt_col'

Description

The name of the prompt column in the training data table to use when performing supervised tuning. If you don't specify a value for this option, you must have a column named or aliased as prompt in your input data. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

A STRING value. The default value is prompt.

`INPUT_LABEL_COLS`

Syntax

INPUT_LABEL_COLS = input_label_cols

Description

The name of the label column in the training data table to use when performing supervised tuning. If you don't specify a value for this option, you must have a column named or aliased as label in your input data. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

A one-element ARRAY<STRING> value. The default value is an empty array.

`MAX_ITERATIONS`

Syntax

MAX_ITERATIONS = max_iterations

Description

The number of steps to run when performing supervised tuning. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

When you use a Gemini model, BigQuery ML automatically converts the MAX_ITERATIONS value to epochs, which is what Gemini models use for training. The default value for MAX_ITERATIONS is the number of rows in the input data, which is equivalent to one epoch. To use multiple epochs, specify a multiple of the number of rows in your training data. For example, if you have 100 rows of input data and you want to use two epochs, specify 200 for the argument value. If you provide a value that isn't a multiple of the number of rows in the input data, BigQuery ML rounds up to the nearest epoch. For example, if you have 100 rows of input data and you specify 101 for the MAX_ITERATIONS value, training is performed with two epochs.

For more information about the parameters that are used to tune Gemini models, see Create a tuning job.

For more guidance on choosing the number of epochs for Gemini models, see Recommended configurations.

Arguments

An INT64 value between 1 and ∞. Typically, 100 steps takes about an hour to complete. The default value is 300.

`LEARNING_RATE_MULTIPLIER`

Syntax

LEARNING_RATE_MULTIPLIER = learning_rate_multiplier

Description

A multiplier to apply to the recommended learning rate when performing supervised tuning. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

A positive FLOAT64 value. The default value is 1.0.

`DATA_SPLIT_METHOD`

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }

Description

The method used to split input data into training and evaluation sets when performing supervised tuning. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Training data is used to train the model. Evaluation data is used to avoid overfitting by using early stopping.

The percentage sizes of the data sets produced by the various arguments for this option are approximate. Larger input data sets come closer to the percentages described than smaller input data sets do.

You can see the model's data split information in the following ways:

The data split method and percentage are shown in the Training Options section of the model's Details page on the BigQuery page of the Google Cloud console.
Links to temporary tables that contain the split data are available in the Model Details section of the model's Details page on the BigQuery page of the Google Cloud console. You can also return this information from the DataSplitResult field in the BigQuery API. These tables are saved for 48 hours. If you need this information for more than 48 hours, then you should export this data or copy it to permanent tables.

Arguments

This option accepts the following values:

AUTO_SPLIT: This is the default value. This option splits the data as follows:
- If there are fewer than 500 rows in the input data, then all rows are used as training data.
- If there are more than 500 rows in the input data, then data is randomized and split as follows:
  - If there are between 500 and 50,000 rows in the input data, then 20% of the data is used as evaluation data and 80% is used as training data.
  - If there are more than 50,000 rows, then 10,000 rows are used as evaluation data and the remaining rows are used as training data.
RANDOM: Data is randomized before being split into sets. To customize the data split, you can use this option with the DATA_SPLIT_EVAL_FRACTION option. If you don't specify that option, data is split in the same way as for the AUTO_SPLIT option.

A random split is deterministic: different training runs produce the same split results if the same underlying training data is used.

Note: A random split is based on the FARM_FINGERPRINT of the data (including the column name and schema), so tables with the same content but different column names and schemas might get different splitting and different evaluation metrics.
CUSTOM: Split data using the value provided in the DATA_SPLIT_COL option. The DATA_SPLIT_COL value must be the name of a column of type BOOL. Rows with a value of TRUE or NULL are used as evaluation data, and rows with a value of FALSE are used as training data.
SEQ: Split data sequentially by using the value in a specified column of one of the following types:
- NUMERIC
- BIGNUMERIC
- STRING
- TIMESTAMP
The data is sorted smallest to largest based on the specified column.

The first n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The remaining rows are used as training data.

All rows with split values smaller than the threshold are used as training data. The remaining rows, including those with NULL values, are used as evaluation data.

Use the DATA_SPLIT_COL option option to identify the column that contains the data split information.
NO_SPLIT: No data split; all input data is used as training data.

`DATA_SPLIT_EVAL_FRACTION`

Syntax

DATA_SPLIT_EVAL_FRACTION = data_split_eval_fraction

Description

The fraction of the data to use as evaluation data when performing supervised tuning. Use when you specify RANDOM or SEQ as the value for the DATA_SPLIT_METHOD option. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

A FLOAT64 value in the range [0, 1.0]. The default is 0.2. The service maintains the accuracy of the input value to two decimal places.

`DATA_SPLIT_COL`

Syntax

DATA_SPLIT_COL = 'data_split_col'

Description

The name of the column to use to sort input data into the training or evaluation set when performing supervised tuning. Use when you are specifying CUSTOM or SEQ as the value for DATA_SPLIT_METHOD. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

If you are specifying SEQ as the value for DATA_SPLIT_METHOD, then the data is first sorted smallest to largest based on the specified column. The last n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The remaining rows are used as training data.

If you are specifying CUSTOM as the value for DATA_SPLIT_COL, then you must provide the name of a column of type BOOL. Rows with a value of TRUE or NULLare used as evaluation data, rows with a value of FALSE are used as training data.

The column you specify for DATA_SPLIT_COL can't be used as a feature or label, and the column is excluded from features automatically.

Arguments

A STRING value.

`EVALUATION_TASK`

Syntax

EVALUATION_TASK = 'evaluation_task'

Description

When performing supervised tuning, the type of task that you want to tune the model to perform. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

A STRING value. The valid options are the following:

TEXT_GENERATION
CLASSIFICATION
SUMMARIZATION
QUESTION_ANSWERING
UNSPECIFIED

The default value is UNSPECIFIED.

`AS SELECT`

Syntax

AS SELECT prompt_column, label_column FROM
  `project_id.dataset.table_name`

Description

Provides the training data to use when performing supervised tuning. You can only use this option with a remote model that targets a Vertex AI gemini-1.5-pro-002 or gemini-1.5-flash-002 model. If you specify this option, you must also specify the AS SELECT clause.

Arguments

prompt_column: The name of the column in the training data table that contains the prompt for evaluating the content in the label_column column. This column must be of STRING type or be cast to STRING. If you specify a value for the PROMPT_COL option, you must specify the same value for prompt_column. Otherwise this value must be prompt. If your table does not have a prompt column, use an alias to specify an existing table column. For example, SELECT AS hint AS prompt, label FROM mydataset.mytable.
label_column: The name of the column in the training data table that contains the examples to train the model with. This column must be of STRING type or be cast to STRING. If you specify a value for the INPUT_LABEL_COLS option, you must specify the same value for label_column. Otherwise this value must be label. If your table does not have a label column, use an alias to specify an existing table column. For example, SELECT AS prompt, feature AS label FROM mydataset.mytable.
project_id: The project ID of the project that contains the training data table.
dataset: The dataset name of the dataset that contains the training data table. After optional data splitting, the number of rows in the training dataset has to be greater or equal to 10.
table_name: The name of the training data table.

Supervised tuning

If you create a remote model that references any of the following models, you can optionally configure supervised tuning at the same time:

gemini-1.5-pro-002
gemini-1.5-flash-002
gemini-1.0-pro-002 (Preview)

To configure supervised tuning, specify the AS SELECT clause, and optionally some of the other CREATE MODEL arguments that affect supervised tuning. Supervised tuning lets you train the model on your own data to make it better suited for your use case. However, not all models have their performance improved by tuning. To learn more about whether tuning would make sense for your use case, see Use cases for using supervised fine-tuning.

After you create a tuned model, use the EVALUATE function to ensure that the tuned model performs well for your use case. To learn more, try the Use tuning and evaluation to improve LLM performance tutorial.

Locations

For information about supported locations for supervised tuning with remote models over Vertex AI models, see Supported pipeline job and model upload regions.

Costs

When using supervised tuning with remote models over Vertex AI LLMs, costs are calculated based on the following:

The bytes processed from the training data table specified in the AS SELECT clause. These charges are billed from BigQuery to your project. For more information, see BigQuery pricing.
The number of tokens processed to tune the LLM. These charges are billed from Vertex AI to your project. For more information, see Vertex AI pricing.

Locations

You can create remote models over Vertex AI models in all of the regions that support Generative AI APIs, and also in the US and EU multi-regions.

You can create remote models over Claude models in all of the supported regions for Claude models.

Examples

The following examples create BigQuery ML remote models.

Create a model without tuning

The following example creates a BigQuery ML remote model over a Vertex AI gemini-1.5-flash-002 model:

CREATE OR REPLACE MODEL `mydataset.flash_model`
  REMOTE WITH CONNECTION `myproject.us.test_connection`
  OPTIONS(ENDPOINT = 'gemini-1.5-flash-002');

Create a tuned model

The following example creates a BigQuery ML remote model over a tuned version of the Vertex AI gemini-1.5-pro-002 model:

CREATE OR REPLACE MODEL `mydataset.tuned_model`
  REMOTE WITH CONNECTION `myproject.us.test_connection`
  OPTIONS (
    endpoint = 'gemini-1.5-pro-002',
    max_iterations = 500,
    prompt_col = 'prompt',
    input_label_cols = ['label'])
AS
SELECT
  CONCAT(
    'Please do sentiment analysis on the following text and only output a number from 0 to 5 where 0 means sadness, 1 means joy, 2 means love, 3 means anger, 4 means fear, and 5 means surprise. Text: ',
    sentiment_column) AS prompt,
  text_column AS label
FROM `mydataset.emotion_classification_train`;

What's next

For more information about using Vertex AI models with BigQuery ML, see Generative AI overview.
Try generating text from BigQuery data.
Try customizing a model by using supervised fine tuning.
Try generating embeddings from BigQuery data.

The CREATE MODEL statement for remote models over LLMs

CREATE MODEL syntax

CREATE MODEL

CREATE MODEL IF NOT EXISTS

CREATE OR REPLACE MODEL

model_name

REMOTE WITH CONNECTION

ENDPOINT

Gemini API multimodal models

Gemini API text models

Claude models

multimodalembedding embedding models

text embedding models

PaLM API text models

PROMPT_COL

INPUT_LABEL_COLS

MAX_ITERATIONS

LEARNING_RATE_MULTIPLIER

DATA_SPLIT_METHOD

DATA_SPLIT_EVAL_FRACTION

DATA_SPLIT_COL

EVALUATION_TASK

AS SELECT

Supervised tuning

Locations

Costs

Locations

Examples

Create a model without tuning

Create a tuned model

What's next

`CREATE MODEL` syntax

`CREATE MODEL`

`CREATE MODEL IF NOT EXISTS`

`CREATE OR REPLACE MODEL`

`model_name`

`REMOTE WITH CONNECTION`

`ENDPOINT`

`multimodalembedding` embedding models

`PROMPT_COL`

`INPUT_LABEL_COLS`

`MAX_ITERATIONS`

`LEARNING_RATE_MULTIPLIER`

`DATA_SPLIT_METHOD`

`DATA_SPLIT_EVAL_FRACTION`

`DATA_SPLIT_COL`

`EVALUATION_TASK`

`AS SELECT`