GCP Vertex AI Dataset Import File Generation

Vamsi Ramakrishnan
Oct 7, 2021

Dataset preparation for Vertex AI requires creating an import file to accompany the dataset.

The import file contains:
1. The path of the image
2. Whether it is a training, test, or validation image
3. The label(s) for classification, the bounding box(es) for detection, etc.

Both JSONL and CSV import files are supported, except in specific use cases such as entity extraction for text, where only JSONL is supported. The code below generates a CSV import file along with parallelized uploads to GCS.
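For single-label image classification, each CSV row takes the form ML_USE,GCS_FILE_PATH,LABEL, which is exactly what the script below emits. A couple of illustrative rows (the image file names are made up):

training,gs://vamramak-automl-dataset/training/00-damage/0001.jpg,00-damage
validation,gs://vamramak-automl-dataset/validation/01-whole/0042.jpg,01-whole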

Code

Import Modules and Define Path to Service Account Token

import os
import csv
import threading
from google.cloud import storage

# Point the client libraries at the service account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "PATH_TO_SERVICE_ACCOUNT_JSON"
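If you prefer not to mutate the process environment, the storage client can also be constructed straight from the key file; a minimal alternative using the client library's from_service_account_json constructor:

storage_client = storage.Client.from_service_account_json("PATH_TO_SERVICE_ACCOUNT_JSON")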

Helper Functions

def write_csv(fName, row_data):
    # Append a single row to the import file
    with open(fName, 'a', newline='') as myfile:
        wr = csv.writer(myfile, delimiter=",")
        wr.writerow(row_data)

# Skip this function if you don't want the dataset upload to run automatically as well
def upload_to_bucket(blob_name, path_to_file, bucket):
    # Upload one local file to the bucket under the given object name
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
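Opening the file in append mode on every call keeps write_csv stateless, but for a very large dataset you may prefer to open the CSV once and stream all rows through a single writer. A minimal sketch (write_rows is a hypothetical helper, not used below):

def write_rows(fName, rows):
    # Open once, then write every row through the same csv.writer
    with open(fName, 'a', newline='') as myfile:
        wr = csv.writer(myfile, delimiter=",")
        wr.writerows(rows)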

Define Paths

# Base dataset path from where the folder structure is scanned
dataset_path = "/Users/vamramak/Downloads/dataset"
folder_list = ["training", "validation"]
label_list = ["00-damage", "01-whole"]
# Bucket information
bucket_name = "vamramak-automl-dataset"
# Where the CSV should be created
csv_output_path = dataset_path + "/inputFile.csv"
# Initialize the GCS client
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
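The scan below assumes the local dataset is laid out as split/label/image, i.e.:

dataset/
  training/
    00-damage/
    01-whole/
  validation/
    00-damage/
    01-whole/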

Scan through the Folder Structure

# Simple scan of the folder structure
for folder in folder_list:
    for label in label_list:
        file_list_dir = dataset_path + "/" + folder + "/" + label
        file_list = os.listdir(file_list_dir)

        for file_name in file_list:

            path_to_file = file_list_dir + "/" + file_name
            object_name = folder + "/" + label + "/" + file_name
            gcs_file_path = "gs://" + bucket_name + "/" + object_name

            # Skip the upload if the goal is only to generate the CSV file.
            # For a really large dataset, call upload_to_bucket directly (or
            # use a bounded thread pool; see the sketch after this block)
            # instead of spawning one thread per file.
            threading.Thread(target=upload_to_bucket, args=(object_name, path_to_file, bucket)).start()
            row_data = [folder, gcs_file_path, label]
            write_csv(csv_output_path, row_data)
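As the comment above notes, one thread per file does not scale to a really large dataset. A bounded alternative is a thread pool; a minimal sketch using concurrent.futures, where max_workers is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=16) as pool:
    for folder in folder_list:
        for label in label_list:
            file_list_dir = dataset_path + "/" + folder + "/" + label
            for file_name in os.listdir(file_list_dir):
                object_name = folder + "/" + label + "/" + file_name
                # Queue the upload; the pool caps concurrency at max_workers
                pool.submit(upload_to_bucket, object_name, file_list_dir + "/" + file_name, bucket)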

Finally, Upload the CSV File

upload_to_bucket("input_file.csv", csv_output_path, bucket)
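With the import file in GCS, the dataset itself can be created and populated. A sketch using the google-cloud-aiplatform SDK (project, region, and display name are placeholders; this assumes a single-label image classification dataset):

from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT", location="us-central1")
dataset = aiplatform.ImageDataset.create(
    display_name="damage-vs-whole",
    gcs_source="gs://" + bucket_name + "/input_file.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)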



Vamsi Ramakrishnan

I work for Google. All views expressed in this publication are my own. Google Cloud | ex-Oracle | Pre-Sales | https://goo.gl/aykaPB