GCP Vertex AI Dataset Import File Generation

Vamsi Ramakrishnan
Oct 7, 2021

Dataset preparation for Vertex AI requires creating an import file to accompany the dataset.

The import file contains:
1. The path of the image
2. Whether it is a training, test, or validation image
3. The label(s) for classification, the bounding box(es) for detection, etc.

Both JSONL and CSV import files are supported, except in specific use cases such as entity extraction for text, where only JSONL is supported. The code below generates a CSV import file along with parallelized uploads to GCS.
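For single-label image classification, each CSV row takes the form ML_USE,GCS_FILE_PATH,LABEL, which is exactly what the script below emits. A couple of illustrative rows (the image file names are made up):

training,gs://vamramak-automl-dataset/training/00-damage/0001.jpg,00-damage
validation,gs://vamramak-automl-dataset/validation/01-whole/0042.jpg,01-whole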

Code

Import Modules and Define Path to Service Account Token

import os
import csv
import threading
from google.cloud import storage

# Point the client libraries at the service account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "PATH_TO_SERVICE_ACCOUNT_JSON"
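If you prefer not to mutate the process environment, the storage client can also be constructed straight from the key file; a minimal alternative using the client library's from_service_account_json constructor:

storage_client = storage.Client.from_service_account_json("PATH_TO_SERVICE_ACCOUNT_JSON")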

Helper Functions

def write_csv(fName, row_data):
    # Append a single row to the import file
    with open(fName, 'a', newline='') as myfile:
        wr = csv.writer(myfile, delimiter=",")
        wr.writerow(row_data)

# Skip this function if you don't want the dataset upload to run automatically as well
def upload_to_bucket(blob_name, path_to_file, bucket):
    # Upload one local file to the bucket under the given object name
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
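Opening the file in append mode on every call keeps write_csv stateless, but for a very large dataset you may prefer to open the CSV once and stream all rows through a single writer. A minimal sketch (write_rows is a hypothetical helper, not used below):

def write_rows(fName, rows):
    # Open once, then write every row through the same csv.writer
    with open(fName, 'a', newline='') as myfile:
        wr = csv.writer(myfile, delimiter=",")
        wr.writerows(rows)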

Define Paths

# Base dataset path from where the folder structure is scanned
dataset_path = "/Users/vamramak/Downloads/dataset"
folder_list = ["training", "validation"]
label_list = ["00-damage", "01-whole"]
# Bucket information
bucket_name = "vamramak-automl-dataset"
# Where the CSV should be created
csv_output_path = dataset_path + "/inputFile.csv"
# Initialize the GCS client
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
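The scan below assumes the local dataset is laid out as split/label/image, i.e.:

dataset/
  training/
    00-damage/
    01-whole/
  validation/
    00-damage/
    01-whole/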

Scan through the Folder Structure

# Simple scan of the folder structure
for folder in folder_list:
    for label in label_list:
        file_list_dir = dataset_path + "/" + folder + "/" + label
        file_list = os.listdir(file_list_dir)

        for file_name in file_list:

            path_to_file = file_list_dir + "/" + file_name
            object_name = folder + "/" + label + "/" + file_name
            gcs_file_path = "gs://" + bucket_name + "/" + object_name

            # Skip the upload if the goal is only to generate the CSV file.
            # For a really large dataset, call upload_to_bucket directly (or
            # use a bounded thread pool; see the sketch after this block)
            # instead of spawning one thread per file.
            threading.Thread(target=upload_to_bucket, args=(object_name, path_to_file, bucket)).start()
            row_data = [folder, gcs_file_path, label]
            write_csv(csv_output_path, row_data)
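As the comment above notes, one thread per file does not scale to a really large dataset. A bounded alternative is a thread pool; a minimal sketch using concurrent.futures, where max_workers is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=16) as pool:
    for folder in folder_list:
        for label in label_list:
            file_list_dir = dataset_path + "/" + folder + "/" + label
            for file_name in os.listdir(file_list_dir):
                object_name = folder + "/" + label + "/" + file_name
                # Queue the upload; the pool caps concurrency at max_workers
                pool.submit(upload_to_bucket, object_name, file_list_dir + "/" + file_name, bucket)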

Finally, Upload the CSV File

upload_to_bucket("input_file.csv", csv_output_path, bucket)
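With the import file in GCS, the dataset itself can be created and populated. A sketch using the google-cloud-aiplatform SDK (project, region, and display name are placeholders; this assumes a single-label image classification dataset):

from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT", location="us-central1")
dataset = aiplatform.ImageDataset.create(
    display_name="damage-vs-whole",
    gcs_source="gs://" + bucket_name + "/input_file.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)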



Vamsi Ramakrishnan

I work for Google. All views expressed in this publication are my own. Google Cloud | ex-Oracle | Pre-Sales | https://goo.gl/aykaPB