GCP Vertex AI Dataset Import File Generation
Oct 7, 2021
Preparing a dataset for Vertex AI requires creating an import file that accompanies the dataset.
The import file contains:
1. The path of the image
2. Whether it is a training, test, or validation image
3. The annotation: the label(s) for classification, the bounding box(es) for detection, and so on
Both JSONL and CSV formats are supported, except for specific use cases such as entity extraction in text, where only JSONL is supported. The code below generates the CSV import file and uploads the dataset to GCS in parallel.
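To make the row layout concrete, here is a minimal sketch of the three-column rows that the code in this post produces for image classification: ML use, GCS path, label. The helper `make_row`, the bucket name `example-bucket`, and the file names are made up for illustration.

```python
import csv
import io


def make_row(ml_use, bucket_name, object_name, label):
    """Build one import-file row: ML use, GCS URI of the image, label."""
    return [ml_use, f"gs://{bucket_name}/{object_name}", label]


# Hypothetical bucket and object names, for illustration only
rows = [
    make_row("training", "example-bucket", "training/00-damage/img_001.jpg", "00-damage"),
    make_row("validation", "example-bucket", "validation/01-whole/img_002.jpg", "01-whole"),
]

# Render the rows as CSV text in memory instead of writing a file
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each printed line is one image: where it lives in the bucket, which split it belongs to, and its label.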
Code
Import Modules and Define Path to Service Account Token
import json
import os
import csv
from google.cloud import storage
import threading

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "PATH_TO_SERVICE_ACCOUNT_JSON_PATH"
Helper Functions
def write_csv(fName, row_data):
    # Append one row to the CSV import file
    with open(fName, 'a', newline='') as myfile:
        wr = csv.writer(myfile, delimiter=",")
        wr.writerow(row_data)

# Skip this part if you don't want the dataset upload to run automatically as well
def upload_to_bucket(blob_name, path_to_file, bucket):
    # Upload a local file to the given blob name in the bucket
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
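Because write_csv opens the file in append mode, running the script twice leaves duplicate rows in the import file. One possible alternative, shown here only as a sketch, is to collect the rows in a list and write them in a single pass. The name `write_rows` and the sample rows are hypothetical and not part of the original code.

```python
import csv
import os
import tempfile


def write_rows(csv_path, rows):
    """Write all rows in one pass, replacing any existing file."""
    with open(csv_path, "w", newline="") as csv_file:
        csv.writer(csv_file).writerows(rows)


# Hypothetical rows, for illustration only
rows = [
    ["training", "gs://example-bucket/training/00-damage/a.jpg", "00-damage"],
    ["validation", "gs://example-bucket/validation/01-whole/b.jpg", "01-whole"],
]

csv_path = os.path.join(tempfile.mkdtemp(), "inputFile.csv")
write_rows(csv_path, rows)
write_rows(csv_path, rows)  # a second run overwrites instead of duplicating

# Read the file back to confirm it holds exactly one copy of the rows
with open(csv_path, newline="") as csv_file:
    written = list(csv.reader(csv_file))
```

The trade-off is that all rows are held in memory until the end, which is usually fine for a list of file paths.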
Define Paths
# Base dataset path from which the images are read
dataset_path = "/Users/vamramak/Downloads/dataset"
folder_list = ["training", "validation"]
label_list = ["00-damage", "01-whole"]

# Bucket information
bucket_name = "vamramak-automl-dataset"

# Where the CSV should be created
csv_output_path = dataset_path + "/inputFile.csv"

# Initialize client
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
Scan through the Folder Structure
# Simple scan
threads = []
for folder in folder_list:
    for label in label_list:
        file_list_dir = dataset_path + "/" + folder + "/" + label
        file_list = os.listdir(file_list_dir)
        for file_name in file_list:
            path_to_file = file_list_dir + "/" + file_name
            object_name = folder + "/" + label + "/" + file_name
            gcs_file_path = "gs://" + bucket_name + "/" + object_name
            # This starts one thread per file; for a really large dataset,
            # call upload_to_bucket directly instead of using multithreading
            # Skip the upload piece if the goal is only to generate the CSV file
            t = threading.Thread(target=upload_to_bucket, args=(object_name, path_to_file, bucket))
            t.start()
            threads.append(t)
            row_data = [folder, gcs_file_path, label]
            write_csv(csv_output_path, row_data)

# Wait for every upload to finish before moving on
for t in threads:
    t.join()
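For larger datasets, a bounded worker pool is one way to keep the uploads parallel without starting a thread per file. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library. The names `upload_all`, `fake_upload`, and `jobs` are hypothetical, and the upload is simulated so the example runs without a bucket.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(jobs, upload_fn, max_workers=8):
    """Call upload_fn(blob_name, path) for every job using a bounded pool.

    Returns a list of (blob_name, exception) pairs for uploads that failed.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(upload_fn, blob_name, path): blob_name
            for blob_name, path in jobs
        }
        for future in as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failures.append((futures[future], exc))
    return failures


# Stand-in for the real upload so the sketch runs offline
uploaded = []


def fake_upload(blob_name, path_to_file):
    if blob_name.endswith("bad.jpg"):
        raise IOError("simulated upload failure")
    uploaded.append(blob_name)


# Hypothetical (blob name, local path) pairs
jobs = [
    ("training/00-damage/a.jpg", "/tmp/a.jpg"),
    ("training/00-damage/bad.jpg", "/tmp/bad.jpg"),
    ("validation/01-whole/b.jpg", "/tmp/b.jpg"),
]

failures = upload_all(jobs, fake_upload, max_workers=2)
print("uploaded:", sorted(uploaded))
print("failed:", [name for name, _ in failures])
```

With the helper from this post, the real call would look something like `upload_all(jobs, lambda name, path: upload_to_bucket(name, path, bucket))`. Collecting failures also makes it visible when an upload silently breaks, which a bare thread does not report.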
Finally Upload the CSV File
upload_to_bucket("input_file.csv", csv_output_path, bucket)
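A quick sanity check of the generated file can catch malformed rows before an import is attempted. This is a minimal sketch that assumes the three-column layout used above and assumes the ML use values are limited to training, validation, and test, as described at the top of this post. The function `find_bad_rows` and the sample data are hypothetical.

```python
import csv
import io

# Assumption: these are the only split names used in the import file
ALLOWED_ML_USE = {"training", "validation", "test"}


def find_bad_rows(csv_file):
    """Return (line_number, reason) for every row that looks malformed."""
    problems = []
    for line_no, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) != 3:
            problems.append((line_no, "expected 3 columns"))
        elif row[0] not in ALLOWED_ML_USE:
            problems.append((line_no, "unknown ML use: " + row[0]))
        elif not row[1].startswith("gs://"):
            problems.append((line_no, "path is not a gs:// URI"))
        elif not row[2]:
            problems.append((line_no, "empty label"))
    return problems


# In-memory sample with one good row and two bad ones
sample = io.StringIO(
    "training,gs://example-bucket/training/00-damage/a.jpg,00-damage\n"
    "validation,/local/path/b.jpg,01-whole\n"
    "holdout,gs://example-bucket/x.jpg,01-whole\n"
)
problems = find_bad_rows(sample)
for line_no, reason in problems:
    print(line_no, reason)
```

To check the real file, open it with `open(csv_output_path, newline="")` and pass the file object to `find_bad_rows`.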