Work with Deep Learning Data in AWS

This example shows how to upload data to an Amazon S3™ bucket.

Before you can perform deep learning training in the cloud, you need to upload your data to the cloud. The example shows how to download the CIFAR-10 data set to your computer, and then upload the data to an Amazon S3 bucket for later use in MATLAB®. The CIFAR-10 data set is a labeled image data set commonly used for benchmarking image classification algorithms. Before running this example, you need access to an Amazon Web Services (AWS®) account. After you upload the data set to Amazon S3, you can try any of the examples in Parallel and Cloud.

Download CIFAR-10 to Local Machine

Specify a local directory in which to download the data set. The following code creates a folder in your current folder containing all the images in the data set.

currentFolder = pwd; % download into the current folder
[trainFolder,testFolder] = downloadCIFARToFolders(currentFolder);
Downloading CIFAR-10 data set...done.
Copying CIFAR-10 to folders...done.
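
Before uploading, it can be useful to confirm that the download produced the expected folder layout. The following is a minimal sketch, assuming that trainFolder and testFolder each contain one subfolder per class; the variable names imdsTrainLocal and imdsTestLocal are illustrative only.

```matlab
% Sketch: inspect the folders returned by downloadCIFARToFolders.
% Assumption: each folder holds one subfolder per class label.
imdsTrainLocal = imageDatastore(trainFolder, ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames");
imdsTestLocal = imageDatastore(testFolder, ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames");

% Tabulate the number of images per class label.
disp(countEachLabel(imdsTrainLocal))
fprintf("Training images: %d, test images: %d\n", ...
    numel(imdsTrainLocal.Files), numel(imdsTestLocal.Files));
```

If a class is missing from the table or the counts differ between classes, rerun the download before uploading.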

Upload Local Data Set to Amazon S3 Bucket

To work with data in the cloud, you can upload to Amazon S3 and then use datastores to access the data in S3 from the workers in your cluster. The following steps describe how to upload the CIFAR-10 data set from your local machine to an Amazon S3 bucket.

1. Log in to your AWS account. For information on creating an account, see AWS: Account.

2. Create an IAM (Identity and Access Management) user using your AWS root account. For more information, see Creating an IAM user in your AWS account.

3. Generate an access key to receive an access key ID and a secret access key. For more information, see Managing Access Keys for IAM Users.

4. Specify your AWS access key ID and secret access key as environment variables in MATLAB. You might also need to specify two optional variables: the geographic region of your bucket, which is typically determined automatically but which the bucket owner might require you to set manually, and the session token, which is needed if you use temporary security credentials, such as with AWS Federated Authentication.

setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID"); 
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");
setenv("AWS_SESSION_TOKEN","YOUR_AWS_SESSION_TOKEN"); % optional
setenv("AWS_DEFAULT_REGION","YOUR_AWS_DEFAULT_REGION"); % optional

5. Copy the environment variables from step 4 to your cluster workers. For information on how to create a cloud cluster, see Create Cloud Cluster (Parallel Computing Toolbox), and for more information on setting environment variables on cluster workers, see Set Environment Variables on Workers (Parallel Computing Toolbox).

6. For efficient file transfers to and from Amazon S3, download and install the AWS Command Line Interface tool from https://aws.amazon.com/cli/. This tool allows you to use commands specific to AWS in your MATLAB command window or at your system's command line.

7. Create a bucket for your data using the following command in your MATLAB command window:

!aws s3 mb s3://mynewbucket

8. Upload your data to the S3 bucket, replacing mylocaldatapath with the path to the CIFAR-10 data.

!aws s3 cp mylocaldatapath s3://mynewbucket --recursive

As an alternative to steps 6-8, you can upload data to Amazon S3 using the AWS S3 web page.
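
Steps 4, 7, and 8 above can also be scripted from MATLAB. The following is a minimal sketch, not a definitive workflow: bucketName and localDataPath reuse the placeholder values from the steps, and dryRun is a hypothetical flag introduced here so that the script only prints the AWS CLI commands until you set it to false.

```matlab
% Sketch: check credentials, then create the bucket and upload via the AWS CLI.
bucketName = "mynewbucket";          % placeholder, replace with your bucket name
localDataPath = "mylocaldatapath";   % placeholder, replace with the CIFAR-10 path
dryRun = true;                       % hypothetical flag: false runs the commands

% Step 4 check: getenv returns an empty value for variables that are not set.
requiredVars = ["AWS_ACCESS_KEY_ID","AWS_SECRET_ACCESS_KEY"];
for name = requiredVars
    if isempty(getenv(char(name)))
        warning("Environment variable %s is not set.", name);
    end
end

% Steps 7 and 8: build the same commands that the steps run with the ! operator.
commands = [ ...
    sprintf("aws s3 mb s3://%s", bucketName), ...
    sprintf("aws s3 cp ""%s"" s3://%s --recursive", localDataPath, bucketName)];

for cmd = commands
    if dryRun
        disp(cmd)                    % print only, nothing is executed
    else
        status = system(cmd);        % nonzero status signals a CLI failure
        assert(status == 0, "Command failed: %s", cmd);
    end
end
```

Calling system instead of the ! operator returns an exit status, so the script stops at the first command that fails rather than continuing silently.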

Use Data Set in MATLAB

After you store your data in Amazon S3, you can use datastores to access the data from your cluster workers. Simply create a datastore pointing to the URL of the S3 bucket. The following sample code shows how to use an imageDatastore to access an S3 bucket. Replace "s3://MyExampleCloudData/cifar10/train" with the URL of your S3 bucket.

imds = imageDatastore("s3://MyExampleCloudData/cifar10/train", ...
 IncludeSubfolders=true, ...
 LabelSource="foldernames");
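
To check that the datastore can reach the bucket, and that cluster workers receive the credentials from step 4, you could run something like the following. This is a sketch under stated assumptions: imds is the datastore created above, a default cluster profile is configured, and only the two required variables are forwarded. Add AWS_SESSION_TOKEN and AWS_DEFAULT_REGION to the list if you set them.

```matlab
% Sketch: confirm that the S3 datastore is readable from the client.
fprintf("Found %d images in %d classes.\n", ...
    numel(imds.Files), numel(categories(imds.Labels)));
img = readimage(imds,1);   % reads a single image from S3
disp(size(img))

% Open a parallel pool and copy the named environment variables
% from the client session to the workers.
pool = parpool(EnvironmentVariables=["AWS_ACCESS_KEY_ID","AWS_SECRET_ACCESS_KEY"]);
```

If readimage fails with an access error, recheck the environment variables on the client before opening the pool.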

With the CIFAR-10 data set now stored in Amazon S3, you can try any of the examples in Parallel and Cloud that show how to use CIFAR-10 in different use cases.
