
Transfer Data to Amazon S3 Buckets

To work with data in the cloud, upload your data to Amazon S3, and then use datastores to access the data in S3 from the workers in your cluster.

  1. For efficient file transfers to and from Amazon S3, download and install the AWS Command Line Interface tool from https://aws.amazon.com/cli/.

  2. Specify your AWS Access Key ID, Secret Access Key, and the Region of your bucket as system environment variables. (You can also set these variables from within MATLAB, as shown in the sketch after these steps.)

    • For example, on Linux, macOS, or Unix:

      export AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
      export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY" 
      export AWS_DEFAULT_REGION="us-east-1"
      

    • On Windows (do not include quotation marks around the values; the set command treats them as part of the value):

      set AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
      set AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
      set AWS_DEFAULT_REGION=us-east-1
      

      To set these environment variables permanently, add them to your user or system environment settings.

    Note

    For MATLAB® releases prior to R2020a, use AWS_REGION instead of AWS_DEFAULT_REGION.

  3. Create a bucket for your data. Either use the Amazon S3 console or a command like the following:

    aws s3 mb s3://mynewbucket

  4. Upload your data using a command like the following:

    aws s3 cp mylocaldatapath s3://mynewbucket --recursive

    For example:

    aws s3 cp path/to/cifar10/in/the/local/machine s3://MyExampleCloudData/cifar10/ --recursive

  5. After creating a cloud cluster, to copy your AWS credentials to your cluster workers, in MATLAB, select Parallel > Manage Cluster Profiles. In the Cluster Profile Manager, select your cloud cluster profile. Scroll to the EnvironmentVariables property and add AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.
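
You can also set these variables for your desktop client MATLAB session by using the setenv function, for example before you create datastores that read from S3. The following is a minimal sketch; the key and region values are placeholders to replace with your own, and setting them in the session does not replace adding the variables to the cluster profile in step 5.

setenv('AWS_ACCESS_KEY_ID','YOUR_AWS_ACCESS_KEY_ID');          % placeholder
setenv('AWS_SECRET_ACCESS_KEY','YOUR_AWS_SECRET_ACCESS_KEY');   % placeholder
setenv('AWS_DEFAULT_REGION','us-east-1');                       % region of your bucket

% Confirm that the current MATLAB session sees the region setting.
getenv('AWS_DEFAULT_REGION')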

After you store your data in Amazon S3, you can use datastores to access the data from your cluster workers. Create a datastore that points to the URL of the S3 bucket. For example, the following sample code shows how to use an imageDatastore to access an S3 bucket. Replace 's3://MyExampleCloudData/cifar10' with the URL of your S3 bucket.

imds = imageDatastore('s3://MyExampleCloudData/cifar10',...
 'IncludeSubfolders',true, ...
 'LabelSource','foldernames');

You can use an imageDatastore to read data from the cloud in your desktop client MATLAB, or when running code on your cluster workers, without changing your code. For details, see Work with Remote Data (MATLAB).
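
For example, the following minimal sketch reads a few images from the same datastore on the cluster workers. It assumes that you have started a parallel pool on your cloud cluster and that the workers can reach the bucket using the credentials set in the cluster profile (step 5).

% Each worker reads image files directly from the S3 bucket.
numToRead = min(numel(imds.Files),8);
parfor idx = 1:numToRead
    img = readimage(imds,idx);
    fprintf('Image %d is %d-by-%d\n',idx,size(img,1),size(img,2));
end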

For a step-by-step example showing deep learning using data stored in Amazon S3, see the white paper Deep Learning with MATLAB and Multiple GPUs.

Headnode Limitation on S3 Uploads

Currently, S3 file uploads work only if the headnode is an ephemeral storage instance type, such as C3. Otherwise, the S3 files are not visible on the worker nodes.

If your cluster has a dedicated headnode, the headnode is an M5 instance regardless of the worker instance type. Because M5 is not an ephemeral storage instance type, S3 files are not visible on the worker nodes.

If your cluster does not use a dedicated headnode and the worker instance type is a non-ephemeral storage instance, such as C4, S3 files are likewise not visible on the worker nodes.
