Main Content

spreadsheetDatastore

Datastore for spreadsheet files

Description

Use a spreadsheetDatastore object to manage large collections of spreadsheet files where the collection does not necessarily fit in memory. You can create a spreadsheetDatastore object using the spreadsheetDatastore function, specify its properties, and then import the data using object functions.

Creation

Description

example

ssds = spreadsheetDatastore(location) creates a spreadsheet datastore from the collection of data specified by location.

ssds = spreadsheetDatastore(location,Name,Value) specifies additional parameters and properties for ssds using one or more name-value arguments. For example, spreadsheetDatastore(location,"FileExtensions",[".xlsx",".xls"]) specifies which files to include in the datastore depending on the file extensions.

Input Arguments

expand all

Files or folders included in the datastore, specified as a FileSet object, as file paths, or as a DsFileSet object.

  • FileSet object — You can specify location as a FileSet object. Specifying the location as a FileSet object leads to a faster construction time for datastores compared to specifying a path or DsFileSet object. For more information, see matlab.io.datastore.FileSet.

  • File path — You can specify a single file path as a character vector or string scalar. You can specify multiple file paths as a cell array of character vectors or a string array.

  • DsFileSet object — You can specify a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.

Files or folders may be local or remote:

  • Local files or folders — Specify local paths to files or folders. If the files are not in the current folder, then specify full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.

  • Remote files or folders — Specify full paths to remote files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.

The spreadsheetDatastore function supports files with the extensions: .xls, .xlsx, .xlsm, .xltx, and .xltm.

Example: "file1.xlsx"

Example: "../dir/data/file1.xlsx"

Example: ["C:\dir\data\file1.xlsx","C:\dir\data\file2.xlsx"]

Example: "s3://bucketname/path_to_files/*.xls"

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: ssds = spreadsheetDatastore("C:\dir\spreadsheetdata","FileExtensions",[".xls",".xlsm"])

Spreadsheet file extensions, specified as the name-value argument consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

  • If you do not specify "FileExtensions", then spreadsheetDatastore automatically includes all files with .xls, .xlsx, .xlsm, .xltx, and .xltm extensions in the specified path.

  • If you want to include spreadsheets with non-standard file extensions in the SpreadsheetDatastore, then specify those extensions explicitly.

  • If you want to create a SpreadsheetDatastore for files without any extensions, then specify "FileExtensions" as an empty character vector, ''.

Example: "FileExtensions",''

Example: "FileExtensions",".xls"

Example: "FileExtensions",[".xlsx",".xlsm"]

Data Types: char | cell | string

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" and true, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

Output datatype, specified as the name-value argument consisting of "OutputType" and one of these values:

  • "auto" — Detects if the output from the datastore should be a table or a timetable based on whether you specify the "RowTimes" name-value argument. If you specify "RowTimes" then the output is a timetable; otherwise, the output is a table.

  • "table" — Return a table.

  • "timetable" — Return a timetable.

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the "RowTimes" name-value pair to return timetables from spreadsheetDatastore.

Example: "OutputType","timetable"

Data Types: char | string

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB® Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use "AlternateFileSystemRoots" to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify "AlternateFileSystemRoots" as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify "AlternateFileSystemRoots" as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify "AlternateFileSystemRoots" as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify "AlternateFileSystemRoots" as a cell array of cell array of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of "AlternateFileSystemRoots" must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Output data type of text variables, specified as the name-value argument consisting of "TextType" and either "char" or "string".

  • If the output table from the read, readall, or preview functions contains text variables, then "TextType" specifies the data type of those variables.

  • If "TextType" is "char", then the output is a cell array of character vectors.

  • If "TextType" is "string", then the output has type string.

Data Types: char | string

In addition to these name-value pairs, you also can specify any of the properties on this page as name-value pairs (except for the Files property).

Properties

expand all

SpreadsheetDatastore properties describe the format of the files in a datastore object, and control how the data is read from the datastore. Except for the Files property, you can specify the value of SpreadsheetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation:

ds = spreadsheetDatastore("airlinesmall_subset.xlsx");
ds.Sheets = [1 2];
ds.Range = "A1:C6";

File Properties

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument in the spreadsheetDatastore and datastore functions define these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: ["C:\dir\data\file1.xls";"C:\dir\data\file2.xls"]

Data Types: cell | string

Number of lines to skip at the beginning of each sheet when reading, specified as a positive integer. When you also specify the Range property, NumHeaderLines is the number of lines to skip at the beginning of the specified block of data.

Data Types: double

Sheets in files, specified as a character vector, cell array of character vectors, string scalar, or string array containing sheet names, or as a numeric vector of sheet indices. The empty character vector '' indicates that all sheets in the files are included.

Example: ["sheet1","sheet7"]

Example: [3 5 7]

Data Types: char | cell | string | double

Row and column bounds, specified as a character vector or string scalar that defines a rectangular block of data in the sheets. The empty character vector '' indicates that the bounds are the beginning of the file and the end of the data.

Example: "B1:T7"

Example: "A:C"

Data Types: char | string

Indicator for reading the first row of the first file in the datastore as variable names, specified as either true (1) or false (0).

  • If true, then the first nonheader row of the first file determines the variable names for the data.

  • If false, then the first nonheader row of the first file contains the first row of data. Default variable names are assigned as Var1, Var2, and so on.

Data Types: logical | double

Flag to preserve variable names, specified as either "modify" or "preserve".

  • "modify" — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.

  • "preserve" — Preserve variable names that are not valid MATLAB identifiers such as variable names that include spaces and non-ASCII characters.

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specify VariableNames with a character vector or string scalar, however the datastore converts and stores the property value to a cell array of character vectors. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

If ReadVariableNames is false, then VariableNames defaults to ["Var1","Var2", ...].

Example: ["Time","Date","Quantity"]

Data Types: char | cell | string

Variable types, specified as "double", "char", "string", "categorical", or "datetime", which indicates the type of each variable when reading the data.

The list of variable types corresponds with the variables in VariableNames. Types double, char, and datetime can be automatically detected from the data. You can specify VariableTypes as a character vector or string scalar, however spreadsheetDatastore automatically converts it to a cell array of character vectors or a string array.

Example: ["char","categorical"]

Data Types: char | cell | string

Properties for preview, read, readall Table

Variables to read from the file, specified as a character vector, cell array of character vectors, string scalar, or string array. Each character vector or string contains the name of one variable. You can specify the variable names in any order. You can specify SelectedVariableNames with a character vector or string scalar, however spreadsheetDatastore automatically converts it to a cell array of character vectors or a string array.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of VariableNamingRule parameter to "preserve".

Example: ["Var3","Var7","Var4"]

Data Types: char | cell | string

Selected variable types, specified as "double", "char", "string", "categorical", or "datetime", which indicates the type of each selected variable when reading the data. The list of variable types corresponds with the variables in SelectedVariableNames. Types double, char, and datetime can be automatically detected from the data. You can specify SelectedVariableTypes as a character vector or string scalar, however it is automatically converted to a cell array of character vectors or string array.

Example: ["double","datetime"]

Data Types: char | cell | string

Amount of data to read in a call to the read function, specified as "file" or "sheet", or as a positive integer scalar.

  • If ReadSize is "file", then each call to read reads all the data one file at a time.

  • If ReadSize is "sheet", then each call to read reads all the data one sheet at a time.

  • If ReadSize is a positive integer, then each call to read reads the rows specified by ReadSize, or fewer if it reaches the end of the data.

When you change ReadSize from an integer scalar to "file" or "sheet", or conversely, the datastore resets using the reset function.

Data Types: char | string | double

Name of row times variable, specified as the name-value argument consisting of "RowTimes" and a variable name (such as "Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then spreadsheetDatastore uses the first datetime or duration variable as the row times for the timetable.

Properties for use by writeall

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the spreadsheetDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of a SpreadsheetDatastore object.

Data Types: cell

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string

Object Functions

hasdataDetermine if data is available to read
numpartitionsNumber of datastore partitions
partitionPartition a datastore
previewPreview subset of data in datastore
readRead data in datastore
readallRead all data in datastore
writeallWrite datastore to files
resetReset datastore to initial state
sheetnamesQuery sheet names from datastore
transformTransform datastore
combineCombine data from multiple datastores
isPartitionableDetermine whether datastore is partitionable
isSubsettableDetermine whether datastore is subsettable
isShuffleableDetermine whether datastore is shuffleable

Examples

collapse all

Create a spreadsheetDatastore object using either a FileSet object or a file path.

Create a FileSet object. Create a spreadsheetDatastore object.

fs = matlab.io.datastore.FileSet("airlinesmall_subset.xlsx");
ssds = spreadsheetDatastore(fs)
ssds = 
  SpreadsheetDatastore with properties:

                      Files: {
                             ' ...\Documents\MATLAB\Examples\airlinesmall_subset.xlsx'
                             }
                    Folders: {
                             ' ...\Documents\MATLAB\Examples'
                             }
   AlternateFileSystemRoots: {}
                     Sheets: ''
                      Range: ''

  Sheet Format Properties:
             NumHeaderLines: 0
         VariableNamingRule: 'modify'
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
              VariableTypes: {'double', 'double', 'double' ... and 26 more}

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
      SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more}
                   ReadSize: 'file'
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
        DefaultOutputFormat: "xlsx"

Alternatively, you can specify your data with a file path.

ssds = spreadsheetDatastore("airlinesmall_subset.xlsx");

Create a SpreadsheetDatastore object containing the file airlinesmall_subset.xlsx.

ssds = spreadsheetDatastore("airlinesmall_subset.xlsx")
ssds = 
  SpreadsheetDatastore with properties:

                      Files: {
                             ' ...\Documents\MATLAB\Examples\airlinesmall_subset.xlsx'
                             }
                    Folders: {
                             ' ...\Documents\MATLAB\Examples'
                             }
   AlternateFileSystemRoots: {}
                     Sheets: ''
                      Range: ''

  Sheet Format Properties:
             NumHeaderLines: 0
         VariableNamingRule: 'modify'
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
              VariableTypes: {'double', 'double', 'double' ... and 26 more}

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
      SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more}
                   ReadSize: 'file'
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
        DefaultOutputFormat: "xlsx"

Display the sheet names for the file. The file contains one sheet per year.

sheetnames(ssds,1)
ans = 13x1 string
    "1996"
    "1997"
    "1998"
    "1999"
    "2000"
    "2001"
    "2002"
    "2003"
    "2004"
    "2005"
    "2006"
    "2007"
    "2008"

Specify the variable FlightNum in the second sheet as the data of interest, and preview the first eight rows.

ssds.Sheets = 2; 
ssds.SelectedVariableNames = "FlightNum";
preview(ssds)
ans=8×1 table
    FlightNum
    _________

      1014   
      1201   
       702   
      1184   
      1310   
      1759   
      1242   
      1558   

Read only the first three rows of variables DepTime and ArrTime in the first sheet.

ssds.ReadSize = 3;
ssds.Sheets = 1;
ssds.SelectedVariableNames = ["DepTime","ArrTime"];
read(ssds)
ans=3×2 table
    DepTime    ArrTime
    _______    _______

     2117       2305  
     1252       1511  
     1441       1708  

Read all of sheets four, five, and six.

ssds.Sheets = 4:6;
readall(ssds);

Use the OutputType and RowTimes name-value pairs to make SpreadsheetDatastore return timetables instead of tables.

Create a datastore for tsunamis_dated.xlsx. Specify "OutputType" as "timetable" so that SpreadsheetDatastore returns timetables instead of tables.

ssds = spreadsheetDatastore("tsunamis_dated.xlsx","OutputType","timetable");
preview(ssds)
ans=8×15 timetable
       Date        Latitude    Longitude    ValidityCode          Validity          CauseCode               Cause                EarthquakeMagnitude       Country                 Location              MaxHeight    IidaMagnitude    Intensity    NumDeaths    DescDeaths              Time          
    ___________    ________    _________    ____________    ____________________    _________    ____________________________    ___________________    _____________    ____________________________    _________    _____________    _________    _________    __________    ________________________

    04-Mar-1952     42.15        143.85          4          {'definite tsunami'}        1        {'Earthquake'              }            8.1            {'JAPAN'    }    {'SE. HOKKAIDO ISLAND'     }        6.5           2.7              2           33           1         04-Mar-1952 01:22:41.000
    10-Jul-1958     58.34       -136.52          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            8.3            {'USA'      }    {'SE. ALASKA, AK'          }     524.26           4.6              5            5           1         10-Jul-1958 06:15:53.600
    22-May-1960     -39.5         -74.5          4          {'definite tsunami'}        1        {'Earthquake'              }            9.5            {'CHILE'    }    {'CENTRAL CHILE'           }         25           4.6              4         1260           3         22-May-1960 19:11:17.000
    20-Nov-1960      -6.8         -80.7          4          {'definite tsunami'}        1        {'Earthquake'              }            6.8            {'PERU'     }    {'PERU'                    }          9           3.2            2.5           66           2         20-Nov-1960 22:01:56.400
    28-Mar-1964      61.1        -147.5          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            9.2            {'USA'      }    {'PRINCE WILLIAM SOUND, AK'}         67           6.1              5          221           3         28-Mar-1964 03:36:14.000
    16-Jun-1964     38.65         139.2          4          {'definite tsunami'}        1        {'Earthquake'              }            7.5            {'JAPAN'    }    {'NW. HONSHU ISLAND'       }        5.8           2.7              2           26           1         16-Jun-1964 04:01:44.300
    14-Aug-1968       0.2         119.8          4          {'definite tsunami'}        1        {'Earthquake'              }            7.8            {'INDONESIA'}    {'BANDA SEA'               }         10           3.3              3          200           3         14-Aug-1968 22:14:19.400
    23-Feb-1969      -3.1         118.9          4          {'definite tsunami'}        1        {'Earthquake'              }            6.9            {'INDONESIA'}    {'MAKASSAR STRAIT'         }          4             2              2          600           3         23-Feb-1969 00:36:56.600

When you do not specify "RowTimes", spreadsheetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times. This data has two datetime variables: Date and Time. The Date variable does not include information about hours, minutes, or seconds, while the Time variable has the specific time of each event.

Specify the "RowTimes" option to use the event times (the Time variable) as the row times.

ssds = spreadsheetDatastore("tsunamis_dated.xlsx","OutputType","timetable","RowTimes","Time");
preview(ssds)
ans=8×15 timetable
              Time              Latitude    Longitude    ValidityCode          Validity          CauseCode               Cause                EarthquakeMagnitude       Country                 Location              MaxHeight    IidaMagnitude    Intensity    NumDeaths    DescDeaths       Date    
    ________________________    ________    _________    ____________    ____________________    _________    ____________________________    ___________________    _____________    ____________________________    _________    _____________    _________    _________    __________    ___________

    04-Mar-1952 01:22:41.000     42.15        143.85          4          {'definite tsunami'}        1        {'Earthquake'              }            8.1            {'JAPAN'    }    {'SE. HOKKAIDO ISLAND'     }        6.5           2.7              2           33           1         04-Mar-1952
    10-Jul-1958 06:15:53.600     58.34       -136.52          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            8.3            {'USA'      }    {'SE. ALASKA, AK'          }     524.26           4.6              5            5           1         10-Jul-1958
    22-May-1960 19:11:17.000     -39.5         -74.5          4          {'definite tsunami'}        1        {'Earthquake'              }            9.5            {'CHILE'    }    {'CENTRAL CHILE'           }         25           4.6              4         1260           3         22-May-1960
    20-Nov-1960 22:01:56.400      -6.8         -80.7          4          {'definite tsunami'}        1        {'Earthquake'              }            6.8            {'PERU'     }    {'PERU'                    }          9           3.2            2.5           66           2         20-Nov-1960
    28-Mar-1964 03:36:14.000      61.1        -147.5          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            9.2            {'USA'      }    {'PRINCE WILLIAM SOUND, AK'}         67           6.1              5          221           3         28-Mar-1964
    16-Jun-1964 04:01:44.300     38.65         139.2          4          {'definite tsunami'}        1        {'Earthquake'              }            7.5            {'JAPAN'    }    {'NW. HONSHU ISLAND'       }        5.8           2.7              2           26           1         16-Jun-1964
    14-Aug-1968 22:14:19.400       0.2         119.8          4          {'definite tsunami'}        1        {'Earthquake'              }            7.8            {'INDONESIA'}    {'BANDA SEA'               }         10           3.3              3          200           3         14-Aug-1968
    23-Feb-1969 00:36:56.600      -3.1         118.9          4          {'definite tsunami'}        1        {'Earthquake'              }            6.9            {'INDONESIA'}    {'MAKASSAR STRAIT'         }          4             2              2          600           3         23-Feb-1969

Version History

Introduced in R2016a