parquetDatastore
Datastore for collection of Parquet files
Description
Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.
Creation
Description
pds = parquetDatastore(location) creates a datastore pds from the collection of Parquet files specified by location.
pds = parquetDatastore(location,Name,Value) specifies additional parameters and properties for pds using one or more name-value pair arguments.
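As a minimal sketch of the first syntax, the following code creates a datastore from the sample file outages.parquet (the file used in the examples on this page, assumed here to be on the MATLAB path) and reads it chunk by chunk until no data remains. The variable names chunk and totalRows are illustrative only.

```matlab
% Create a datastore from a single Parquet file on the MATLAB path.
pds = parquetDatastore("outages.parquet");

% Read the datastore one chunk at a time and count the rows.
totalRows = 0;
while hasdata(pds)
    chunk = read(pds);                      % table holding the next chunk
    totalRows = totalRows + height(chunk);
end

% Return the datastore to its unread state so it can be read again.
reset(pds);
```

The amount of data returned by each call to read depends on the ReadSize property, so the number of loop iterations can vary while the total row count stays the same.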
Input Arguments
location
— Files or folders to include in datastore
FileSet
object | file path | DsFileSet
object
Files or folders included in the datastore, specified as a
FileSet
object, as file paths, or as a DsFileSet
object.
FileSet object: You can specify location as a FileSet object. Specifying the location as a FileSet object leads to a faster construction time for datastores compared to specifying a path or DsFileSet object. For more information, see matlab.io.datastore.FileSet.
File path: You can specify a single file path as a character vector or string scalar. You can specify multiple file paths as a cell array of character vectors or a string array.
DsFileSet object: You can specify a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.
Files or folders may be local or remote:
Local files or folders: Specify local paths to files or folders. If the files are not in the current folder, then specify full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.
Remote files or folders: Specify full paths to remote files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.
When you specify a folder, the datastore includes only files with supported file
formats and ignores files with any other format. To specify a custom list of file extensions to
include in your datastore, see the FileExtensions
property.
The parquetDatastore
function supports the
.parquet
file format.
Example: "myfile.parquet"
Example: "../dir/data/myfile.parquet"
Example: ["C:\dir\data\myfile01.parquet","C:\dir\data\myfile02.parquet"]
Example: "s3://bucketname/path_to_files/*.parquet"
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN
, where Name
is
the argument name and Value
is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose
Name
in quotes.
Example: "IncludeSubfolders",true
FileExtensions
— Extensions to include in datastore
character vector | cell array of character vectors | string scalar | string array
Extensions to include in datastore, specified as the name-value argument
consisting of "FileExtensions"
and a character vector, cell array
of character vectors, string scalar, or string array.
If you do not specify "FileExtensions", then parquetDatastore automatically includes all files with .parquet and .parq extensions in the specified path.
If you want to include Parquet files with non-standard file extensions in the parquetDatastore, then specify those extensions explicitly.
If you want to create a parquetDatastore for files without any extensions, then specify "FileExtensions" as an empty character vector, ''.
Example: "FileExtensions",[".parquet",".parq"]
Example: "FileExtensions",".myformat"
Example: "FileExtensions",''
Data Types: char
| cell
| string
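The following sketch illustrates the "FileExtensions" argument under some assumptions: it writes a small table to a temporary folder with parquetwrite, renames the file to use a made-up extension (.myformat), and then builds a datastore that includes only files with that extension. The folder, file, and variable names are hypothetical.

```matlab
% Write a small Parquet file to a temporary folder, then rename it so that
% it has a non-standard extension (.myformat is a made-up extension).
folder = tempname;
mkdir(folder);
T = table([1;2;3],["a";"b";"c"],'VariableNames',{'Id','Label'});
parquetwrite(fullfile(folder,"part1.parquet"),T);
movefile(fullfile(folder,"part1.parquet"),fullfile(folder,"part1.myformat"));

% Include only the files that have the custom extension.
pds = parquetDatastore(folder,"FileExtensions",".myformat");
data = readall(pds);
```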
IncludeSubfolders
— Subfolder inclusion flag
false
(default) | true
Subfolder inclusion flag, specified as the name-value argument consisting of
"IncludeSubfolders"
and true
or
false
. Specify true
to include all files and
subfolders within each folder or false
to include only the files
within each folder.
If you do not specify "IncludeSubfolders"
, then the default value is
false
.
Example: "IncludeSubfolders",true
Data Types: logical
| double
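To see the effect of "IncludeSubfolders", the sketch below builds a temporary folder that contains one Parquet file at the top level and one in a subfolder, then compares the number of files each datastore picks up. The folder layout and file names are assumptions made for illustration.

```matlab
% Build a temporary folder with one file at the top level and one file in
% a nested subfolder.
top = tempname;
sub = fullfile(top,"nested");
mkdir(top);
mkdir(sub);
T = table((1:4)','VariableNames',{'Value'});
parquetwrite(fullfile(top,"a.parquet"),T);
parquetwrite(fullfile(sub,"b.parquet"),T);

% By default, only files directly inside the folder are included.
pdsTop = parquetDatastore(top);

% With IncludeSubfolders set to true, files in subfolders are included too.
pdsAll = parquetDatastore(top,"IncludeSubfolders",true);

numTop = numel(pdsTop.Files);
numAll = numel(pdsAll.Files);
```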
OutputType
— Output datatype
"auto"
(default) | "table"
| "timetable"
Output datatype, specified as the name-value argument consisting of "OutputType" and one of these values: "auto", "table", or "timetable".
The value of OutputType
determines the data type returned by the preview
, read
, and readall
functions. Use this option in conjunction with the
"RowTimes"
name-value pair to return timetables from
ParquetDatastore
.
Example: "OutputType","timetable"
Data Types: char
| string
VariableNamingRule
— Flag to preserve variable names
"modify"
(default) | "preserve"
Flag to preserve variable names, specified as either "modify" or "preserve".
"modify": Convert invalid variable names (as determined by the isvarname function) to valid MATLAB® identifiers.
"preserve": Preserve variable names that are not valid MATLAB identifiers, such as variable names that include spaces and non-ASCII characters.
Starting in R2019b, variable names and row names can include any characters, including
spaces and non-ASCII characters. Also, they can start with any characters, not just
letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname
function). To preserve these variable names and row names, set
the value of VariableNamingRule
to "preserve"
.
Variable names are not refreshed when the value of VariableNamingRule
is changed from "modify"
to "preserve"
.
Data Types: char
| string
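The difference between the two VariableNamingRule values can be sketched as follows. The code writes a small Parquet file whose only variable name contains a space, then creates one datastore with the default rule and one with "preserve". The file name and the variable name 'Outage Time' are hypothetical, and the exact name produced by "modify" is not shown because it depends on how MATLAB converts the name.

```matlab
% Write a Parquet file whose variable name is not a valid MATLAB identifier.
folder = tempname;
mkdir(folder);
file = fullfile(folder,"names.parquet");
T = table([10;20],'VariableNames',{'Outage Time'});
parquetwrite(file,T);

% Default rule ("modify"): invalid names are converted to valid identifiers.
pdsModify = parquetDatastore(file);

% "preserve": the original names are kept as they appear in the file.
pdsPreserve = parquetDatastore(file,"VariableNamingRule","preserve");

namesModify   = pdsModify.VariableNames;
namesPreserve = pdsPreserve.VariableNames;
```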
AlternateFileSystemRoots
— Alternate file system root paths
string vector | cell array
Alternate file system root paths, specified as the name-value argument consisting of
"AlternateFileSystemRoots"
and a string vector or a cell array. Use
"AlternateFileSystemRoots"
when you create a datastore on a local
machine, but need to access and process the data on another machine (possibly of a different
operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB
Parallel Server™, and the data is stored on your local machines with a copy of the data available
on different platform cloud or cluster machines, you must use
"AlternateFileSystemRoots"
to associate the root paths.
To associate a set of root paths that are equivalent to one another, specify "AlternateFileSystemRoots" as a string vector. For example:
["Z:\datasets","/mynetwork/datasets"]
To associate multiple sets of root paths that are equivalent for the datastore, specify "AlternateFileSystemRoots" as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:
Specify "AlternateFileSystemRoots" as a cell array of string vectors.
{["Z:\datasets", "/mynetwork/datasets"];...
 ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}
Alternatively, specify "AlternateFileSystemRoots" as a cell array of cell arrays of character vectors.
{{'Z:\datasets','/mynetwork/datasets'};...
 {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}
The value of "AlternateFileSystemRoots"
must satisfy these conditions:
Contains one or more rows, where each row specifies a set of equivalent root paths.
Each row specifies multiple root paths and each root path must contain at least two characters.
Root paths are unique and are not subfolders of one another.
Contains at least one root path entry that points to the location of the files.
For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
Example: ["Z:\datasets","/mynetwork/datasets"]
Data Types: string
| cell
Properties
ParquetDatastore
properties describe the format of
the files in a datastore object, and control how the data is read from the datastore. With
the exception of the Files
property, you can specify the value of
ParquetDatastore
properties using name-value pair arguments when you
create the datastore object. To view or modify a property after creating the object, use the
dot notation.
Files
— Files included in datastore
cell array of character vectors | string array
Files included in the datastore, resolved as a cell array of character vectors or a
string array, where each character vector or string is a full path to a file. The
location
argument defines these files.
The first file specified in the cell array determines the variable names and format information for all files in the datastore.
Example: {"C:\dir\data\file1.ext";"C:\dir\data\file2.ext"}
Data Types: cell
| string
Folders
— Folders used to construct datastore
cell array of character vectors
This property is read-only.
Folders used to construct datastore, returned as a cell array of character vectors.
The cell array is oriented as a column vector. Each character vector is a path to a
folder that contains data files. The location
argument in the
parquetDatastore
and datastore
functions defines
Folders
when the datastore is created.
The Folders
property is reset when you modify the
Files
property of a ParquetDatastore
object.
Data Types: cell
RowFilter
— Filter to select rows to import
matlab.io.RowFilter
object
Filter to select rows to import, specified as a
matlab.io.RowFilter
object. The
matlab.io.RowFilter
object designates conditions each row must
satisfy to be included in your output table or timetable. If you do not specify
RowFilter
, then parquetDatastore
imports all
rows from the input Parquet file.
ReadSize
— Amount of data to read per read step
"rowgroup"
(default) | "file"
| positive integer
Amount of data to read per read step, specified as one of these values:
"rowgroup": Each read step reads the number of rows in the row groups of the Parquet data. To get the number of rows in each row group, see the RowGroupHeights property of the ParquetInfo object.
"file": Each read step reads all of the data in one file.
positive integer: Each read step reads the specified number of rows.
When you change ReadSize
from a positive integer to
"file"
or "rowgroup"
, or from
"file"
or "rowgroup"
to a positive integer,
MATLAB resets the datastore to an unread state, where no data has been read from
it.
In a parallel processing workflow (Parallel Computing Toolbox), the data is read in steps from each parallel worker. In a serial
workflow, the data is read in steps from the input location
.
Data Types: string
| char
| double
PartitionMethod
— Partition unit for parallel processing
"auto"
(default) | "file"
| "bytes"
| "rowgroup"
Since R2023b
Partition unit for parallel processing, specified as one of the values in the following table.
In a parallel processing workflow (Parallel Computing Toolbox), PartitionMethod
determines the amount of data to send
to each parallel worker. The amount of data to send to each worker is approximately
calculated by the total number of partition units divided by the number of parallel
workers. In a serial workflow, the PartitionMethod
name-value
argument is ignored.
Value | Description |
---|---|
"auto" | parquetDatastore selects a partition unit based on the ReadSize name-value argument to balance the workload between parallel workers. |
"file" | Partitions are based on the total number of files. |
"bytes" | Partitions are based on the number of bytes specified by the BlockSize property. |
"rowgroup" | Partitions are based on the total number of row groups. |
Granularity and speed of processing depend on the combination of
PartitionMethod
and ReadSize
values. While
PartitionMethod
determines how much data to send to each parallel
worker, ReadSize
determines how much data to read per read step. This
table shows supported PartitionMethod
and ReadSize
combinations and their relative granularities and partitioning times.
Granularity, Partitioning Time | PartitionMethod | ReadSize |
---|---|---|
High granularity, long partitioning time | rowgroup | rowgroup |
High granularity, long partitioning time | rowgroup | positive integer |
Moderate granularity, moderate partitioning time | bytes | rowgroup |
Low granularity, short partitioning time | file | file |
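As a rough illustration of one supported combination, the sketch below sets both PartitionMethod and ReadSize to "rowgroup" and then walks through every partition serially with numpartitions and partition. It assumes R2023b or later and the sample file outages.parquet on the MATLAB path. In a real parallel workflow each partition would be processed on a worker, so the serial loop here only shows how the partitions together cover the data.

```matlab
% Partition and read by row group (PartitionMethod requires R2023b or later).
pds = parquetDatastore("outages.parquet", ...
    PartitionMethod="rowgroup",ReadSize="rowgroup");

% Ask the datastore how many partitions it would use by default.
n = numpartitions(pds);

% Read each partition in turn and record how many rows it contains.
rowsPerPartition = zeros(n,1);
for k = 1:n
    part = partition(pds,n,k);
    while hasdata(part)
        rowsPerPartition(k) = rowsPerPartition(k) + height(read(part));
    end
end
totalRows = sum(rowsPerPartition);
```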
BlockSize
— Number of bytes per partition
128000000
(default) | positive integer
Since R2023b
Number of bytes per partition, specified as a positive integer. Specify this
argument if PartitionMethod
is "bytes"
. By
default, the value of BlockSize
is 128000000
bytes
(128 MB).
Example: BlockSize=1000000
VariableNames
— Names of variables
character vector | cell array of character vectors | string scalar | string array
Names of variables in the datastore, specified as a character vector, cell array of character
vectors, string scalar, or string array. Specify the variable names in the order in
which they appear in the files. If you do not specify the variable names, the datastore
detects them from the first nonheader line in the first file. You can specify
VariableNames
with a character vector or string scalar, however
the datastore converts and stores the property value to a cell array of character
vectors. When modifying the VariableNames
property, the number of new
variable names must match the number of original variable names.
To support invalid MATLAB identifiers as variable names, such as variable names containing spaces
and non-ASCII characters, set the value of the VariableNamingRule
parameter to "preserve"
.
If ReadVariableNames
is false
, then
VariableNames
defaults to ["Var1","Var2",
...]
.
Example: ["Time","Date","Quantity"]
Data Types: char
| cell
| string
SelectedVariableNames
— Variables to read
cell array of character vectors | string array
Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.
To support invalid MATLAB identifiers as variable names, such as variable names
containing spaces and non-ASCII characters, set the value of the
VariableNamingRule
parameter to
"preserve"
.
Example: ["Var3","Var7","Var4"]
Data Types: cell
| string
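For instance, a minimal sketch that restricts reading to two variables of the sample file outages.parquet (assumed to be on the MATLAB path) might look like this:

```matlab
% Create the datastore, then read only the Region and Loss variables.
pds = parquetDatastore("outages.parquet");
pds.SelectedVariableNames = ["Region","Loss"];

% Preview now returns a table that contains only the selected variables.
t = preview(pds);
selected = string(t.Properties.VariableNames);
```

Selecting only the variables you need reduces the amount of data returned by each read call.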
RowTimes
— Name of row times variable
variable name | variable index
Name of row times variable, specified as the name-value argument consisting of
"RowTimes"
and a variable name (such as
"Date"
) or a variable index (such as 3
).
RowTimes
is a timetable-related parameter. Each row of a timetable is
associated with a time, which is captured in a time vector for the timetable. The
variable specified in RowTimes
must contain a
datetime
or a duration
vector.
If the value of "OutputType"
is "timetable"
, but you do
not specify "RowTimes"
, then ParquetDatastore
uses the
first datetime
or duration
variable as the row
times for the timetable.
SupportedOutputFormats
— Formats supported for writing
string row vector
This property is read-only.
Formats supported for writing, returned as a row vector of strings. This property
specifies the possible output formats when using writeall
to write output files from the datastore.
DefaultOutputFormat
— Default output format
string scalar
This property is read-only.
Default output format, returned as a string scalar. This property specifies the default format
when using writeall
to write output files from the datastore.
Data Types: string
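The two read-only properties above can be inspected before calling writeall. The following sketch, which assumes the sample file outages.parquet and a temporary output folder, checks the supported formats and then writes the datastore out as CSV by using the "OutputFormat" argument of writeall. The output folder name is generated by tempname and is purely illustrative.

```matlab
% Inspect which formats writeall can produce for this datastore.
pds = parquetDatastore("outages.parquet");
formats = pds.SupportedOutputFormats;
defaultFormat = pds.DefaultOutputFormat;

% Write the datastore to a temporary folder as CSV instead of the default.
outDir = tempname;
mkdir(outDir);
writeall(pds,outDir,"OutputFormat","csv");

% List the CSV files that were written, searching subfolders as well.
written = dir(fullfile(outDir,"**","*.csv"));
```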
Object Functions
hasdata | Determine if data is available to read |
numpartitions | Number of datastore partitions |
partition | Partition a datastore |
preview | Preview subset of data in datastore |
read | Read data in datastore |
readall | Read all data in datastore |
writeall | Write datastore to files |
reset | Reset datastore to initial state |
transform | Transform datastore |
combine | Combine data from multiple datastores |
isPartitionable | Determine whether datastore is partitionable |
isSubsettable | Determine whether datastore is subsettable |
isShuffleable | Determine whether datastore is shuffleable |
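Several of these functions are commonly chained together. As one possible sketch, the code below uses transform to drop rows with a missing Loss value as the data is read, and then collects the result with readall. It assumes the sample file outages.parquet is on the MATLAB path. The anonymous function is an illustrative choice, not the only way to clean the data.

```matlab
% Create the datastore for the sample file.
pds = parquetDatastore("outages.parquet");

% Apply a function to every chunk as it is read: keep only the rows whose
% Loss value is not NaN.
tds = transform(pds,@(t) t(~isnan(t.Loss),:));

% Read all of the transformed data into memory.
cleaned = readall(tds);
```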
Examples
Create parquetDatastore Object
Create a parquetDatastore object using either a FileSet object or a file path.
Create a FileSet
object containing the file
outages.parquet
. Create a parquetDatastore
object.
fs = matlab.io.datastore.FileSet("outages.parquet");
pds = parquetDatastore(fs)
pds = 
  ParquetDatastore with properties:

                       Files: {
                              '...\matlab\toolbox\matlab\demos\outages.parquet'
                              }
                     Folders: {
                              '...\matlab\toolbox\matlab\demos'
                              }
               VariableNames: {1x6 cell}
       SelectedVariableNames: {1x6 cell}
                    ReadSize: 'rowgroup'
                  OutputType: 'table'
                    RowTimes: []
    AlternateFileSystemRoots: {}
      SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" ... ]
         DefaultOutputFormat: "parquet"
          VariableNamingRule: 'modify'
Alternatively, you can use a file path to create your
parquetDatastore
object.
pds = parquetDatastore("outages.parquet");
Specify Read Size for ParquetDatastore
Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize
values.
Create a datastore for outages.parquet
, set ReadSize
to 10
rows, and then read from the datastore. The value of ReadSize
determines how many rows of data are read from the datastore with each call to the read
function.
pds = parquetDatastore("outages.parquet","ReadSize",10);
read(pds)
ans=10×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 239.93 49434 17-Jul-2003 01:12:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
Set the ReadSize property value to "file" and read from the datastore. Each call to the read function now reads all the data in one file. Because this datastore contains a single file, one call returns the entire data set.
pds.ReadSize = "file";
data = read(pds)
data=1468×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 239.93 49434 17-Jul-2003 01:12:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
"SouthEast" 05-Sep-2004 17:48:00 73.387 36073 05-Sep-2004 20:46:00 "equipment fault"
"West" 21-May-2004 21:45:00 159.99 NaN 22-May-2004 04:23:00 "equipment fault"
"SouthEast" 01-Sep-2002 18:22:00 95.917 36759 01-Sep-2002 19:12:00 "severe storm"
"SouthEast" 27-Sep-2003 07:32:00 NaN 3.5517e+05 04-Oct-2003 07:02:00 "severe storm"
"West" 12-Nov-2003 06:12:00 254.09 9.2429e+05 17-Nov-2003 02:04:00 "winter storm"
"NorthEast" 18-Sep-2004 05:54:00 0 0 NaT "equipment fault"
⋮
You can also set the ReadSize property to "rowgroup". For more information, see the description of the ReadSize property of the ParquetDatastore object.
Return Timetable from Parquet Datastore
Use the OutputType
and RowTimes
name-value pairs to make ParquetDatastore
return timetables instead of tables.
Create a datastore for airlinesmall.parquet. Specify the "OutputType" name-value argument as "timetable".
pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable");
preview(pds)
ans=12500×26 timetable
Date DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
___________ _________ ____________________ ____________________ ____________________ ____________________ _____________ _________ _______ _________________ ______________ _______ ________ ________ ______ _____ ________ _______ _______ _________ ________________ ________ ____________ ____________ ________ _____________ _________________
21-Oct-1987 3 21-Oct-1987 06:42:00 21-Oct-1987 06:30:00 21-Oct-1987 07:35:00 21-Oct-1987 07:27:00 "PS" 1503 "NA" 3180 sec 3420 sec NaN sec 480 sec 720 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
26-Oct-1987 1 26-Oct-1987 10:21:00 26-Oct-1987 10:20:00 26-Oct-1987 11:24:00 26-Oct-1987 11:16:00 "PS" 1550 "NA" 3780 sec 3360 sec NaN sec 480 sec 60 sec "SJC" "BUR" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 5 23-Oct-1987 20:55:00 23-Oct-1987 20:35:00 23-Oct-1987 22:18:00 23-Oct-1987 21:57:00 "PS" 1589 "NA" 4980 sec 4920 sec NaN sec 1260 sec 1200 sec "SAN" "SMF" 480 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 5 23-Oct-1987 13:32:00 23-Oct-1987 13:20:00 23-Oct-1987 14:31:00 23-Oct-1987 14:18:00 "PS" 1655 "NA" 3540 sec 3480 sec NaN sec 780 sec 720 sec "BUR" "SJC" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 4 22-Oct-1987 06:29:00 22-Oct-1987 06:30:00 22-Oct-1987 07:46:00 22-Oct-1987 07:42:00 "PS" 1702 "NA" 4620 sec 4320 sec NaN sec 240 sec -60 sec "SMF" "LAX" 373 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
28-Oct-1987 3 28-Oct-1987 14:46:00 28-Oct-1987 13:43:00 28-Oct-1987 15:47:00 28-Oct-1987 14:48:00 "PS" 1729 "NA" 3660 sec 3900 sec NaN sec 3540 sec 3780 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
08-Oct-1987 4 08-Oct-1987 09:28:00 08-Oct-1987 09:30:00 08-Oct-1987 10:52:00 08-Oct-1987 10:49:00 "PS" 1763 "NA" 5040 sec 4740 sec NaN sec 180 sec -120 sec "SAN" "SFO" 447 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
10-Oct-1987 6 10-Oct-1987 08:59:00 10-Oct-1987 09:00:00 10-Oct-1987 11:34:00 10-Oct-1987 11:23:00 "PS" 1800 "NA" 9300 sec 8580 sec NaN sec 660 sec -60 sec "SEA" "LAX" 954 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
20-Oct-1987 2 20-Oct-1987 18:33:00 20-Oct-1987 18:30:00 20-Oct-1987 19:29:00 20-Oct-1987 19:26:00 "PS" 1831 "NA" 3360 sec 3360 sec NaN sec 180 sec 180 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
15-Oct-1987 4 15-Oct-1987 10:41:00 15-Oct-1987 10:40:00 15-Oct-1987 11:57:00 15-Oct-1987 11:55:00 "PS" 1864 "NA" 4560 sec 4500 sec NaN sec 120 sec 60 sec "SFO" "LAS" 414 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
15-Oct-1987 4 15-Oct-1987 16:08:00 15-Oct-1987 15:53:00 15-Oct-1987 16:56:00 15-Oct-1987 16:40:00 "PS" 1907 "NA" 2880 sec 2820 sec NaN sec 960 sec 900 sec "LAX" "FAT" 209 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
21-Oct-1987 3 21-Oct-1987 09:49:00 21-Oct-1987 09:40:00 21-Oct-1987 10:55:00 21-Oct-1987 10:52:00 "PS" 1939 "NA" 3960 sec 4320 sec NaN sec 180 sec 540 sec "LGB" "SFO" 354 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 4 22-Oct-1987 19:02:00 22-Oct-1987 18:47:00 22-Oct-1987 20:30:00 22-Oct-1987 19:51:00 "PS" 1973 "NA" 5280 sec 3840 sec NaN sec 2340 sec 900 sec "LAX" "OAK" 337 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
16-Oct-1987 5 16-Oct-1987 19:10:00 16-Oct-1987 18:38:00 16-Oct-1987 20:52:00 16-Oct-1987 19:55:00 "TW" 19 "NA" 9720 sec 8220 sec NaN sec 3420 sec 1920 sec "STL" "DEN" 770 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
02-Oct-1987 5 02-Oct-1987 11:30:00 02-Oct-1987 11:33:00 02-Oct-1987 12:37:00 02-Oct-1987 12:37:00 "TW" 59 "NA" 11220 sec 11040 sec NaN sec 0 sec -180 sec "STL" "PHX" 1262 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
30-Oct-1987 5 30-Oct-1987 14:00:00 30-Oct-1987 14:00:00 30-Oct-1987 19:20:00 30-Oct-1987 19:34:00 "TW" 102 "NA" 12000 sec 12840 sec NaN sec -840 sec 0 sec "SNA" "STL" 1570 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
⋮
When you do not also specify "RowTimes", parquetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times.
Specify the "RowTimes" option to use the arrival times (ArrTime) as the row times, instead of the flight dates.
pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable","RowTimes","ArrTime");
preview(pds)
ans=12500×26 timetable
ArrTime Date DayOfWeek DepTime CRSDepTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
____________________ ___________ _________ ____________________ ____________________ ____________________ _____________ _________ _______ _________________ ______________ _______ ________ ________ ______ _____ ________ _______ _______ _________ ________________ ________ ____________ ____________ ________ _____________ _________________
21-Oct-1987 07:35:00 21-Oct-1987 3 21-Oct-1987 06:42:00 21-Oct-1987 06:30:00 21-Oct-1987 07:27:00 "PS" 1503 "NA" 3180 sec 3420 sec NaN sec 480 sec 720 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
26-Oct-1987 11:24:00 26-Oct-1987 1 26-Oct-1987 10:21:00 26-Oct-1987 10:20:00 26-Oct-1987 11:16:00 "PS" 1550 "NA" 3780 sec 3360 sec NaN sec 480 sec 60 sec "SJC" "BUR" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 22:18:00 23-Oct-1987 5 23-Oct-1987 20:55:00 23-Oct-1987 20:35:00 23-Oct-1987 21:57:00 "PS" 1589 "NA" 4980 sec 4920 sec NaN sec 1260 sec 1200 sec "SAN" "SMF" 480 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 14:31:00 23-Oct-1987 5 23-Oct-1987 13:32:00 23-Oct-1987 13:20:00 23-Oct-1987 14:18:00 "PS" 1655 "NA" 3540 sec 3480 sec NaN sec 780 sec 720 sec "BUR" "SJC" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 07:46:00 22-Oct-1987 4 22-Oct-1987 06:29:00 22-Oct-1987 06:30:00 22-Oct-1987 07:42:00 "PS" 1702 "NA" 4620 sec 4320 sec NaN sec 240 sec -60 sec "SMF" "LAX" 373 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
28-Oct-1987 15:47:00 28-Oct-1987 3 28-Oct-1987 14:46:00 28-Oct-1987 13:43:00 28-Oct-1987 14:48:00 "PS" 1729 "NA" 3660 sec 3900 sec NaN sec 3540 sec 3780 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
08-Oct-1987 10:52:00 08-Oct-1987 4 08-Oct-1987 09:28:00 08-Oct-1987 09:30:00 08-Oct-1987 10:49:00 "PS" 1763 "NA" 5040 sec 4740 sec NaN sec 180 sec -120 sec "SAN" "SFO" 447 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
10-Oct-1987 11:34:00 10-Oct-1987 6 10-Oct-1987 08:59:00 10-Oct-1987 09:00:00 10-Oct-1987 11:23:00 "PS" 1800 "NA" 9300 sec 8580 sec NaN sec 660 sec -60 sec "SEA" "LAX" 954 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
20-Oct-1987 19:29:00 20-Oct-1987 2 20-Oct-1987 18:33:00 20-Oct-1987 18:30:00 20-Oct-1987 19:26:00 "PS" 1831 "NA" 3360 sec 3360 sec NaN sec 180 sec 180 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
15-Oct-1987 11:57:00 15-Oct-1987 4 15-Oct-1987 10:41:00 15-Oct-1987 10:40:00 15-Oct-1987 11:55:00 "PS" 1864 "NA" 4560 sec 4500 sec NaN sec 120 sec 60 sec "SFO" "LAS" 414 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
15-Oct-1987 16:56:00 15-Oct-1987 4 15-Oct-1987 16:08:00 15-Oct-1987 15:53:00 15-Oct-1987 16:40:00 "PS" 1907 "NA" 2880 sec 2820 sec NaN sec 960 sec 900 sec "LAX" "FAT" 209 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
21-Oct-1987 10:55:00 21-Oct-1987 3 21-Oct-1987 09:49:00 21-Oct-1987 09:40:00 21-Oct-1987 10:52:00 "PS" 1939 "NA" 3960 sec 4320 sec NaN sec 180 sec 540 sec "LGB" "SFO" 354 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 20:30:00 22-Oct-1987 4 22-Oct-1987 19:02:00 22-Oct-1987 18:47:00 22-Oct-1987 19:51:00 "PS" 1973 "NA" 5280 sec 3840 sec NaN sec 2340 sec 900 sec "LAX" "OAK" 337 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
16-Oct-1987 20:52:00 16-Oct-1987 5 16-Oct-1987 19:10:00 16-Oct-1987 18:38:00 16-Oct-1987 19:55:00 "TW" 19 "NA" 9720 sec 8220 sec NaN sec 3420 sec 1920 sec "STL" "DEN" 770 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
02-Oct-1987 12:37:00 02-Oct-1987 5 02-Oct-1987 11:30:00 02-Oct-1987 11:33:00 02-Oct-1987 12:37:00 "TW" 59 "NA" 11220 sec 11040 sec NaN sec 0 sec -180 sec "STL" "PHX" 1262 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
30-Oct-1987 19:20:00 30-Oct-1987 5 30-Oct-1987 14:00:00 30-Oct-1987 14:00:00 30-Oct-1987 19:34:00 "TW" 102 "NA" 12000 sec 12840 sec NaN sec -840 sec 0 sec "SNA" "STL" 1570 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
⋮
Conditionally Select Rows Using Row Filter
Conditionally select rows from a data set using the RowFilter
property.
Create a Parquet datastore using the outages.parquet
file. View the first 8 rows of the datastore.
pds = parquetDatastore("outages.parquet");
preview(pds)
ans=8×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
Create a row filter that identifies rows with a Region of "NorthEast" and a Cause of "winter storm". Then, set the RowFilter property of the datastore to the filter. Preview the datastore and note that it now returns only rows that meet the filter conditions.
rf = rowfilter(pds);
filter = rf.Region == "NorthEast" & rf.Cause == "winter storm";
pds.RowFilter = filter;
preview(pds)
ans=8×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ ______________
"NorthEast" 13-Nov-2004 10:42:00 NaN 1.4227e+05 19-Nov-2004 02:31:00 "winter storm"
"NorthEast" 26-Dec-2004 22:18:00 255.45 1.0444e+05 27-Dec-2004 14:11:00 "winter storm"
"NorthEast" 17-Dec-2003 15:11:00 NaN 66692 19-Dec-2003 07:22:00 "winter storm"
"NorthEast" 28-Jan-2005 18:20:00 401.39 89683 29-Jan-2005 02:36:00 "winter storm"
"NorthEast" 04-Feb-2005 00:53:00 32.061 46182 09-Feb-2005 02:42:00 "winter storm"
"NorthEast" 16-Nov-2006 10:04:00 147.25 1.2571e+05 17-Nov-2006 10:55:00 "winter storm"
"NorthEast" 03-Feb-2007 02:19:00 293.83 1.1628e+05 04-Feb-2007 21:24:00 "winter storm"
"NorthEast" 18-Feb-2008 05:24:00 353.29 64687 20-Feb-2008 08:56:00 "winter storm"
Limitations
If you use parquetread or parquetDatastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
Unlike parquetread, which replaces NULL values with doubles, parquetDatastore replaces NULL integer values with 0 and NULL boolean values with false. This replacement results in a lossy transformation.
Extended Capabilities
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool
or accelerate code with Parallel Computing Toolbox™ ThreadPool
.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
Version History
Introduced in R2019a
R2023b: Create ParquetDatastore more efficiently with partition control in parallel environments
In parallel environments, you can create a ParquetDatastore more efficiently by specifying the unit of partition and the size of partition blocks. Specify the PartitionMethod and BlockSize name-value arguments during creation of the datastore.
R2022b: Read Parquet files containing structured data
Read structured data from Parquet files as nested tables.
R2022b: Use function in thread-based environments
This function supports thread-based environments.
R2022a: Read Parquet file data more efficiently using rowfilter to conditionally filter rows
Conditionally filter and read data faster (Predicate Pushdown) from Parquet files when
using parquetread
and parquetDatastore
. You can create
conditions for filtering by using the rowfilter
function,
matlab.io.RowFilter
object, and RowFilter
name-value
argument.
R2022a: Specify FileSet objects as data locations
parquetDatastore
accepts FileSet
objects as the
locations of files to include in the datastore. FileSet
objects provide
increased performance compared to file paths or DsFileSet
objects.
R2021a: Use categorical data in Parquet data format
Use Parquet data that contains the categorical
data type.
R2019b: Write tabular data containing any characters
Use tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters.
See Also
mapreduce
| tall
| parquetread
| parquetinfo
| rowfilter