Clean Messy and Missing Data in Tables
When a table is messy and has missing data values, you can clean the table in several ways. First, you can identify missing data by using the Import Tool or by using functions such as the summary
and ismissing
functions. After you identify missing values, you can standardize, fill, or remove missing values by using the standardizeMissing
, fillmissing
, or rmmissing
functions. Then you can clean your table further by reorganizing it. You can use functions such as the sortrows
or movevars
functions to rearrange table rows and variables in the orders that suit you.
Examine Data in File
Examine the data in the sample comma-separated value (CSV) file, messy.csv
. One way to examine the data is to use the Import Tool. It previews the data and enables you to specify how to import the data. The Import Tool shows that messy.csv
has five columns. Some columns have text and the other columns have numeric values.
The file contains many different missing data indicators:
Empty text
period (.)
NA
NaN
-99
The Import Tool automatically recognizes some missing data indicators, such as NaN
in numeric columns and empty text in text columns.
The tool also highlights other indicators such as the empty text, period, and NA
that occur in column B
. These values are not standard missing values. But as nonnumeric values in a numeric column, it is likely that they are meant to represent missing values. When you import the data, you can specify that these values should also be treated as though they are NaN
s.
The value of -99
could also be taken as indicating missing values. When numeric data otherwise consists of positive values, some people sometimes specify a single negative value as a flag for missing data. If a number such as -99
stands for missing data in your table, then you must specify that it is a missing value when you clean your table.
Import Data as Table
You can import the data into the MATLAB® workspace from the Import Tool. But you can also use the readtable
function to read the data from a file and import it as a table.
Import the data in messy.csv
using the readtable
function. To read text data into table variables that are string arrays, use the "TextType"
name-value argument. To treat specified nonnumeric values in numeric columns as missing values, use the "TreatAsMissing"
name-value argument. In table variables B
, D
, and E
, readtable
imports empty text, .
, and NA
as NaN
values. However, the values that are -99
remain unchanged because they are numeric.
Also, readtable
treats A
and C
as text. So, it turns empty text in those columns into missing strings, which display as <missing>
.
messyTable = readtable("messy.csv","TextType","string","TreatAsMissing",[".","NA"])
messyTable=21×5 table
A B C D E
______ ____ _________ ____ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" -99 "yes" -99 -99
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
View Summary of Table
To view a summary of the table, use the summary
function. For each table variable, it shows the data type and other descriptive statistics. For example, summary
shows the number of missing values in each numeric variable of messyTable
.
summary(messyTable)
Variables: A: 21x1 string B: 21x1 double Values: Min -99 Median 22.5 Max 563 NumMissing 3 C: 21x1 string D: 21x1 double Values: Min -99 Median 14 Max 563 NumMissing 2 E: 21x1 double Values: Min -99 Median 21.5 Max 563 NumMissing 1
Find Rows with Missing Values
To find the rows of messyTable
that have at least one missing value, use the ismissing
function. If you have nonstandard missing values in your data, such as -99
, you can specify it along with the standard missing values.
The output of ismissing
is a logical array that identifies the elements of messyTable
that have missing values.
missingElements = ismissing(messyTable,{string(missing),NaN,-99})
missingElements = 21x5 logical array
0 1 0 0 0
0 1 0 0 0
0 1 0 1 1
0 0 1 1 0
0 1 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
⋮
To create a logical vector that identifies rows that have missing values, use the any
function.
rowsWithMissingValues = any(missingElements,2)
rowsWithMissingValues = 21x1 logical array
1
1
1
1
1
0
0
0
0
0
⋮
To index into the table and return only the rows that have missing values, use the logical vector rowsWithMissingValues
.
missingValuesTable = messyTable(rowsWithMissingValues,:)
missingValuesTable=6×5 table
A B C D E
______ ____ _________ ___ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" -99 "yes" -99 -99
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"gry5" 21 "yes" NaN 21
Fill Missing Values
One strategy for cleaning the missing values in a table is to replace them with more meaningful values. You can replace nonstandard missing values by inserting standard missing values. And you can fill missing values with adjusted values. For example, you can fill missing values with their nearest neighbors, or with the mean value of a table variable.
In this example, -99
is a nonstandard value for indicating a missing value. To replace the instances of -99
with standard missing values, use the standardizeMissing
function. NaN
is the standard missing value for single- and double-precision floating-point numeric arrays.
messyTable = standardizeMissing(messyTable,-99)
messyTable=21×5 table
A B C D E
______ ____ _________ ____ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" NaN "yes" NaN NaN
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
To fill missing values, use the fillmissing
function. It provides many methods that fill missing values. For example, return a new table, filledTable
, where you fill missing values with their nearest neighbors that are not missing values.
filledTable = fillmissing(messyTable,"nearest")
filledTable=21×5 table
A B C D E
______ ____ _____ ____ ____
"afe1" 23.7 "yes" 3 3
"egh3" 23.7 "no" 7 7
"wth4" 23.7 "yes" 7 23.7
"atn2" 23.7 "yes" 5 23.7
"arg1" 34.6 "yes" 5 34.6
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
Remove Rows with Missing Values
Another strategy for cleaning the missing values in a table is to remove the rows that have them.
To remove rows that have missing values, use the rmmissing
function.
remainingTable = rmmissing(messyTable)
remainingTable=15×5 table
A B C D E
______ ____ _____ ____ ____
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
"pkn4" 2 "no" 2 2
"adw3" 22 "no" 22 22
"poj2" 34.6 "yes" 34.6 34.6
"bas8" 23 "no" 23 23
Sort Table Rows
Once you have cleaned a table, you can organize it other ways. For example, you can sort the rows of a table by the values in one or more variables.
Sort the rows by the values in the first variable, A
.
sortedTable = sortrows(remainingTable)
sortedTable=15×5 table
A B C D E
______ ____ _____ ____ ____
"abk6" 563 "no" 563 563
"adw3" 22 "no" 22 22
"bas8" 23 "no" 23 23
"dbo8" 5 "no" 5 5
"jre3" 34.6 "yes" 34.6 34.6
"oii4" 5 "yes" 5 5
"oks9" 23 "yes" 23 23
"pkn4" 2 "no" 2 2
"ple2" 2 "no" 2 2
"pnj5" 463 "no" 463 463
"poj2" 34.6 "yes" 34.6 34.6
"wba3" 14 "yes" 14 14
"wen9" 234 "yes" 234 234
"wnk3" 245 "yes" 245 245
"wnn3" 6 "no" 6 6
Sort the rows in descending order by C
, and then sort in ascending order by A
.
sortedBy2Vars = sortrows(remainingTable,["C","A"],["descend","ascend"])
sortedBy2Vars=15×5 table
A B C D E
______ ____ _____ ____ ____
"jre3" 34.6 "yes" 34.6 34.6
"oii4" 5 "yes" 5 5
"oks9" 23 "yes" 23 23
"poj2" 34.6 "yes" 34.6 34.6
"wba3" 14 "yes" 14 14
"wen9" 234 "yes" 234 234
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"adw3" 22 "no" 22 22
"bas8" 23 "no" 23 23
"dbo8" 5 "no" 5 5
"pkn4" 2 "no" 2 2
"ple2" 2 "no" 2 2
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
Sorting by C
, the rows are grouped first by "yes"
, followed by "no"
. Then sorting by A
, the rows are listed alphabetically.
To reorder the table so that A
and C
are next to each other, use the movevars
function.
sortedRowsAndMovedVars = movevars(sortedBy2Vars,"C","After","A")
sortedRowsAndMovedVars=15×5 table
A C B D E
______ _____ ____ ____ ____
"jre3" "yes" 34.6 34.6 34.6
"oii4" "yes" 5 5 5
"oks9" "yes" 23 23 23
"poj2" "yes" 34.6 34.6 34.6
"wba3" "yes" 14 14 14
"wen9" "yes" 234 234 234
"wnk3" "yes" 245 245 245
"abk6" "no" 563 563 563
"adw3" "no" 22 22 22
"bas8" "no" 23 23 23
"dbo8" "no" 5 5 5
"pkn4" "no" 2 2 2
"ple2" "no" 2 2 2
"pnj5" "no" 463 463 463
"wnn3" "no" 6 6 6
See Also
table
| ismissing
| standardizeMissing
| fillmissing
| rmmissing
| sortrows
| movevars
| Import Tool
| readtable
| summary