Clean Messy and Missing Data in Tables

Open Live Script

This example shows how to clean and reorganize a table that has messy and missing data values. First, you can identify missing data by using the Import Tool or by using functions such as the summary and ismissing functions. You can standardize, fill, or remove missing values by using the standardizeMissing, fillmissing, or rmmissing functions. Then you can clean your table further by reorganizing it. You can rearrange table rows and variables by using functions such as the sortrows and movevars functions.

Examine Data in File

Examine the data in the sample comma-separated value (CSV) file, messy.csv, by using the Import Tool. The tool previews the data and enables you to specify how to import the data. To examine messy.csv in the Import Tool, after opening this example in MATLAB®, go to the Home tab, click Import Data in the Variable section, and open messy.csv using the file selection dialog box.

The Import Tool shows that messy.csv has five columns with text and numeric values.

Import Tool, showing the five columns of text and numeric data in the messy.csv file

The file contains many different missing data indicators:

Empty text
Period (.)
NA
NaN
-99

The Import Tool automatically recognizes (but does not visually highlight) some missing data indicators, such as NaN in numeric columns and empty text in text columns.

The tool highlights other indicators, such as the empty text, period, and NA, that occur in column B. These values are not standard missing values. However, nonnumeric values in a numeric column are likely to represent missing values. When you import the data, you can specify that these values should also be treated as missing values.

When numeric data otherwise consists of positive values, a single negative value, such as -99, can be a flag for missing data. If a number such as -99 represents missing data in your table, then you must specify that it is a missing value when you clean your table.

Import Data as Table

You can import data into the MATLAB workspace from the Import Tool. You can also use the readtable function to read the data from a file and import it as a table.

Import the data in messy.csv using the readtable function. To read text data into table variables that are string arrays, use the TextType name-value argument. To treat specified nonnumeric values in numeric columns as missing values, use the TreatAsMissing name-value argument. For columns A and C with text data, readtable imports any empty text as missing strings, which display as <missing>. For columns B, D, and E with numeric data, readtable imports empty text as NaN values, and also imports . and NA as NaN values when you specify them using TreatAsMissing. However, the values that are -99 remain unchanged because they are numeric.

messyTable = readtable("messy.csv","TextType","string","TreatAsMissing",[".","NA"])

messyTable=21×5 table
      A        B          C         D       E  
    ______    ____    _________    ____    ____

    "afe1"     NaN    "yes"           3       3
    "egh3"     NaN    "no"            7       7
    "wth4"     -99    "yes"         -99     -99
    "atn2"    23.7    <missing>     NaN    23.7
    "arg1"     NaN    "yes"           5     NaN
    "jre3"    34.6    "yes"        34.6    34.6
    "wen9"     234    "yes"         234     234
    "ple2"       2    "no"            2       2
    "dbo8"       5    "no"            5       5
    "oii4"       5    "yes"           5       5
    "wnk3"     245    "yes"         245     245
    "abk6"     563    "no"          563     563
    "pnj5"     463    "no"          463     463
    "wnn3"       6    "no"            6       6
    "oks9"      23    "yes"          23      23
    "wba3"      14    "yes"          14      14
      ⋮

View Summary of Table

To view a summary of the table, use the summary function. The summary shows the data type and other descriptive statistics for each table variable. For example, summary shows the number of missing values in each numeric variable of messyTable.

summary(messyTable)

Variables:

    A: 21x1 string

    B: 21x1 double

        Values:

            Min             -99   
            Median          22.5  
            Max             563   
            NumMissing      3     

    C: 21x1 string

    D: 21x1 double

        Values:

            Min             -99   
            Median          14    
            Max             563   
            NumMissing      2     

    E: 21x1 double

        Values:

            Min             -99   
            Median          21.5  
            Max             563   
            NumMissing      1

Find Rows with Missing Values

To find the rows of messyTable that have at least one missing value, use the ismissing function. If you have nonstandard missing values in your data, such as -99, you can specify it along with the standard missing values.

The output of ismissing is a logical array that identifies the elements of messyTable that have missing values.

missingElements = ismissing(messyTable,{string(missing),NaN,-99})

missingElements = 21x5 logical array

   0   1   0   0   0
   0   1   0   0   0
   0   1   0   1   1
   0   0   1   1   0
   0   1   0   0   1
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
      ⋮

To create a logical vector that identifies rows that have missing values, use the any function.

rowsWithMissingValues = any(missingElements,2)

rowsWithMissingValues = 21x1 logical array

   1
   1
   1
   1
   1
   0
   0
   0
   0
   0
      ⋮

To index into the table and return only the rows that have missing values, use the logical vector rowsWithMissingValues.

missingValuesTable = messyTable(rowsWithMissingValues,:)

missingValuesTable=6×5 table
      A        B          C         D      E  
    ______    ____    _________    ___    ____

    "afe1"     NaN    "yes"          3       3
    "egh3"     NaN    "no"           7       7
    "wth4"     -99    "yes"        -99     -99
    "atn2"    23.7    <missing>    NaN    23.7
    "arg1"     NaN    "yes"          5     NaN
    "gry5"      21    "yes"        NaN      21

Fill Missing Values

One strategy for cleaning the missing values in a table is to replace them with more meaningful values. You can replace nonstandard missing values by inserting standard missing values. Then you can fill missing values with adjusted values. For example, you can fill missing values with their nearest neighbors or with the mean value of a table variable.

In this example, -99 is a nonstandard value for indicating a missing value. To replace the instances of -99 with standard missing values, use the standardizeMissing function. NaN is the standard missing value for single- and double-precision floating-point numeric arrays.

messyTable = standardizeMissing(messyTable,-99)

messyTable=21×5 table
      A        B          C         D       E  
    ______    ____    _________    ____    ____

    "afe1"     NaN    "yes"           3       3
    "egh3"     NaN    "no"            7       7
    "wth4"     NaN    "yes"         NaN     NaN
    "atn2"    23.7    <missing>     NaN    23.7
    "arg1"     NaN    "yes"           5     NaN
    "jre3"    34.6    "yes"        34.6    34.6
    "wen9"     234    "yes"         234     234
    "ple2"       2    "no"            2       2
    "dbo8"       5    "no"            5       5
    "oii4"       5    "yes"           5       5
    "wnk3"     245    "yes"         245     245
    "abk6"     563    "no"          563     563
    "pnj5"     463    "no"          463     463
    "wnn3"       6    "no"            6       6
    "oks9"      23    "yes"          23      23
    "wba3"      14    "yes"          14      14
      ⋮

To fill missing values, use the fillmissing function. The function provides many methods that fill missing values. For example, fill missing values with their nearest neighbors that are not missing values.

filledTable = fillmissing(messyTable,"nearest")

filledTable=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "afe1"    23.7    "yes"       3       3
    "egh3"    23.7    "no"        7       7
    "wth4"    23.7    "yes"       7    23.7
    "atn2"    23.7    "yes"       5    23.7
    "arg1"    34.6    "yes"       5    34.6
    "jre3"    34.6    "yes"    34.6    34.6
    "wen9"     234    "yes"     234     234
    "ple2"       2    "no"        2       2
    "dbo8"       5    "no"        5       5
    "oii4"       5    "yes"       5       5
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6
    "oks9"      23    "yes"      23      23
    "wba3"      14    "yes"      14      14
      ⋮

Remove Rows with Missing Values

Another strategy for cleaning the missing values in a table is to remove the rows that have them.

To remove rows that have missing values, use the rmmissing function.

remainingTable = rmmissing(messyTable)

remainingTable=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "jre3"    34.6    "yes"    34.6    34.6
    "wen9"     234    "yes"     234     234
    "ple2"       2    "no"        2       2
    "dbo8"       5    "no"        5       5
    "oii4"       5    "yes"       5       5
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6
    "oks9"      23    "yes"      23      23
    "wba3"      14    "yes"      14      14
    "pkn4"       2    "no"        2       2
    "adw3"      22    "no"       22      22
    "poj2"    34.6    "yes"    34.6    34.6
    "bas8"      23    "no"       23      23

Sort and Rearrange Table Rows and Variables

Once you have cleaned the missing values in a table, you can organize it in other ways. For example, you can sort the rows of a table by the values in one or more variables.

Sort the rows by the values in the first variable, A.

sortedTable = sortrows(remainingTable)

sortedTable=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "abk6"     563    "no"      563     563
    "adw3"      22    "no"       22      22
    "bas8"      23    "no"       23      23
    "dbo8"       5    "no"        5       5
    "jre3"    34.6    "yes"    34.6    34.6
    "oii4"       5    "yes"       5       5
    "oks9"      23    "yes"      23      23
    "pkn4"       2    "no"        2       2
    "ple2"       2    "no"        2       2
    "pnj5"     463    "no"      463     463
    "poj2"    34.6    "yes"    34.6    34.6
    "wba3"      14    "yes"      14      14
    "wen9"     234    "yes"     234     234
    "wnk3"     245    "yes"     245     245
    "wnn3"       6    "no"        6       6

Sort the rows in descending order by C, and then sort in ascending order by A.

sortedBy2Vars = sortrows(remainingTable,["C","A"],["descend","ascend"])

sortedBy2Vars=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "jre3"    34.6    "yes"    34.6    34.6
    "oii4"       5    "yes"       5       5
    "oks9"      23    "yes"      23      23
    "poj2"    34.6    "yes"    34.6    34.6
    "wba3"      14    "yes"      14      14
    "wen9"     234    "yes"     234     234
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "adw3"      22    "no"       22      22
    "bas8"      23    "no"       23      23
    "dbo8"       5    "no"        5       5
    "pkn4"       2    "no"        2       2
    "ple2"       2    "no"        2       2
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6

Sorting by C, the rows are grouped first by "yes", followed by "no". Then sorting by A, the rows are listed alphabetically.

To reorder the table so that A and C are next to each other, use the movevars function.

sortedRowsAndMovedVars = movevars(sortedBy2Vars,"C","After","A")

sortedRowsAndMovedVars=15×5 table
      A         C       B       D       E  
    ______    _____    ____    ____    ____

    "jre3"    "yes"    34.6    34.6    34.6
    "oii4"    "yes"       5       5       5
    "oks9"    "yes"      23      23      23
    "poj2"    "yes"    34.6    34.6    34.6
    "wba3"    "yes"      14      14      14
    "wen9"    "yes"     234     234     234
    "wnk3"    "yes"     245     245     245
    "abk6"    "no"      563     563     563
    "adw3"    "no"       22      22      22
    "bas8"    "no"       23      23      23
    "dbo8"    "no"        5       5       5
    "pkn4"    "no"        2       2       2
    "ple2"    "no"        2       2       2
    "pnj5"    "no"      463     463     463
    "wnn3"    "no"        6       6       6

You can also reorder table variables by indexing. Use smooth parentheses and a variable index that specifies the order of the variables in the output table.

sortedRowsAndMovedVars = sortedBy2Vars(:,["A","C","B","D","E"])

sortedRowsAndMovedVars=15×5 table
      A         C       B       D       E  
    ______    _____    ____    ____    ____

    "jre3"    "yes"    34.6    34.6    34.6
    "oii4"    "yes"       5       5       5
    "oks9"    "yes"      23      23      23
    "poj2"    "yes"    34.6    34.6    34.6
    "wba3"    "yes"      14      14      14
    "wen9"    "yes"     234     234     234
    "wnk3"    "yes"     245     245     245
    "abk6"    "no"      563     563     563
    "adw3"    "no"       22      22      22
    "bas8"    "no"       23      23      23
    "dbo8"    "no"        5       5       5
    "pkn4"    "no"        2       2       2
    "ple2"    "no"        2       2       2
    "pnj5"    "no"      463     463     463
    "wnn3"    "no"        6       6       6