Main Content

Clean Messy and Missing Data in Tables

When a table is messy and has missing data values, you can clean the table in several ways. First, you can identify missing data by using the Import Tool or by using functions such as the summary and ismissing functions. After you identify missing values, you can standardize, fill, or remove missing values by using the standardizeMissing, fillmissing, or rmmissing functions. Then you can clean your table further by reorganizing it. You can use functions such as the sortrows or movevars functions to rearrange table rows and variables in the orders that suit you.

Examine Data in File

Examine the data in the sample comma-separated value (CSV) file, messy.csv. One way to examine the data is to use the Import Tool. It previews the data and enables you to specify how to import the data. The Import Tool shows that messy.csv has five columns. Some columns have text and the other columns have numeric values.

messy.png

The file contains many different missing data indicators:

  • Empty text

  • period (.)

  • NA

  • NaN

  • -99

The Import Tool automatically recognizes some missing data indicators, such as NaN in numeric columns and empty text in text columns.

The tool also highlights other indicators such as the empty text, period, and NA that occur in column B. These values are not standard missing values. But as nonnumeric values in a numeric column, it is likely that they are meant to represent missing values. When you import the data, you can specify that these values should also be treated as though they are NaNs.

The value of -99 could also be taken as indicating missing values. When numeric data otherwise consists of positive values, some people sometimes specify a single negative value as a flag for missing data. If a number such as -99 stands for missing data in your table, then you must specify that it is a missing value when you clean your table.

Import Data as Table

You can import the data into the MATLAB® workspace from the Import Tool. But you can also use the readtable function to read the data from a file and import it as a table.

Import the data in messy.csv using the readtable function. To read text data into table variables that are string arrays, use the "TextType" name-value argument. To treat specified nonnumeric values in numeric columns as missing values, use the "TreatAsMissing" name-value argument. In table variables B, D, and E, readtable imports empty text, ., and NA as NaN values. However, the values that are -99 remain unchanged because they are numeric.

Also, readtable treats A and C as text. So, it turns empty text in those columns into missing strings, which display as <missing>.

messyTable = readtable("messy.csv","TextType","string","TreatAsMissing",[".","NA"])
messyTable=21×5 table
      A        B          C         D       E  
    ______    ____    _________    ____    ____

    "afe1"     NaN    "yes"           3       3
    "egh3"     NaN    "no"            7       7
    "wth4"     -99    "yes"         -99     -99
    "atn2"    23.7    <missing>     NaN    23.7
    "arg1"     NaN    "yes"           5     NaN
    "jre3"    34.6    "yes"        34.6    34.6
    "wen9"     234    "yes"         234     234
    "ple2"       2    "no"            2       2
    "dbo8"       5    "no"            5       5
    "oii4"       5    "yes"           5       5
    "wnk3"     245    "yes"         245     245
    "abk6"     563    "no"          563     563
    "pnj5"     463    "no"          463     463
    "wnn3"       6    "no"            6       6
    "oks9"      23    "yes"          23      23
    "wba3"      14    "yes"          14      14
      ⋮

View Summary of Table

To view a summary of the table, use the summary function. For each table variable, it shows the data type and other descriptive statistics. For example, summary shows the number of missing values in each numeric variable of messyTable.

summary(messyTable)
Variables:

    A: 21x1 string

    B: 21x1 double

        Values:

            Min             -99   
            Median          22.5  
            Max             563   
            NumMissing      3     

    C: 21x1 string

    D: 21x1 double

        Values:

            Min             -99   
            Median          14    
            Max             563   
            NumMissing      2     

    E: 21x1 double

        Values:

            Min             -99   
            Median          21.5  
            Max             563   
            NumMissing      1     

Find Rows with Missing Values

To find the rows of messyTable that have at least one missing value, use the ismissing function. If you have nonstandard missing values in your data, such as -99, you can specify it along with the standard missing values.

The output of ismissing is a logical array that identifies the elements of messyTable that have missing values.

missingElements = ismissing(messyTable,{string(missing),NaN,-99})
missingElements = 21x5 logical array

   0   1   0   0   0
   0   1   0   0   0
   0   1   0   1   1
   0   0   1   1   0
   0   1   0   0   1
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
   0   0   0   0   0
      ⋮

To create a logical vector that identifies rows that have missing values, use the any function.

rowsWithMissingValues = any(missingElements,2)
rowsWithMissingValues = 21x1 logical array

   1
   1
   1
   1
   1
   0
   0
   0
   0
   0
      ⋮

To index into the table and return only the rows that have missing values, use the logical vector rowsWithMissingValues.

missingValuesTable = messyTable(rowsWithMissingValues,:)
missingValuesTable=6×5 table
      A        B          C         D      E  
    ______    ____    _________    ___    ____

    "afe1"     NaN    "yes"          3       3
    "egh3"     NaN    "no"           7       7
    "wth4"     -99    "yes"        -99     -99
    "atn2"    23.7    <missing>    NaN    23.7
    "arg1"     NaN    "yes"          5     NaN
    "gry5"      21    "yes"        NaN      21

Fill Missing Values

One strategy for cleaning the missing values in a table is to replace them with more meaningful values. You can replace nonstandard missing values by inserting standard missing values. And you can fill missing values with adjusted values. For example, you can fill missing values with their nearest neighbors, or with the mean value of a table variable.

In this example, -99 is a nonstandard value for indicating a missing value. To replace the instances of -99 with standard missing values, use the standardizeMissing function. NaN is the standard missing value for single- and double-precision floating-point numeric arrays.

messyTable = standardizeMissing(messyTable,-99)
messyTable=21×5 table
      A        B          C         D       E  
    ______    ____    _________    ____    ____

    "afe1"     NaN    "yes"           3       3
    "egh3"     NaN    "no"            7       7
    "wth4"     NaN    "yes"         NaN     NaN
    "atn2"    23.7    <missing>     NaN    23.7
    "arg1"     NaN    "yes"           5     NaN
    "jre3"    34.6    "yes"        34.6    34.6
    "wen9"     234    "yes"         234     234
    "ple2"       2    "no"            2       2
    "dbo8"       5    "no"            5       5
    "oii4"       5    "yes"           5       5
    "wnk3"     245    "yes"         245     245
    "abk6"     563    "no"          563     563
    "pnj5"     463    "no"          463     463
    "wnn3"       6    "no"            6       6
    "oks9"      23    "yes"          23      23
    "wba3"      14    "yes"          14      14
      ⋮

To fill missing values, use the fillmissing function. It provides many methods that fill missing values. For example, return a new table, filledTable, where you fill missing values with their nearest neighbors that are not missing values.

filledTable = fillmissing(messyTable,"nearest")
filledTable=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "afe1"    23.7    "yes"       3       3
    "egh3"    23.7    "no"        7       7
    "wth4"    23.7    "yes"       7    23.7
    "atn2"    23.7    "yes"       5    23.7
    "arg1"    34.6    "yes"       5    34.6
    "jre3"    34.6    "yes"    34.6    34.6
    "wen9"     234    "yes"     234     234
    "ple2"       2    "no"        2       2
    "dbo8"       5    "no"        5       5
    "oii4"       5    "yes"       5       5
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6
    "oks9"      23    "yes"      23      23
    "wba3"      14    "yes"      14      14
      ⋮

Remove Rows with Missing Values

Another strategy for cleaning the missing values in a table is to remove the rows that have them.

To remove rows that have missing values, use the rmmissing function.

remainingTable = rmmissing(messyTable)
remainingTable=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "jre3"    34.6    "yes"    34.6    34.6
    "wen9"     234    "yes"     234     234
    "ple2"       2    "no"        2       2
    "dbo8"       5    "no"        5       5
    "oii4"       5    "yes"       5       5
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6
    "oks9"      23    "yes"      23      23
    "wba3"      14    "yes"      14      14
    "pkn4"       2    "no"        2       2
    "adw3"      22    "no"       22      22
    "poj2"    34.6    "yes"    34.6    34.6
    "bas8"      23    "no"       23      23

Sort Table Rows

Once you have cleaned a table, you can organize it other ways. For example, you can sort the rows of a table by the values in one or more variables.

Sort the rows by the values in the first variable, A.

sortedTable = sortrows(remainingTable)
sortedTable=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "abk6"     563    "no"      563     563
    "adw3"      22    "no"       22      22
    "bas8"      23    "no"       23      23
    "dbo8"       5    "no"        5       5
    "jre3"    34.6    "yes"    34.6    34.6
    "oii4"       5    "yes"       5       5
    "oks9"      23    "yes"      23      23
    "pkn4"       2    "no"        2       2
    "ple2"       2    "no"        2       2
    "pnj5"     463    "no"      463     463
    "poj2"    34.6    "yes"    34.6    34.6
    "wba3"      14    "yes"      14      14
    "wen9"     234    "yes"     234     234
    "wnk3"     245    "yes"     245     245
    "wnn3"       6    "no"        6       6

Sort the rows in descending order by C, and then sort in ascending order by A.

sortedBy2Vars = sortrows(remainingTable,["C","A"],["descend","ascend"])
sortedBy2Vars=15×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    "jre3"    34.6    "yes"    34.6    34.6
    "oii4"       5    "yes"       5       5
    "oks9"      23    "yes"      23      23
    "poj2"    34.6    "yes"    34.6    34.6
    "wba3"      14    "yes"      14      14
    "wen9"     234    "yes"     234     234
    "wnk3"     245    "yes"     245     245
    "abk6"     563    "no"      563     563
    "adw3"      22    "no"       22      22
    "bas8"      23    "no"       23      23
    "dbo8"       5    "no"        5       5
    "pkn4"       2    "no"        2       2
    "ple2"       2    "no"        2       2
    "pnj5"     463    "no"      463     463
    "wnn3"       6    "no"        6       6

Sorting by C, the rows are grouped first by "yes", followed by "no". Then sorting by A, the rows are listed alphabetically.

To reorder the table so that A and C are next to each other, use the movevars function.

sortedRowsAndMovedVars = movevars(sortedBy2Vars,"C","After","A")
sortedRowsAndMovedVars=15×5 table
      A         C       B       D       E  
    ______    _____    ____    ____    ____

    "jre3"    "yes"    34.6    34.6    34.6
    "oii4"    "yes"       5       5       5
    "oks9"    "yes"      23      23      23
    "poj2"    "yes"    34.6    34.6    34.6
    "wba3"    "yes"      14      14      14
    "wen9"    "yes"     234     234     234
    "wnk3"    "yes"     245     245     245
    "abk6"    "no"      563     563     563
    "adw3"    "no"       22      22      22
    "bas8"    "no"       23      23      23
    "dbo8"    "no"        5       5       5
    "pkn4"    "no"        2       2       2
    "ple2"    "no"        2       2       2
    "pnj5"    "no"      463     463     463
    "wnn3"    "no"        6       6       6

See Also

| | | | | | | | |

Related Topics