Bioinformatics Toolbox

Working with GEO Series Data

This example shows how to retrieve gene expression data series (GSE) from the NCBI Gene Expression Omnibus (GEO) and perform basic analysis on the expression profiles.

Introduction

The NCBI Gene Expression Omnibus (GEO) is the largest public repository of high-throughput microarray experimental data. GEO data have four entity types including GEO Platform (GPL), GEO Sample (GSM), GEO Series (GSE) and curated GEO DataSet (GDS).

A Platform record describes the list of elements on the array in the experiment (e.g., cDNAs, oligonucleotide probesets). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx).

A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx).

A Series record defines a group of related Samples and provides a focal point and description of the whole study. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique GEO accession number (GSExxx).

A DataSet record (GDSxxx) represents a curated collection of biologically and statistically comparable GEO Samples. GEO DataSets (GDSxxx) are curated sets of GEO Sample data.

More information about GEO can be found in GEO Overview. Bioinformatics Toolbox provides functions that can retrieve and parse GEO format data files. GSE, GSM, GSD and GPL data can be retrieved by using the getgeodata function. The getgeodata function can also save the retrieved data in a text file. GEO Series records are available in SOFT format files and in tab-delimited text format files. The function geoseriesread reads the GEO Series text format file. The geosoftread function reads the usually quite large SOFT format files.

In this example, you will retrieve the GSE5847 data set from GEO database, and perform statistical testing on the data. GEO Series GSE5847 contains experimental data from a gene expression study of tumor stroma and epithelium cells from 15 inflammatory breast cancer (IBC) cases and 35 non-inflammatory breast cancer cases (Boersma et al. 2008).

Retrieving GEO Series Data

The function getgeodata returns a structure containing data retrieved from the GEO database. You can also save the returned data in its original format to your local file system for use in subsequent MATLAB® sessions. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets.

gseData = getgeodata('GSE5847', 'ToFile', 'GSE5847.txt')

Use the geoseriesread function to parse the previously downloaded GSE text format file.

gseData = geoseriesread('GSE5847.txt')
gseData = 

    Header: [1x1 struct]
      Data: [22283x95 bioma.data.DataMatrix]

The structure returned contains a Header field that stores the metadata of the Series data, and a Data field that stores the Series matrix data.

Exploring GSE Data

The GSE5847 matrix data in the Data field are stored as a DataMatrix object. A DataMatrix object is a data structure like MATLAB 2-D array, but with additional metadata of row names and column names. The properties of a DataMatrix can be accessed like other MATLAB objects.

get(gseData.Data)
            Name: ''
        RowNames: {22283x1 cell}
        ColNames: {1x95 cell}
           NRows: 22283
           NCols: 95
           NDims: 2
    ElementClass: 'double'

The row names are the identifiers of the probe sets on the array; the column names are the GEO Sample accession numbers.

gseData.Data(1:5, 1:5)
ans = 

                 GSM136326    GSM136327    GSM136328    GSM136329    GSM136330
    1007_s_at     10.45       9.3995       9.4248       9.4729       9.2788   
    1053_at      5.7195       4.8493       4.7321       4.7289       5.3264   
    117_at       5.9387       6.0833        6.448       6.1769       6.5446   
    121_at       8.0231       7.8947        8.345       8.1632       8.2338   
    1255_g_at    3.9548       3.9632       3.9641       4.0878       3.9989   

The Series metadata are stored in the Header field. The structure contains Series information in the Header.Series field, and sample information in the Header.Sample field.

gseData.Header
ans = 

     Series: [1x1 struct]
    Samples: [1x1 struct]

The Series field contains the title of the experiment and the microarray GEO Platform ID.

gseData.Header.Series
ans = 

                         title: 'Tumor and stroma from breast by LCM'
                 geo_accession: 'GSE5847'
                        status: 'Public on Sep 30 2007'
               submission_date: 'Sep 15 2006'
              last_update_date: 'Nov 14 2012'
                     pubmed_id: '17999412'
                       summary: 'Tumor epithelium and surrounding stromal ...'
                overall_design: 'We applied LCM to obtain samples enriched...'
                          type: 'Expression profiling by array'
                   contributor: 'Stefan,,Ambs
Brenda,,Boersma
Mark,,Reimers'
                     sample_id: 'GSM136326 GSM136327 GSM136328 GSM136329 G...'
                  contact_name: 'Stefan,,Ambs'
            contact_laboratory: 'LHC'
             contact_institute: 'NCI'
               contact_address: '37 Convent Dr Bldg 37 Room 3050'
                  contact_city: 'Bethesda'
                 contact_state: 'MD'
    contact_zip0x2Fpostal_code: '20892'
               contact_country: 'USA'
            supplementary_file: 'ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/s...'
                   platform_id: 'GPL96'
                platform_taxid: '9606'
                  sample_taxid: '9606'
                      relation: 'BioProject: http://www.ncbi.nlm.nih.gov/b...'

gseData.Header.Samples
ans = 

                         title: {1x95 cell}
                 geo_accession: {1x95 cell}
                        status: {1x95 cell}
               submission_date: {1x95 cell}
              last_update_date: {1x95 cell}
                          type: {1x95 cell}
                 channel_count: {1x95 cell}
               source_name_ch1: {1x95 cell}
                  organism_ch1: {1x95 cell}
           characteristics_ch1: {2x95 cell}
                  molecule_ch1: {1x95 cell}
          extract_protocol_ch1: {1x95 cell}
                     label_ch1: {1x95 cell}
            label_protocol_ch1: {1x95 cell}
                     taxid_ch1: {1x95 cell}
                  hyb_protocol: {1x95 cell}
                 scan_protocol: {1x95 cell}
                   description: {1x95 cell}
               data_processing: {1x95 cell}
                   platform_id: {1x95 cell}
                  contact_name: {1x95 cell}
            contact_laboratory: {1x95 cell}
             contact_institute: {1x95 cell}
               contact_address: {1x95 cell}
                  contact_city: {1x95 cell}
                 contact_state: {1x95 cell}
    contact_zip0x2Fpostal_code: {1x95 cell}
               contact_country: {1x95 cell}
            supplementary_file: {1x95 cell}
                data_row_count: {1x95 cell}

The data_processing field contains the information of the preprocessing methods, in this case the Robust Multi-array Average (RMA) procedure.

gseData.Header.Samples.data_processing(1)
ans = 

    'RMA'

The source_name_ch1 field contains the sample source:

sampleSources = unique(gseData.Header.Samples.source_name_ch1);
sampleSources{:}
ans =

human breast cancer stroma


ans =

human breast cancer tumor epithelium

The field Header.Samples.characteristics_ch1 contains the characteristics of the samples.

gseData.Header.Samples.characteristics_ch1(:,1)
ans = 

    'IBC'
    'Deceased'

Determine the IBC and non-IBC labels for the samples from the Header.Samples.characteristics_ch1 field as group labels.

sampleGrp = gseData.Header.Samples.characteristics_ch1(1,:);

Retrieving GEO Platform (GPL) Data

The Series metadata told us the array platform id: GPL96, which is an Affymetrix® GeneChip® Human Genome U133 array set HG-U133A. Retrieve the GPL96 SOFT format file from GEO using the getgeodata function. For example, assume you used the getgeodata function to retrieve the GPL96 Platform file and saved it to a file, such as GPL96.txt. Use the geosoftread function to parse this SOFT format file.

gplData = geosoftread('GPL96.txt')
gplData = 

                 Scope: 'PLATFORM'
             Accession: 'GPL96'
                Header: [1x1 struct]
    ColumnDescriptions: {16x1 cell}
           ColumnNames: {16x1 cell}
                  Data: {22283x16 cell}

The ColumnNames field of the gplData structure contains names of the columns for the data:

gplData.ColumnNames
ans = 

    'ID'
    'GB_ACC'
    'SPOT_ID'
    'Species Scientific Name'
    'Annotation Date'
    'Sequence Type'
    'Sequence Source'
    'Target Description'
    'Representative Public ID'
    'Gene Title'
    'Gene Symbol'
    'ENTREZ_GENE_ID'
    'RefSeq Transcript ID'
    'Gene Ontology Biological Process'
    'Gene Ontology Cellular Component'
    'Gene Ontology Molecular Function'

You can get the probe set ids and gene symbols for the probe sets of platform GPL69.

gplProbesetIDs = gplData.Data(:, strcmp(gplData.ColumnNames, 'ID'));
geneSymbols = gplData.Data(:, strcmp(gplData.ColumnNames, 'Gene Symbol'));

Use gene symbols to label the genes in the DataMatrix object gseData.Data. Be aware that the probe set IDs from the platform file may not be in the same order as in gseData.Data. In this example they are in the same order.

Change the row names of the expression data to gene symbols.

gseData.Data = rownames(gseData.Data, ':', geneSymbols);

Display the first five rows and five columns of the expression data with row names as gene symbols.

gseData.Data(1:5, 1:5)
ans = 

              GSM136326    GSM136327    GSM136328    GSM136329    GSM136330
    DDR1       10.45       9.3995       9.4248       9.4729       9.2788   
    RFC2      5.7195       4.8493       4.7321       4.7289       5.3264   
    HSPA6     5.9387       6.0833        6.448       6.1769       6.5446   
    PAX8      8.0231       7.8947        8.345       8.1632       8.2338   
    GUCA1A    3.9548       3.9632       3.9641       4.0878       3.9989   

Analyzing the Data

Bioinformatics Toolbox and Statistics Toolbox™ offer a wide spectrum of analysis and visualization tools for microarray data analysis. However, because it is not our main goal to explain the analysis methods in this example, you will apply only a few of the functions to the gene expression profile from stromal cells. For more elaborate examples about the gene expression analysis and feature selection tools available, see Exploring Gene Expression Data and Selecting Features for Classifying High-dimensional DataSelecting Features for Classifying High-dimensional Data.

The experiment was performed on IBC and non-IBC samples derived from stromal cells and epithelial cells. In this example, you will work with the gene expression profile from stromal cells. Determine the sample indices for the stromal cell type from the gseData.Header.Samples.source_name_ch1 field.

stromaIdx = strcmpi(sampleSources{1}, gseData.Header.Samples.source_name_ch1);

Determine number of samples from stromal cells.

nStroma = sum(stromaIdx)
nStroma =

    47

stromaData = gseData.Data(:, stromaIdx);
stromaGrp = sampleGrp(stromaIdx);

Determine the number of IBC and non-IBC stromal cell samples.

nStromaIBC = sum(strcmp('IBC', stromaGrp))
nStromaIBC =

    13

nStromaNonIBC = sum(strcmp('non-IBC', stromaGrp))
nStromaNonIBC =

    34

You can also label the columns in stromaData with the group labels:

stromaData = colnames(stromaData, ':', stromaGrp);

Display the histogram of the normalized gene expression measurements of a specified gene. The x-axes represent the normalized expression level. For example, inspect the distribution of the gene expression values of these genes.

fID = 331:339;
zValues = zscore(stromaData.(':')(':'), 0, 2);
bw = 0.25;
edges = -10:bw:10;
bins = edges(1:end-1) + diff(edges)/2;
histStroma = histc(zValues(fID, :)', edges) ./ (stromaData.NCols*bw);
figure;
for i = 1:length(fID)
    subplot(3,3,i);
    bar(edges, histStroma(:,i), 'histc')
    xlim([-3 3])
    if i <= length(fID)-3
        ax = gca;
        ax.XTickLabel = [];
    end
    title(sprintf('gene%d - %s', fID(i), stromaData.RowNames{fID(i)}))
end
suptitle('Gene Expression Value Distributions')

The gene expression profile was accessed using the Affymetrix GeneChip more then 22,000 features on a small number of samples (~100). Among the 47 tumor stromal samples, there are 13 IBC and 34 non-IBC. But not all the genes are differentially expressed between IBC and non-IBC tumors. Statistical tests are needed to identify a gene expression signature that distinguish IBC from non-IBC stromal samples.

Use genevarfilter to filter out genes with a small variance across samples.

[mask, stromaData] = genevarfilter(stromaData);
stromaData.NRows
ans =

       20055

Apply a t-statistic on each gene and compare p-values for each gene to find significantly differentially expressed genes between IBC and non-IBC groups by permuting the samples (1,000 times for this example).

rng default
[pvalues, tscores] = mattest(stromaData(:, 'IBC'), stromaData(:, 'non-IBC'),...
                           'Showhist', true', 'showplot', true, 'permute', 1000);

Select the genes at a specified p-value.

sum(pvalues < 0.001)
ans =

    52

There are about 50 genes selected directly at p-values < 0.001.

Sort and list the top 20 genes:

testResults = [pvalues, tscores];
testResults = sortrows(testResults);
testResults(1:20, :)
ans = 

                        p-values      t-scores
    INPP5E              2.3337e-05     5.0389 
    ARFRP1 /// IGLJ3    2.7575e-05     4.9753 
    USP46               3.4346e-05    -4.9054 
    GOLGB1              4.7706e-05    -4.7928 
    TTC3                0.00010702    -4.5053 
    THUMPD1             0.00013185    -4.4317 
                        0.00016066     4.3656 
    MAGED2              0.00017047    -4.3444 
    DNAJB9              0.00017833    -4.3266 
    KIF1C               0.00022129     4.2504 
                        0.00022251    -4.2482 
    DZIP3               0.00022425    -4.2454 
    COPB1               0.00023203    -4.2332 
    PSD3                0.00024659    -4.2138 
    PLEKHA4             0.00026506      4.186 
    DNAJB9              0.00027687    -4.1708 
    CNPY2               0.00028039    -4.1672 
    USP9X               0.00028454    -4.1619 
    SEC22B              0.00030187    -4.1392 
    GFER                0.00030527    -4.1352 

References

[1] Boersma, B.J., Reimers, M., Yi, M., Ludwig, J.A., et al. "A stromal gene signature associated with inflammatory breast cancer", International Journal of Cancer, 122(6):1324-32, 2008.