Jet Propulsion Laboratory
Mining Massive Earth Science Data Sets for Climate and Weather Forecast Models


Project Introduction

AIRS Level 3 Products On Demand

Try it yourself



Project Introduction

Mining Massive Earth Science Data Sets for Climate and Weather Forecast Models is a two-year project (FY 2005-2006) funded by NASA's Earth Science Technology Office (ESTO) to study how a data reduction technique theoretically grounded in data compression may be used to compare large volumes of climate-model-generated data and remote sensing observations. This technology creates small summary data sets of reduced volume and complexity, which can be used in place of the original data. A key property of these summaries is that they preserve the multivariate distributional features of the original data. Therefore, they can be aggregated in space and time, and used as a base product from which to derive custom data sets. Summaries of observational data can be used as input to models, for comparisons to summaries of model output, and for other exploratory studies intended to improve model physical parameterizations. We applied the methodology to both model-generated and observational data sets in a variety of settings and for a variety of purposes.

In particular, this page describes and demonstrates how reduced volume and complexity summary data sets are used as a base data product from which custom summary products can be derived. The idea is as follows. A statistically sound summary data set resides on a web-service-enabled computer. A user would like to obtain a gridded monthly data product based on a calculation performed on the summarized data. The user fills out a form describing the calculation, and the calculation is performed on the summary data. The results are returned to the user along with information about how much error is introduced by the fact that the summary, rather than the original data, was used. The web-service functionality is provided by the Genesis-II project at JPL, one of NASA's REASoN (Research, Education and Applications Solutions Network) projects.

This is a collaborative effort to infuse the ESTO funded technology into the larger Genesis-II system which provides a public interface. The core technology that enables remote manipulation of remote data sets is SciFlo, developed as part of Genesis-II. The statistical research was supported by ESTO, and also by the Atmospheric Infrared Sounder (AIRS) and Multi-angle Imaging SpectroRadiometer (MISR) projects at JPL. The latter two also provided computational support without which this work would not have been possible. Our first examples will therefore use AIRS and MISR data.

The AIRS project is in the process of implementing this technology to produce the AIRS Level 3 Quantization Product (L3Q) as a standard data product. Files containing monthly and 5-day summaries of 35 AIRS parameters will be available through the Goddard DAAC when the product becomes public. Here we use a preliminary version of L3Q for the month of August 2003 to illustrate how SciFlo enables remote calculation using these data. MISR will be incorporated when suitable L3Q data products exist, funding permitting.

The next section describes the AIRS L3Q data product and how it is used to produce custom, user-defined data products. Finally, "Try it yourself" is the demo page.



AIRS Level 3 Products On Demand

AIRS Level 3 Products on Demand are estimates of various quantities computed from the AIRS Level 3 Quantization Product. The "true" values of such quantities are those values that would be computed directly from AIRS Level 2 data sets. However, the difficulties inherent in acquiring and manipulating large numbers of HDF files often make this impractical. In addition, the computation times for these calculations and computing resources required can be prohibitive.

The AIRS Level 3 Quantization Product

The AIRS Level 3 Quantization Product (L3Q) provides distributional summaries of Level 2 data on monthly, five-degree grids. For each grid cell, the distributional summary is a varying number of representative vectors, their associated counts and errors. Representatives "stand in" for a collection of AIRS Level 2 data vectors called a cluster. A cluster representative is the centroid (or mean vector) of the corresponding Level 2 data vectors.

Conceptually, a Level 2 data vector is a vector formed by appending all AIRS Level 2 measurement values for the same time and location. For instance, the measurements shown in the table below for the same AIRS footprint become one Level 2 data vector, which we will denote $\mathbf{x}$. (If any of the variable values are missing, the footprint is ignored.)

Variables Summarized by the AIRS Level 3 Quantization Product

Variable index    Variable name        Description
0                 tair at 150 mb       atmospheric temperature at 150 mb
1                 tair at 200 mb       atmospheric temperature at 200 mb
2                 tair at 250 mb       atmospheric temperature at 250 mb
3                 tair at 300 mb       atmospheric temperature at 300 mb
4                 tair at 400 mb       atmospheric temperature at 400 mb
5                 tair at 500 mb       atmospheric temperature at 500 mb
6                 tair at 600 mb       atmospheric temperature at 600 mb
7                 tair at 700 mb       atmospheric temperature at 700 mb
8                 tair at 850 mb       atmospheric temperature at 850 mb
9                 tair at 925 mb       atmospheric temperature at 925 mb
10                tair at 1000 mb      atmospheric temperature at 1000 mb
11                h2ommr at 150 mb     atmospheric water vapor at 150 mb
12                h2ommr at 200 mb     atmospheric water vapor at 200 mb
13                h2ommr at 250 mb     atmospheric water vapor at 250 mb
14                h2ommr at 300 mb     atmospheric water vapor at 300 mb
15                h2ommr at 400 mb     atmospheric water vapor at 400 mb
16                h2ommr at 500 mb     atmospheric water vapor at 500 mb
17                h2ommr at 600 mb     atmospheric water vapor at 600 mb
18                h2ommr at 700 mb     atmospheric water vapor at 700 mb
19                h2ommr at 850 mb     atmospheric water vapor at 850 mb
20                h2ommr at 925 mb     atmospheric water vapor at 925 mb
21                h2ommr at 1000 mb    atmospheric water vapor at 1000 mb
22                cldfrc at 200 mb     cloud fraction at 200 mb
23                cldfrc at 250 mb     cloud fraction at 250 mb
24                cldfrc at 300 mb     cloud fraction at 300 mb
25                cldfrc at 400 mb     cloud fraction at 400 mb
26                cldfrc at 500 mb     cloud fraction at 500 mb
27                cldfrc at 600 mb     cloud fraction at 600 mb
28                cldfrc at 700 mb     cloud fraction at 700 mb
29                cldfrc at 850 mb     cloud fraction at 850 mb
30                cldfrc at 925 mb     cloud fraction at 925 mb
31                cldfrc at 1000 mb    cloud fraction at 1000 mb
32                landfrc              land fraction of scene
33                rettype              fraction of good quality observations
34                scannodetype         fraction of daytime observations


The summarization algorithm treats the collection of all Level 2 data vectors belonging to the same five-degree grid cell in the same month as a collection of points in 35-dimensional space, and finds centers of mass in this high-dimensional space. The centers of mass become the cluster representatives, which have the same fields shown in the table. The cluster counts are the numbers of AIRS Level 2 data vectors assigned to the clusters. The cluster errors are the average squared Euclidean distances between the representatives and their cluster members. The graphic below depicts the situation in the case where there are only two variables, not 35.



In the graphic on the left, the left panel is a schematic of AIRS Level 2 data for two variables in one grid cell. The upper panel shows a scatterplot in which each data point has equal mass, and the lower panel shows the corresponding data table. The right panel shows the corresponding summarized data. Here cluster representatives have differing masses corresponding to the cluster counts, $N_k$. $K$ is the number of clusters, and is smaller than the number of Level 2 data vectors, $N$. The graphic on the far right zooms in on one cluster representative, and shows several Level 2 data vectors assigned to that cluster. The distances (here in two dimensions and shown in gold) between the representative and its members form the basis for calculating the cluster error. The error is the average of the squared distances.
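The three quantities stored per cluster can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_clusters`, the sample values, and the cluster assignments are all hypothetical, and the step that decides which vector belongs to which cluster (the actual L3Q algorithm) is not shown.

```python
from collections import defaultdict

def summarize_clusters(vectors, labels):
    """Return {label: (representative, count, error)} for pre-assigned clusters.

    representative: centroid (mean vector) of the cluster's members
    count:          number of members
    error:          average squared Euclidean distance from members to centroid
    """
    members = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        members[lab].append(vec)

    summary = {}
    for lab, vecs in members.items():
        count = len(vecs)
        dim = len(vecs[0])
        centroid = [sum(v[d] for v in vecs) / count for d in range(dim)]
        error = sum(
            sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vecs
        ) / count
        summary[lab] = (centroid, count, error)
    return summary

# Two variables per vector, as in the two-dimensional schematic.
# The values and the assignments below are made up for illustration.
vectors = [(1.0, 2.0), (3.0, 2.0), (10.0, 10.0), (10.0, 12.0), (10.0, 14.0)]
labels = [0, 0, 1, 1, 1]
summary = summarize_clusters(vectors, labels)
```

Here five data vectors are reduced to two representatives, each carrying a count (its "mass") and an error.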

L3Q may be best understood as an extension of traditional Level 3 monthly data products. Traditional Level 3 products normally provide the monthly average, standard deviation and count by grid cell for a set of variables. If L3Q were constrained to provide exactly one representative per grid cell per month, it would have the same information except for the fact that the individual variable counts and standard deviations would all be rolled into single values for count and error. By allowing multiple representative vectors per grid cell, L3Q achieves two improvements over traditional Level 3 products. First, it approximates the Level 2 data distribution. The mean and standard deviation only fully specify a data distribution if that distribution is Gaussian, which in general can't be guaranteed. Second, by treating AIRS data as vectors, it approximately preserves the joint relationships among variables. Joint relationships among pairs of variables are only captured if covariances or correlations are explicitly provided in traditional Level 3 products, and even if they are, these statistics are measures of linear association only. We have every reason to suspect that important relationships in the AIRS data may be non-linear.

The larger the number of representatives used for a grid cell, the better the summary distribution approximates the original Level 2 multivariate data distribution. In the extreme, if the number of clusters equals the number of raw data vectors, then every cluster contains just a single AIRS Level 2 data vector, which is the cluster representative. The count in each cluster is just one, and the errors are zero. This provides a perfect representation of the original data, but no data reduction. At the other extreme, we could allow just one cluster per grid cell as described above. In that case, data reduction is maximized, but so is error. The algorithm that creates L3Q seeks a good compromise between these two extremes. The algorithm uses information-theoretic tests to determine whether additional error incurred as a result of increased data reduction is warranted. The number of clusters (and the assignment of Level 2 data vectors to them) is modulated to reflect the information-theoretic complexity of the data being summarized. For example, in a grid cell where almost all the Level 2 data vectors are nearly the same, only one cluster would be required to preserve most of the information with little error. However, in a very heterogeneous grid cell, a larger number of clusters would be required to represent the multivariate data distribution with a similar, low error.
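The two extremes can be checked numerically. The sketch below uses one-dimensional made-up values and three hand-picked partitions; the function name `average_squared_error` is hypothetical, and the information-theoretic tests that L3Q uses to choose the number of clusters are not implemented here.

```python
def average_squared_error(values, labels):
    """Average squared distance from each value to the mean of its cluster."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    total = 0.0
    for members in groups.values():
        center = sum(members) / len(members)
        total += sum((m - center) ** 2 for m in members)
    return total / len(values)

values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]  # made-up data, two obvious groups

one_cluster = [0, 0, 0, 0, 0, 0]        # maximum reduction, maximum error
two_clusters = [0, 0, 0, 1, 1, 1]       # a compromise
n_clusters = [0, 1, 2, 3, 4, 5]         # no reduction, zero error

err_one = average_squared_error(values, one_cluster)
err_two = average_squared_error(values, two_clusters)
err_all = average_squared_error(values, n_clusters)
```

For these values, two clusters remove most of the error that a single cluster incurs, while going from two clusters to six buys comparatively little.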

By better preserving the multivariate distribution, L3Q can more reliably be used as the basis for calculations than can traditional Level 3 products. By quantifying and reporting cluster errors, we also maintain an ability to quantify the likely errors in calculations based on L3Q relative to what would have been obtained had those same calculations been performed on Level 2 data vectors. This is the basis for AIRS Level 3 Products on Demand.

Functions versus Distribution Parameters

Suppose our goal is to estimate the behavior of a scalar-valued function of AIRS Level 2 data vectors: $f(\mathbf{x})$. If we could apply $f$ to each AIRS Level 2 data vector $\mathbf{x}_n$, we could examine the distribution of $f(\mathbf{x})$ by making a histogram. If we wanted to select one or more summary parameters, $\theta$, to describe the distribution, we would have many to choose from: the mean, the median, quantiles, etc. Hence, we distinguish between functional transformations of the data, $f$, and parameters which describe aspects of their distributions, $\theta$.

The figure shows what happens if we begin with a discrete distribution of values for which the histogram is shown in green, and apply a non-linear function to those values. Here the non-linear function is just the square function, and the distribution of the transformed values is shown by the blue histogram.

Summary parameters like the mean, median, and variance are quantities that describe some aspect of a distribution. Ideally, we would like to examine and understand the whole distribution because it contains more information than single summary values. In practice one usually examines parameters instead because they are easier to work with, but the choice of which parameter to use depends on one's science objective. Applying transformations allows scientists to better understand their data by changing scale, or by using the data to make predictions, for example. With L3Q, the choice of both transformation and parameter is up to the user instead of being hard-wired into a static data product.
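The distinction between a transformation and a parameter can be made concrete with a short Python sketch. The sample values are made up; the square function is the transformation used in the figure described above, and the mean and median stand in for the parameter.

```python
import statistics

def square(x):
    """The transformation f: applied to every data value."""
    return x * x

values = [-2, -1, -1, 0, 0, 0, 1, 1, 2, 3]   # made-up discrete sample
transformed = [square(x) for x in values]    # distribution of f(x)

# Parameters: single numbers describing a distribution.
mean_x = statistics.mean(values)
mean_fx = statistics.mean(transformed)
median_fx = statistics.median(transformed)

# For a non-linear f, the parameter of the transformed data is not the
# transformation of the parameter: mean(f(x)) differs from f(mean(x)).
gap = mean_fx - square(mean_x)
```

Because the square function is non-linear, the mean of the squared values cannot be obtained by squaring the mean, which is why the transformation has to be applied before the parameter is computed.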



Estimating Distributions of Functional Transformations

L3Q provides us with a set of pseudo-values of $\mathbf{x}$ for each grid cell. Call these pseudo-values $\tilde{\mathbf{x}}_{kj}$, where $k$ indexes cluster and $j$ indexes data point within cluster. Within each cluster all the pseudo-values are the same: they are equal to the cluster's representative vector, $\tilde{\mathbf{x}}_k$, and there are $N_k$ of them. ($N_k$ is the count for the $k$th cluster.) The distribution of $f(\tilde{\mathbf{x}})$ will only be an approximation of the distribution of $f(\mathbf{x})$. The quality of the approximation depends on how close the $\tilde{\mathbf{x}}_{kj}$'s are to the original Level 2 $\mathbf{x}_{kj}$'s, and on how the function $f$ treats those discrepancies.

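The cluster error, $\delta_k$, is a measure of how close the original Level 2 AIRS data vectors are to their pseudo-values:

$$\delta_k = \frac{1}{N_k} \sum_{j=1}^{N_k} \left\| \mathbf{x}_{kj} - \tilde{\mathbf{x}}_{kj} \right\|^2 .$$


An overall measure of quality is the average cluster error for the entire distributional summary, also called the distortion:

$$\delta = \frac{1}{N} \sum_{k=1}^{K} N_k \, \delta_k , \qquad N = \sum_{k=1}^{K} N_k .$$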
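As a small worked example, the distortion can be computed from the counts and cluster errors alone. The sketch below reads "average cluster error" as a count-weighted average, so that the result equals the average squared distance over all Level 2 data vectors in the grid cell; that reading, the function name `distortion`, and the numbers are assumptions made for illustration.

```python
def distortion(counts, errors):
    """Count-weighted average of the cluster errors for one grid cell."""
    total = sum(counts)
    return sum(n * e for n, e in zip(counts, errors)) / total

# Two hypothetical clusters: 2 members with error 1.0, 3 members with error 8/3.
cell_distortion = distortion([2, 3], [1.0, 8.0 / 3.0])
```

With these numbers the sum of squared distances is 2 + 8 = 10 over 5 vectors, giving a distortion of 2.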


For an arbitrary function $f$ there is no easy way to characterize the difference between the distributions of $f(\mathbf{x})$ and $f(\tilde{\mathbf{x}})$ based solely on distortion or on cluster error. However, it is possible to characterize the differences between certain parameters of the distributions of $f(\mathbf{x})$ and $f(\tilde{\mathbf{x}})$. This provides a basis for inference about $f(\mathbf{x})$ from $f(\tilde{\mathbf{x}})$.

Estimating Distribution Parameters

The graphic below depicts our strategy for estimating arbitrary parameters of distributions of arbitrary functions. The cluster errors ($\delta_k$) provide upper bounds for the variances of all individual variables since the error is the trace of the within-cluster covariance matrix.
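The upper-bound statement follows in one line from the definition of the cluster error as an average squared Euclidean distance. The notation below is assumed for illustration: $x_{kjd}$ is variable $d$ of member $j$ of cluster $k$, $\tilde{x}_{kd}$ is variable $d$ of the cluster representative, $\sigma_{kd}^2$ is the within-cluster variance of variable $d$, $\Sigma_k$ is the within-cluster covariance matrix, and $D = 35$ is the number of variables.

```latex
\delta_k
  = \frac{1}{N_k} \sum_{j=1}^{N_k} \sum_{d=1}^{D} \left( x_{kjd} - \tilde{x}_{kd} \right)^2
  = \sum_{d=1}^{D} \sigma_{kd}^2
  = \operatorname{tr}(\Sigma_k)
  \;\ge\; \sigma_{kd}^2
  \quad \text{for every } d .
```

The inequality holds because every variance in the sum is non-negative, so the sum can never be smaller than any one of its terms.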

The left panel depicts the Level 2 distribution conceptually as a set of equally likely individual data points, $\mathbf{x}_n$, shown as the bottom histogram. Above them are bars of unequal height depicting cluster representatives for the four clusters into which the $\mathbf{x}_n$'s have been grouped: $\tilde{\mathbf{x}}_1$, $\tilde{\mathbf{x}}_2$, $\tilde{\mathbf{x}}_3$, and $\tilde{\mathbf{x}}_4$. The $\delta_k$'s measure within-cluster dispersions.

We model the distribution of the $\mathbf{x}_n$'s as a probabilistic mixture of Gaussian (Normal) distributions. With a Gaussian mixture, one draws a value of $\mathbf{x}$ at random by first choosing a cluster at random, with probabilities $N_k/N$, then generating a value from a multivariate Gaussian distribution with mean vector $\tilde{\mathbf{x}}_k$ and covariance matrix $\delta_k \mathbf{I}$. (This approximation of the covariance matrix ignores the effects of covariances. On the other hand, since $\delta_k$ is the trace of the true within-cluster covariance matrix, $\delta_k$ overestimates the true within-cluster variances.) The Law of Large Numbers guarantees that as the number of draws gets large, the proportion with which one encounters cluster $k$ converges to $N_k/N$. However, since we already know the values of the $N_k$'s, we can shortcut this process by fixing the proportion of draws from each cluster appropriately. For example, if the total number of draws is $M$, we draw $M N_k / N$ (or the nearest integer thereto) from cluster $k$. The remaining question is how large $M$ must be. Ideally, $M = N$, but that may be computationally prohibitive. We are still investigating this issue, but in the meantime we are using a smaller value of $M$.
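A minimal Python sketch of the fixed-proportion draw follows. It assumes that each cluster's covariance matrix is the cluster error times the identity (so every coordinate gets standard deviation equal to the square root of the error); the function name `draw_from_summary` and all numbers are hypothetical, and this is not the project's own implementation.

```python
import random

def draw_from_summary(representatives, counts, errors, total_draws, rng):
    """Draw a synthetic data set from the Gaussian mixture implied by a summary.

    Cluster k contributes round(total_draws * N_k / N) draws, so the realized
    total can differ slightly from total_draws because of rounding.
    Each draw has independent Gaussian coordinates centered on the cluster
    representative, with variance equal to the cluster error (an assumption).
    """
    n_total = sum(counts)
    sample = []
    for rep, n_k, err in zip(representatives, counts, errors):
        n_draws = round(total_draws * n_k / n_total)
        sigma = err ** 0.5
        for _ in range(n_draws):
            sample.append([rng.gauss(mu, sigma) for mu in rep])
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = draw_from_summary(
    representatives=[[0.0, 0.0], [10.0, 10.0]],
    counts=[30, 70],
    errors=[1.0, 4.0],
    total_draws=100,
    rng=rng,
)
```

Fixing the number of draws per cluster removes the randomness in cluster choice, leaving only the within-cluster Gaussian noise.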

The middle panel of the figure above depicts the process of resampling from the Gaussian mixture model. There are $B$ trials in this simulation. For each trial, we draw a synthetic data set as described in the previous paragraph, apply transformation $f$ to its members and then compute the distributional parameter, $\hat{\theta}_b$. This distribution of the $B$ values of $\hat{\theta}_b$ is the synthetic sampling distribution upon which we base an inference about the true value of $\theta$.

The right panel shows the simulated sampling distribution and the parameter values we report for inferential purposes. We report the quantiles at 2.5, 50, and 97.5 percent. This constitutes a point estimate (median), and a 95% confidence interval: $[\hat{\theta}_{0.025}, \hat{\theta}_{0.975}]$.
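The whole procedure (repeated synthetic draws, transformation, parameter, then quantiles) can be sketched as below. The function names, the default numbers of draws and trials, and the example summary are assumptions for illustration, and the covariance of each cluster is again taken to be the cluster error times the identity.

```python
import random
import statistics

def draw_synthetic(representatives, counts, errors, total_draws, rng):
    """One synthetic data set: cluster k gets about total_draws * N_k / N draws."""
    n_total = sum(counts)
    sample = []
    for rep, n_k, err in zip(representatives, counts, errors):
        sigma = err ** 0.5  # assumed covariance: err * identity
        for _ in range(round(total_draws * n_k / n_total)):
            sample.append([rng.gauss(mu, sigma) for mu in rep])
    return sample

def simulate_parameter(representatives, counts, errors, f, theta,
                       total_draws=200, n_trials=200, seed=0):
    """Return the 2.5%, 50% and 97.5% quantiles of theta over n_trials trials."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = draw_synthetic(representatives, counts, errors, total_draws, rng)
        estimates.append(theta([f(x) for x in sample]))
    cuts = statistics.quantiles(estimates, n=40)  # 39 cut points, 2.5% apart
    return cuts[0], cuts[19], cuts[38]

# Hypothetical one-variable summary with two equally populated clusters.
low, median, high = simulate_parameter(
    representatives=[[0.0], [10.0]],
    counts=[50, 50],
    errors=[1.0, 1.0],
    f=lambda x: x[0],          # transformation: pick the only variable
    theta=statistics.mean,     # parameter: the mean
)
```

The median serves as the point estimate and the outer two quantiles as the reported 95% interval; any other transformation or parameter can be passed in through `f` and `theta`.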



Try it yourself

Clicking on "Try it yourself" will take you to the Genesis-II page for this project. There you will find a listing of SciFlos. A SciFlo is an XML document that invokes an operation on a remote computer. This may be accomplished by triggering a web service (a program installed on that remote computer), or by sending a bundle of code from the local machine to the remote machine. Usually the remote machine is the one on which the data reside, since it is easier to move code than data. Clicking on "xml" shows you the XML document. Clicking on "execute" presents you with a form based on the inputs specified in the XML document. This is how the system passes arguments over the web to the remote functions. Pushing "submit" initiates the calculation; when it finishes, the output is either displayed or can be found at a location specified by a returned URL. The system also provides a graphical visualization to monitor progress.

We currently have three SciFlos: AIRSGetMeanMap.sf.xml, AIRSGetDistortionMap.sf.xml, and AIRSCorrMap.sf.xml. The first generates a map of the mean value (over clusters) of the field entered in the variableName box. The second generates a map of the mean value (over clusters) of cluster error. These SciFlos are very fast because there is no uncertainty associated with their results, so no simulation is needed. For instance, the true mean value of a variable in a grid cell can be exactly reproduced by taking the count-weighted average of the cluster mean values of the variable.
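The exactness of the mean is easy to verify on a toy example. The sketch below uses made-up values for one variable in two hypothetical clusters; `weighted_mean` is an illustrative helper, not part of any SciFlo.

```python
def weighted_mean(cluster_means, counts):
    """Count-weighted average of per-cluster mean values."""
    return sum(n * m for m, n in zip(cluster_means, counts)) / sum(counts)

# Made-up Level 2 values of one variable, already grouped into two clusters.
cluster_a = [1.0, 2.0, 3.0]
cluster_b = [10.0, 14.0]

cluster_means = [sum(cluster_a) / len(cluster_a), sum(cluster_b) / len(cluster_b)]
counts = [len(cluster_a), len(cluster_b)]

mean_from_summary = weighted_mean(cluster_means, counts)
mean_from_original = sum(cluster_a + cluster_b) / (len(cluster_a) + len(cluster_b))
```

Both routes give the same number because the mean is linear: weighting each cluster mean by its count just regroups the original sum.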

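Correlation, on the other hand, is a non-linear function of the variables:

$$\rho_{uv} = \frac{\sigma_{uv}}{\sigma_u \, \sigma_v},$$

where $u_n$ and $v_n$ are the values of the two variables in the $n$th Level 2 data vector, and:
$$\sigma_{uv} = \frac{1}{N} \sum_{n=1}^{N} (u_n - \bar{u})(v_n - \bar{v}),$$

$$\sigma_u^2 = \frac{1}{N} \sum_{n=1}^{N} (u_n - \bar{u})^2,$$

$$\sigma_v^2 = \frac{1}{N} \sum_{n=1}^{N} (v_n - \bar{v})^2,$$

$$\bar{u} = \frac{1}{N} \sum_{n=1}^{N} u_n, \qquad \bar{v} = \frac{1}{N} \sum_{n=1}^{N} v_n,$$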

and so operating on the summary data will not produce what would have been obtained by operating on the original data. Therefore, we must use the simulation strategy outlined above to obtain point estimates and confidence intervals for grid cell correlations. AIRSCorrMap.sf.xml is the SciFlo for this calculation, and takes about 6 hours to run.
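A toy example shows why correlation cannot simply be computed from the representatives. In the Python sketch below the data values and cluster assignments are made up, and `correlation` is an illustrative helper implementing the ordinary Pearson correlation; replacing every data vector by its cluster representative discards the within-cluster scatter and changes the answer.

```python
def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

# Made-up original data: two clusters of two (u, v) pairs each.
original = [(0.0, 1.0), (1.0, 0.0), (4.0, 5.0), (5.0, 4.0)]
# The same data with each pair replaced by its cluster representative.
pseudo = [(0.5, 0.5), (0.5, 0.5), (4.5, 4.5), (4.5, 4.5)]

r_original = correlation([p[0] for p in original], [p[1] for p in original])
r_summary = correlation([p[0] for p in pseudo], [p[1] for p in pseudo])
```

Here the original data have a correlation of 15/17 (about 0.88), while the pseudo-values alone give exactly 1, so the cluster errors must be brought in through simulation to recover a realistic estimate.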

At this time only one month of AIRS L3Q is available: August 2003. Additional months will be posted shortly, and this page will be updated accordingly. Also, please remember that this is work in progress, and there may be some lapses in functionality. If you have comments or questions, please send them to Amy Braverman at Amy.Braverman@jpl.nasa.gov.
