Project Introduction
Mining Massive Earth Science Data Sets for Climate and Weather Forecast Models is a two-year project (FY 2005-2006) funded by NASA's Earth Science Technology Office (ESTO) to study how a data reduction technique theoretically grounded in data compression may be used to compare large volumes of climate model output and remote sensing observations. This technology creates small summary data sets of reduced volume and complexity, which can be used in place of the original data. A key property of these summaries is that they preserve the multivariate distributional features of the original data. Therefore, they can be aggregated in space and time, and used as a base product from which to derive custom data sets. Summaries of observational data can be used as input to models, for comparisons to summaries of model output, and for other exploratory studies intended to improve model physical parameterizations. We applied the methodology to both model-generated and observational data sets in a variety of settings and for a variety of purposes.
In particular, this page describes and demonstrates how reduced volume and complexity summary data sets are used as a base data product from which custom summary products can be derived. The idea is as follows. A statistically sound summary data set resides on a web-service-enabled computer. A user would like to obtain a gridded monthly data product based on a calculation performed on the summarized data. The user fills out a form describing the calculation, and the calculation is performed on the summary data. The results are returned to the user along with information about how much error is introduced by the fact that the summary, rather than the original data, was used.
The web-service functionality is provided by the Genesis-II project at JPL, one of NASA's REASoN (Research, Education and Applications Solutions Network) projects. This is a collaborative effort to infuse the ESTO-funded technology into the larger Genesis-II system, which provides a public interface.
The core technology that enables remote manipulation of remote data sets is SciFlo, developed as part of Genesis-II. The statistical research was supported by ESTO, and also by the
Atmospheric Infrared Sounder (AIRS) and Multi-angle Imaging SpectroRadiometer (MISR) projects at JPL.
The latter two also provided computational support without which this work would not have been possible.
Our first examples will therefore use AIRS and MISR data.
The AIRS project is in the process of implementing this technology to produce the AIRS Level 3
Quantization Product (L3Q) as a standard data product. Files containing monthly and 5-day summaries of
35 AIRS parameters will be available through the Goddard DAAC when the product becomes public. Here we
use a preliminary version of L3Q for the month of August 2003 to illustrate how SciFlo enables remote
calculation using these data. MISR will be incorporated when suitable L3Q data products exist, funding
permitting.
The next section describes the AIRS L3Q data product and how it is used to produce custom, user-defined
data products. Finally, "Try it yourself" is the demo page.
AIRS Level 3 Products On Demand
AIRS Level 3 Products on Demand are estimates of various quantities computed from the AIRS Level 3 Quantization Product. The "true" values of such quantities are those values that would be computed directly from AIRS Level 2 data sets. However, the difficulties inherent in acquiring and manipulating large numbers of HDF files often make this impractical. In addition, the computation times for these calculations and computing resources required can be prohibitive.
The AIRS Level 3 Quantization Product
The AIRS Level 3 Quantization Product (L3Q) provides distributional summaries of Level 2 data on monthly, five-degree grids.
For each grid cell, the distributional summary is a varying number of representative vectors, their associated counts and errors.
Representatives "stand in" for a collection of AIRS Level 2 data vectors called a cluster. A cluster representative is
the centroid (or mean vector) of the corresponding Level 2 data vectors.
Conceptually, a Level 2 data vector is a vector formed by appending all AIRS Level 2 measurement values for the same time and location. For instance, the measurements shown in the table below for the same AIRS footprint become one Level 2 data vector, which we will denote $\mathbf{x}_n$.
L3Q may be best understood as an extension of traditional Level 3 monthly data products. Traditional Level 3 products normally provide the monthly average, standard deviation and count by grid cell for a set of variables. If L3Q were constrained to provide exactly one representative per grid cell per month, it would carry the same information, except that the individual variable counts and standard deviations would all be rolled into single values for count and error.
By allowing multiple representative vectors per grid cell, L3Q achieves two improvements over traditional Level 3 products. First, it approximates the Level 2 data distribution. The mean and standard deviation only fully specify a data distribution if that distribution is Gaussian, which in general can't be guaranteed. Second, by treating AIRS data as vectors, it approximately preserves the joint relationships among variables. In traditional Level 3 products, joint relationships among pairs of variables are captured only if covariances or correlations are explicitly provided, and even then these statistics measure linear association only. We have every reason to suspect that important relationships in the AIRS data may be non-linear.
The larger the number of representatives used for a grid cell, the better the summary distribution approximates the original Level 2 multivariate data distribution. In the extreme, if the number of clusters equals the number of raw data vectors,
then every cluster contains just a single AIRS Level 2 data vector, which is the cluster representative. The count in each cluster is just one, and the errors are zero. This provides a perfect representation of the original data, but no data reduction. At the other extreme, we could allow just one cluster per grid cell as described above. In that case, data reduction is maximized, but so is error. The algorithm that creates L3Q seeks a good compromise between these two extremes. The algorithm uses information-theoretic tests to determine whether additional error incurred as a result of increased data reduction is warranted. The number of clusters (and the
assignment of Level 2 data vectors to them) is modulated to reflect the information-theoretic complexity of the data being summarized.
For example, in a grid cell where almost all the Level 2 data vectors are nearly the same, only one cluster would be required to
preserve most of the information with little error. However, in a very heterogeneous grid cell, a larger number of clusters would be
required to represent the multivariate data distribution with a similar, low error.
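The trade-off between data reduction and error can be illustrated with a small sketch. It does not reproduce the L3Q algorithm, whose information-theoretic tests are not specified on this page. As a stand-in it uses plain k-means (Lloyd's algorithm) on synthetic two-variable data, and the names `kmeans` and `distortion` are hypothetical.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; a stand-in quantizer, not the L3Q algorithm."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data vectors chosen at random.
    reps = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every vector to its nearest representative.
        d2 = ((x[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each representative to the centroid of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                reps[j] = members.mean(axis=0)
    # Final assignment against the final representatives.
    d2 = ((x[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    return reps, d2.argmin(axis=1)

def distortion(x, reps, labels):
    """Average squared Euclidean distance from each vector to its representative."""
    return float(((x - reps[labels]) ** 2).sum(axis=1).mean())

# Synthetic "grid cell": two well-separated groups of two-variable vectors.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

results = {}
for k in (1, 4, len(x)):
    reps, labels = kmeans(x, k)
    results[k] = distortion(x, reps, labels)
    print(f"clusters={k:3d}  distortion={results[k]:.4f}")
```

With one cluster the distortion equals the sum of the per-variable variances, and with as many clusters as data vectors it is zero. Intermediate cluster counts fall between these two extremes.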
By better preserving the multivariate distribution, L3Q can more reliably be used as the basis for calculations than can traditional
Level 3 products. By quantifying and reporting cluster errors, we also maintain an ability to quantify the likely errors in calculations based on L3Q relative to what would have been obtained had those same calculations been performed on Level 2 data
vectors. This is the basis for AIRS Level 3 Products on Demand.
Functions versus Distribution Parameters
Suppose our goal is to estimate the behavior of a scalar-valued function, $f$, of AIRS Level 2 data vectors.
Summary parameters like the mean, median, and variance are quantities that describe some aspect of a distribution. Ideally, we would like to examine and understand the whole distribution because it contains more information than single summary values. In practice, one usually examines parameters instead because they are easier to work with, but the choice of which parameter to use depends on one's science objective. Applying transformations allows scientists to better understand their data, for example by changing scale or by using the data to make predictions. With L3Q, the choice of both transformation and parameter is up to the user instead of being hard-wired into a static data product.
Estimating Distributions of Functional Transformations
L3Q provides us with a set of pseudo-values of the Level 2 data vectors in each grid cell, along with a cluster error for each cluster.
(If any of the variable values are missing, the footprint is ignored.)
Variables Summarized by the AIRS Level 3 Quantization Product
Variable index | Variable name | Description
0 | tair at 150 mb | atmospheric temperature at 150 mb
1 | tair at 200 mb | atmospheric temperature at 200 mb
2 | tair at 250 mb | atmospheric temperature at 250 mb
3 | tair at 300 mb | atmospheric temperature at 300 mb
4 | tair at 400 mb | atmospheric temperature at 400 mb
5 | tair at 500 mb | atmospheric temperature at 500 mb
6 | tair at 600 mb | atmospheric temperature at 600 mb
7 | tair at 700 mb | atmospheric temperature at 700 mb
8 | tair at 850 mb | atmospheric temperature at 850 mb
9 | tair at 925 mb | atmospheric temperature at 925 mb
10 | tair at 1000 mb | atmospheric temperature at 1000 mb
11 | h2ommr at 150 mb | atmospheric water vapor at 150 mb
12 | h2ommr at 200 mb | atmospheric water vapor at 200 mb
13 | h2ommr at 250 mb | atmospheric water vapor at 250 mb
14 | h2ommr at 300 mb | atmospheric water vapor at 300 mb
15 | h2ommr at 400 mb | atmospheric water vapor at 400 mb
16 | h2ommr at 500 mb | atmospheric water vapor at 500 mb
17 | h2ommr at 600 mb | atmospheric water vapor at 600 mb
18 | h2ommr at 700 mb | atmospheric water vapor at 700 mb
19 | h2ommr at 850 mb | atmospheric water vapor at 850 mb
20 | h2ommr at 925 mb | atmospheric water vapor at 925 mb
21 | h2ommr at 1000 mb | atmospheric water vapor at 1000 mb
22 | cldfrc at 200 mb | cloud fraction at 200 mb
23 | cldfrc at 250 mb | cloud fraction at 250 mb
24 | cldfrc at 300 mb | cloud fraction at 300 mb
25 | cldfrc at 400 mb | cloud fraction at 400 mb
26 | cldfrc at 500 mb | cloud fraction at 500 mb
27 | cldfrc at 600 mb | cloud fraction at 600 mb
28 | cldfrc at 700 mb | cloud fraction at 700 mb
29 | cldfrc at 850 mb | cloud fraction at 850 mb
30 | cldfrc at 925 mb | cloud fraction at 925 mb
31 | cldfrc at 1000 mb | cloud fraction at 1000 mb
32 | landfrc | land fraction of scene
33 | rettype | fraction of good quality observations
34 | scannodetype | fraction of daytime observations
The summarization algorithm treats the collection of all Level 2 data vectors belonging to the same five-degree grid cell in the same month as a collection of points in 35-dimensional space, and finds centers of mass in this high-dimensional space. The centers of mass become the cluster representatives, which have the same fields shown in the table. The cluster counts are the numbers of AIRS Level 2 data vectors assigned to the clusters. Each cluster error is the average squared Euclidean distance between the cluster's representative and its members. The graphic below depicts the situation in the case where there are only two variables, not 35.


In the graphic on the left, the left panel is a schematic of AIRS Level 2 data for two variables in one grid cell. The upper panel shows a scatterplot in which each data point has equal mass, and the lower panel shows the corresponding data table. The right panel shows the corresponding summarized data. Here cluster representatives have differing masses corresponding to the cluster counts, $N_k$. $K$ is the number of clusters, and is smaller than the number of Level 2 data vectors, $N$.
The graphic on the far right zooms in on one cluster representative, and shows several Level 2 data vectors assigned to that cluster. The distances (here in two dimensions and shown in gold) between the representative and its members form the basis for calculating the cluster error. The error is the average of the squared distances.
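The summary quantities described above can be computed directly once the cluster assignments are known. The following is a minimal sketch with made-up two-variable data, not AIRS data. The function name `summarize` and the variable names are hypothetical, and the cluster labels are simply given rather than produced by the L3Q algorithm.

```python
import numpy as np

def summarize(x, labels):
    """Return cluster representatives, counts, and errors for given labels."""
    ids = np.unique(labels)
    # Representative: centroid (mean vector) of the cluster members.
    reps = np.array([x[labels == k].mean(axis=0) for k in ids])
    # Count: number of data vectors assigned to the cluster.
    counts = np.array([(labels == k).sum() for k in ids])
    # Error: average squared Euclidean distance from members to the representative.
    errors = np.array([((x[labels == k] - reps[i]) ** 2).sum(axis=1).mean()
                       for i, k in enumerate(ids)])
    return reps, counts, errors

# Six two-variable data vectors, already assigned to two clusters.
x = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0],
              [10.0, 10.0], [12.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

reps, counts, errors = summarize(x, labels)
# Count-weighted average of the cluster errors for the whole summary.
overall_error = (counts * errors).sum() / counts.sum()

print(reps)           # cluster centroids
print(counts)         # members per cluster
print(errors)         # per-cluster errors
print(overall_error)  # average error over all data vectors
```

Here the six original vectors are replaced by two representatives, each carrying a count and an error, which is the form of the summary for one grid cell.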
If we could apply a function $f$ to each AIRS Level 2 data vector $\mathbf{x}_n$, we could examine the distribution of $f(\mathbf{x}_n)$ by making a histogram. If we wanted to select one or more summary parameters, $\theta$, to describe the distribution, we would have many to choose from: the mean, the median, quantiles, etc. Hence, we distinguish between functional transformations of the data, $f(\mathbf{x}_n)$, and parameters which describe aspects of their distributions, $\theta$.

The figure shows what happens if we begin with a discrete distribution of
values for which the histogram is shown in green, and apply a non-linear function to those values. Here the non-linear
function is just the square function, and the distribution of the transformed values is shown by the blue histogram.
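The distinction between a transformation and a parameter can be made concrete with a few numbers. This sketch uses made-up values, not AIRS data, takes the square function as the transformation, and takes the mean as the parameter. It shows that the parameter of the transformed distribution is generally not the transformation of the original parameter.

```python
import numpy as np

# Hypothetical discrete values standing in for one variable in one grid cell.
values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])

# Functional transformation: the square function applied to every value.
transformed = values ** 2

# Histograms of the original and transformed values (value -> frequency).
orig_vals, orig_freq = np.unique(values, return_counts=True)
new_vals, new_freq = np.unique(transformed, return_counts=True)

# The parameter (here the mean) must be computed on the transformed values.
square_of_mean = values.mean() ** 2
mean_of_squares = transformed.mean()

print(dict(zip(new_vals, new_freq)))
print(square_of_mean, mean_of_squares)
```

Because the two results differ, a product that stores only the mean of a variable cannot be used to recover the mean of its square. That is the motivation for keeping a summary of the whole distribution.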
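For each grid cell, call the pseudo-values that L3Q provides $\tilde{\mathbf{x}}_{kj}$, where $k$ indexes cluster and $j$ indexes data point within cluster. Within each cluster all the pseudo-values are the same: they are equal to the cluster's representative vector, and there are $N_k$ of them. ($N_k$ is the count for the $k$th cluster.) Obviously, for a function $f$ of interest, the distribution of $f(\tilde{\mathbf{x}}_{kj})$ will only be an approximation of the distribution of $f(\mathbf{x}_n)$, where the $\mathbf{x}_n$ are the original Level 2 data vectors. The quality of the approximation depends on how close the $\tilde{\mathbf{x}}_{kj}$'s are to the original Level 2 $\mathbf{x}_n$'s, and on how the function $f$ treats those discrepancies.
The cluster error, $\delta_k$, is a measure of how close the original Level 2 AIRS data vectors are to their pseudo-values: $\delta_k = \frac{1}{N_k} \sum_{j=1}^{N_k} \lVert \mathbf{x}_{kj} - \tilde{\mathbf{x}}_{kj} \rVert^2$, where $\mathbf{x}_{kj}$ is the $j$th Level 2 data vector assigned to cluster $k$. An overall measure of quality is the average cluster error for the entire distributional summary, also called the distortion: $\delta = \frac{1}{N} \sum_{k=1}^{K} N_k \delta_k$, where $K$ is the number of clusters and $N = \sum_{k=1}^{K} N_k$ is the number of Level 2 data vectors in the grid cell.
For an arbitrary function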
Estimating Distribution Parameters
The graphic below depicts our strategy for estimating arbitrary parameters of distributions of arbitrary functions.
The cluster errors ($\delta_k$) provide upper bounds for the variances of all individual variables, since each error is the trace of the within-cluster covariance matrix.
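The relationship between the cluster error and the within-cluster covariance matrix can be checked numerically. This is a minimal sketch on synthetic three-variable cluster members, not AIRS data. It assumes the error is the average (divide by the count) squared distance to the centroid, so the matching covariance matrix is the population version (`bias=True`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic members of one cluster, with a different spread in each variable.
members = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])

# Representative (centroid) and cluster error (average squared distance).
rep = members.mean(axis=0)
error = ((members - rep) ** 2).sum(axis=1).mean()

# Within-cluster covariance matrix, normalized by the count (population form).
cov = np.cov(members, rowvar=False, bias=True)
variances = np.diag(cov)

print(error, np.trace(cov))  # the two agree up to rounding
print(variances)             # each one is no larger than the error
```

Since the trace is the sum of the non-negative per-variable variances, no single variance can exceed it. This is why the error can serve as a conservative variance for every variable.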
The left panel depicts the Level 2 distribution conceptually as a set of equally likely individual data points, $\mathbf{x}_n$, shown as the bottom histogram. Above them are bars of unequal height depicting cluster representatives for the four clusters into which the $\mathbf{x}_n$'s have been grouped: $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$, and $\mathbf{r}_4$. The cluster errors, $\delta_k$, measure within-cluster dispersions.
We model the distribution of the $\mathbf{x}_n$'s as a probabilistic mixture of Gaussian (Normal) distributions. With a Gaussian mixture, one draws a value of $\mathbf{x}$ at random by first choosing a cluster, $k$, at random with probabilities $N_k/N$, where $N_k$ is the count for cluster $k$ and $N$ is the total number of Level 2 data vectors in the grid cell, then generating a value from a multivariate Gaussian distribution with mean vector $\mathbf{r}_k$ and covariance matrix $\delta_k \mathbf{I}$. (This approximation of the covariance matrix ignores the effects of covariances. On the other hand, since $\delta_k$ is the trace of the true within-cluster covariance matrix, $\delta_k$ overestimates the true within-cluster variances.) The Law of Large Numbers guarantees that as the number of draws gets large, the proportion with which one encounters cluster $k$ converges to $N_k/N$. However, since we already know the values of the $N_k$'s, we can shortcut this process by fixing the proportion of draws from each cluster appropriately. For example, if the total number of draws is $M$, we draw $M N_k / N$ (or the nearest integer thereto) from cluster $k$. The remaining question is how large $M$ must be. Ideally $M$ would be large, but that may be computationally prohibitive. We are still investigating this issue; in the meantime we use a fixed value of $M$.
The middle panel of the figure above depicts the process of resampling from the Gaussian mixture model. There are $B$ trials in this simulation. For each trial, we draw a synthetic data set as described in the previous paragraph, apply the transformation $f$ to its members, and then compute the distributional parameter, $\hat{\theta}_b$. The distribution of the $B$ values of $\hat{\theta}_b$ is the synthetic sampling distribution upon which we base an inference about the true value of $\theta$.
The right panel shows the simulated sampling distribution and the parameter values we report for inferential purposes. We report the quantiles at 2.5, 50, and 97.5 percent. These constitute a point estimate (the median) and a 95% confidence interval, $[q_{0.025},\, q_{0.975}]$, where $q_p$ denotes the $p$ quantile of the simulated sampling distribution.
Try it yourself
Clicking on "Try it yourself" will take you to the Genesis-II page for this project. There you will find a listing of SciFlos. A SciFlo is an XML document that invokes an operation on a remote computer. This may be accomplished by triggering a web service (a program installed on that remote computer), or by sending a bundle of code from the local machine to the remote machine. Usually the remote machine is the one on which the data reside, since it's easier to move code than data. Clicking on "xml" shows you the XML document. Clicking on "execute" presents you with a form based on the inputs specified in the XML document. This is how the system passes arguments over the web to the remote functions. Pushing "submit" initiates the calculation, and when it has finished, output is either displayed or can be found at a location specified by a returned URL. The system also provides a graphical visualization to monitor progress.
We currently have three SciFlos: AIRSGetMeanMap.sf.xml, AIRSGetDistortionMap.sf.xml, and AIRSCorrMap.sf.xml. The first generates a map of the mean value (over clusters) of the field entered in the variableName box. The second generates a map of the mean value (over clusters) of cluster error. These SciFlos are very fast because there is no uncertainty associated with their results. For instance, the true mean value of a variable in a grid cell can be exactly reproduced by averaging the cluster mean values of the variable, weighted by the cluster counts.
Correlation, on the other hand, is a non-linear function of the variables. For two variables $u$ and $v$ it is $\rho_{uv} = \frac{\sum_{n=1}^{N} (u_n - \bar{u})(v_n - \bar{v})}{\sqrt{\sum_{n=1}^{N} (u_n - \bar{u})^2 \, \sum_{n=1}^{N} (v_n - \bar{v})^2}}$, where $\bar{u} = \frac{1}{N}\sum_{n=1}^{N} u_n$ and $\bar{v} = \frac{1}{N}\sum_{n=1}^{N} v_n$, and so operating on the summary data will not produce what would have been obtained by operating on the original data. Therefore, we must use the simulation strategy outlined above to obtain point estimates and confidence intervals for grid cell correlations. AIRSCorrMap.sf.xml is the SciFlo for this calculation, and it takes about 6 hours to run.
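The simulation strategy for a non-linear quantity such as correlation can be sketched as follows. This is an illustration under stated assumptions, not the code behind AIRSCorrMap.sf.xml. The summary values are made up, and the function name `resample_parameter` and the choices of 1000 draws and 200 trials are hypothetical. Each cluster is modeled as a Gaussian centered on its representative, with the cluster error used as the variance of every variable and covariances ignored.

```python
import numpy as np

def resample_parameter(reps, counts, errors, param, n_draws, n_trials, seed=0):
    """Simulate the sampling distribution of `param` from a cluster summary."""
    rng = np.random.default_rng(seed)
    # Fix the number of draws per cluster in proportion to the cluster counts.
    per_cluster = np.rint(n_draws * counts / counts.sum()).astype(int)
    estimates = np.empty(n_trials)
    for b in range(n_trials):
        # Spherical Gaussian around each representative, variance = cluster error.
        draws = [rng.normal(loc=r, scale=np.sqrt(e), size=(m, len(r)))
                 for r, e, m in zip(reps, errors, per_cluster)]
        estimates[b] = param(np.vstack(draws))
    # Report the 2.5, 50, and 97.5 percent quantiles.
    return np.percentile(estimates, [2.5, 50.0, 97.5])

# Hypothetical summary for one grid cell: three clusters, two variables.
reps = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 5.0]])
counts = np.array([50, 30, 20])
errors = np.array([0.5, 0.8, 0.3])

# The mean needs no simulation: it is the count-weighted mean of the representatives.
weighted_mean = np.average(reps, axis=0, weights=counts)

def correlation(sample):
    """Correlation between the first two variables of a synthetic sample."""
    return np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

lo, med, hi = resample_parameter(reps, counts, errors, correlation,
                                 n_draws=1000, n_trials=200)
print(weighted_mean)
print(lo, med, hi)  # 95% interval endpoints and the median point estimate
```

The cost grows with both the number of draws and the number of trials, which is consistent with the correlation map being far slower than the mean and distortion maps.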
At this time only one month of AIRS L3Q is available: August 2003. Additional months will be posted shortly, and this page will be updated accordingly. Also, please remember that this is work in progress, and there may be some lapses in functionality. If you have comments or questions, please send them to Amy Braverman at Amy.Braverman@jpl.nasa.gov.






