The Data Model (Introduction)

1. Field Object
The "Field" is the core concept in DX. Let's start by thinking about the assumptions implicit in data collection, analysis, or modeling. In all cases, data values are sampled at discrete intervals. This is true whether the phenomenon is inherently discrete (average age of students at different high schools in a large city) or continuous (cloudwater density in a thundercloud).

Data sampling is discrete for some fundamental reasons: (1) no one has the budget to sample at infinitely high precision, and (2) Heisenberg's Uncertainty Principle implies that we can't technically record information (accurately) at infinite resolution anyway (the first reason is more relevant to most people). Pelkie's Corollary to (2): a ruler with infinitely fine markings looks blurry.

The implicit assumption in data sampling is that there is an underlying coordinate system onto which the data (discrete or continuous) is mapped. That is to say, we record not only "how much" but also "where," even if the coordinate system is not explicitly described in the data set. This is an important point to emphasize because newcomers to visualization say, "Well, my data is an array; what coordinate system are you talking about?" The answer is that, if no other spatial system is provided, the array's grid is the coordinate system; the data values in the grid cells (or at the grid intersections) are the data. In other words, even if you don't think about the grid per se, DX needs to know where in space to plot the data representations. So we must describe some kind of spatial coordinate frame in which the data values are placed in order for DX to make visual representations. In fact, unless you are a pure mathematician, you probably measured or modeled some "real" experimental domain, so even though you didn't write down every [X, Y] location explicitly, you are working in a "real" space. (Pure mathematicians, rejoice: DX doesn't care if the space you declare is "real" or not.)
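To make the "array's grid is the coordinate system" idea concrete, here is a minimal Python sketch (not DX code). It assumes a hypothetical 2 x 3 array of made-up values and an arbitrary convention that x is the column index and y is the row index; DX's own ordering rules for regular grids are not shown here.

```python
# A small 2D array of samples with no explicit coordinates (made-up values).
samples = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def implicit_positions(rows, cols):
    """Return one [x, y] position per array element, in row-major order.

    Assumption for this sketch: x is the column index, y is the row index.
    """
    return [[col, row] for row in range(rows) for col in range(cols)]


# The "where": one position per array element, derived purely from the indices.
positions = implicit_positions(len(samples), len(samples[0]))

# The "how much": the same values, flattened in the same row-major order.
data = [value for row in samples for value in row]

for pos, value in zip(positions, data):
    print(pos, value)
```

Each printed line pairs a position with its data value, which is exactly the pairing the array implied all along.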

Some examples from different disciplines of this spatial coordinate frame include:

  • Remotely-sensed data recorded at particular latitude-longitude-time coordinates
  • Computational fluid dynamics data measured within volumetric elements in a 3D volume
  • Finite element analysis stresses measured at nodes on an irregular 2D mesh
  • Economic data values recorded at certain times (time is a perfectly good "axis" for a visual "space"; for that matter, so is any other monotonic function)

The Field is the device that binds together the "where" with the "how much," or in DX terms, the "positions" and the "data." Furthermore, if the data is sampled from a continuous distribution (and this is actually far more common than discrete data), we must tell DX something about the sampling continuum. Since we can't sample data at infinite resolution, we make the assumption in scientific research that if we sample at a "good enough" resolution, then it is meaningful to interpolate between the sample points to get arbitrary "in-between" values.

For example, if we measure soil acidity on a 1 meter grid over a 30m x 30m plot, we assume that the measurements at each grid cell corner can be averaged and weighted such that we can derive with reasonable accuracy what the soil acidity is at the point [10.3 m, 21.5 m]. When we ask DX to plot a colormapped surface of the soil acidity data measured at each of the 961 grid points (31 along each side, bounding 900 cells), we expect it to "fill in the blanks" across the surface of the grid and figure out a reasonable color (by interpolation) for all the pixels in the image we observe. If it could not do this, we'd simply see 961 discrete colored dots. But DX can only do this interpolation operation if we declare that the grid is in fact a connected mesh. The "connections" declaration tells DX which grid points connect to which others to form a continuous interpolation surface.
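As a rough illustration of "averaged and weighted," the sketch below applies ordinary bilinear interpolation inside the one grid cell that contains [10.3 m, 21.5 m]. Bilinear weighting is an assumption chosen for illustration, not a statement of what DX does internally, and the four corner readings are made-up pH values.

```python
def bilinear(x, y, x0, y0, x1, y1, v00, v10, v01, v11):
    """Interpolate inside the cell with corners (x0, y0) and (x1, y1).

    v00 is the value at (x0, y0), v10 at (x1, y0),
    v01 at (x0, y1), and v11 at (x1, y1).
    """
    tx = (x - x0) / (x1 - x0)  # fractional distance across the cell in x
    ty = (y - y0) / (y1 - y0)  # fractional distance across the cell in y
    bottom = v00 * (1 - tx) + v10 * tx  # blend along the y0 edge
    top = v01 * (1 - tx) + v11 * tx     # blend along the y1 edge
    return bottom * (1 - ty) + top * ty  # blend between the two edges


# Hypothetical readings at the corners of the cell spanning
# x = 10..11 m and y = 21..22 m.
ph = bilinear(10.3, 21.5, 10.0, 21.0, 11.0, 22.0,
              v00=6.0, v10=6.4, v01=6.2, v11=6.8)
print(round(ph, 3))
```

The nearer a corner is to the query point, the more its reading contributes to the result, which is all that "weighted" means here.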

Consider, too, the case of a line plot. If there were no lines, it would be a scatter plot, denoting discrete samples (or correlated values if the two axes were covarying measures). When we connect the points with lines, we are declaring the points to be samples of a continuum. Any point that falls along a line between two actual sample points must be derived by interpolation. One could thus determine both the interpolated data value (the Y) and the interpolated position value (the X) for any point along the line. And in fact, that's how all plotting programs (or a person with a ruler) make a line plot. When you draw each line with your pencil, you are "rendering an interpolated line." You just never thought about it much before.
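The pencil-and-ruler operation can be written down in a few lines. This is a minimal sketch of linear interpolation along a line plot, assuming the X values are strictly increasing; the function name and the sample values are hypothetical.

```python
def interpolate_on_polyline(xs, ys, x):
    """Return the Y at x by linear interpolation between the bracketing samples.

    Assumes xs is strictly increasing and has the same length as ys.
    """
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            # Fractional distance along the segment from sample i to sample i + 1.
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the sampled range")


# Four made-up samples; the plotted lines connect them in order.
xs = [0.0, 1.0, 2.0, 4.0]
ys = [10.0, 12.0, 11.0, 15.0]

# A point halfway along the last segment, where no sample was taken.
mid = interpolate_on_polyline(xs, ys, 3.0)
print(mid)
```

Asking for an X outside the sampled range raises an error in this sketch, since there is no connected segment there to interpolate along.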

Consider this: if we told DX that for each grid cell, the four corners were connected into a quadrilateral surface, then we could ask DX to find the value at any point on the surface of that quadrilateral and color it or rubbersheet it or find contours through it and so on. However, if we declare that the connectivity is only between adjacent pairs of points, such that the grid is made up of lines like a fish net, then we can only interpolate values between adjacent (connected) points (values that lie on the lines) and we cannot find values within the empty spaces of the grid cells as there are no "cells" in this case. Obviously, the first case is more common, but it is possible to use the second form when needed.
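The difference between the two declarations is easy to see if we list the connections for a tiny grid. The sketch below is illustrative Python, not DX syntax: it numbers the points of a 3 x 3 grid in row-major order and builds both kinds of connectivity. The function names and the vertex order within each quadrilateral are assumptions made for this example.

```python
def grid_index(row, col, cols):
    """Index of the grid point at (row, col) when points are numbered row by row."""
    return row * cols + col


def quad_connections(rows, cols):
    """One 4-tuple of point indices per grid cell (the surface case)."""
    quads = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            quads.append((
                grid_index(r, c, cols),
                grid_index(r, c + 1, cols),
                grid_index(r + 1, c, cols),
                grid_index(r + 1, c + 1, cols),
            ))
    return quads


def line_connections(rows, cols):
    """One 2-tuple per pair of adjacent points (the fish net case)."""
    lines = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # neighbor in the next column
                lines.append((grid_index(r, c, cols), grid_index(r, c + 1, cols)))
            if r + 1 < rows:  # neighbor in the next row
                lines.append((grid_index(r, c, cols), grid_index(r + 1, c, cols)))
    return lines


quads = quad_connections(3, 3)
lines = line_connections(3, 3)
print(len(quads), "quads:", quads)
print(len(lines), "lines:", lines)
```

Both lists refer to the same nine points. With quads, every location inside a cell belongs to some element and can be interpolated; with lines, only locations lying on an edge do.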

Let's return to the Field description. You can consider the spatial coordinate system to be the "independent variable" and the data values sampled on this coordinate system to be the "dependent variable" since the value you measure depends on where it was sampled. That leads to the implication that in a non-trivial (non-null) Field, there must be a "positions" component: you can't have a dependent variable without the presence of an independent variable, but you can have an independent variable with no dependent values (the sampling grid before you perform the measurements, for example). In DX, you can have a Field with only "positions," but never a Field with only "data."

If the data is discretely sampled (scattered data), there is no "connections" component in the Field; the presence of one would be "illegal" since the nature of the data implies no interpolation between sample points (what's the average age of high school kids at the point halfway between two sample high schools?). However, it is far more common that there is a "connections" component.

And of course, the "data" component is usually present since our purpose is generally to visualize "data".

So a Field is most often made up of at least three "components" whose names are keywords in DX; that is, their meanings are precise and specific and cannot be redefined by the user. These are "positions", "connections", and "data". There are many other components, but we'll stick with these for now.
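To summarize the rules so far, here is a toy Python model of a Field. It is a sketch for thinking purposes only, not the DX API or file format: the class, its checks, and the sample values are all hypothetical. It captures two points from the text: "positions" must be present in a non-empty Field, while "connections" and "data" are optional.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Field:
    """Toy stand-in for a DX Field; not the DX API."""
    positions: List[Sequence[float]]                   # the "where" (required)
    connections: Optional[List[Sequence[int]]] = None  # which points join up
    data: Optional[List[float]] = None                 # the "how much"

    def __post_init__(self):
        if not self.positions:
            raise ValueError('a non-empty Field needs "positions"')
        if self.connections is not None:
            # Every connection must refer to positions that actually exist.
            for element in self.connections:
                for index in element:
                    if not 0 <= index < len(self.positions):
                        raise ValueError("connection refers to a missing position")


corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A sampling grid before any measurements: "positions" only.
grid_only = Field(positions=corners)

# Scattered data (made-up average ages at two schools): no "connections".
scattered = Field(positions=[[2.0, 3.0], [5.0, 1.0]], data=[16.2, 17.1])

# The common case: all three components, with one quadrilateral element.
surface = Field(positions=corners, connections=[(0, 1, 2, 3)],
                data=[6.0, 6.4, 6.2, 6.8])

print(grid_only.data, scattered.connections, len(surface.connections))
```

Because "positions" is the only argument without a default, trying to build a Field from "data" alone fails immediately, mirroring the rule that you can't have a dependent variable without an independent one.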