Advanced Datasets

Depending on the nature of the dataset being analyzed, InformationGeometry.jl provides several data types in which to store it. These types mainly trade speed and simplicity against flexibility and generality, as illustrated by the following table:

Container | allows non-Gaussian y-uncertainty | allows missing values | allows x-uncertainty | allows mixed x-y uncertainty | allows y-uncertainty estimation | allows x-uncertainty estimation

DataSetExact(x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(x::AbstractArray, Σ_x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(xd::Distribution, yd::Distribution, dims::Tuple{Int,Int,Int}=(length(xd),1,1))

A data container which allows for uncertainties in the independent variables, i.e. the $x$-variables. Moreover, the observed data is stored in terms of two probability distributions over the spaces $\mathcal{X}^N$ and $\mathcal{Y}^N$ respectively, which also allows for non-Gaussian uncertainties in the observations. For instance, the uncertainties associated with a given observation might follow a Cauchy, Student's t, log-normal or some other smooth distribution.


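using InformationGeometry, Distributions
# Independent x-uncertainties: one Gaussian and one Cauchy component
X = product_distribution([Normal(0, 1), Cauchy(2, 0.5)])
# Correlated y-uncertainties: bivariate t-distribution with 2 degrees of freedom
Y = MvTDist(2, [3, 8.], [1 0.5; 0.5 3])
# dims = (Npoints, xdim, ydim)
DataSetExact(X, Y, (2,1,1))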

Uncertainties in the independent $x$-variables are optional for DataSetExact, and can be set to zero by wrapping the x-data in an InformationGeometry.Dirac "distribution". The following illustrates numerically equivalent ways of encoding a dataset whose uncertainties in the $x$-variables are zero:

using InformationGeometry, Distributions, LinearAlgebra
DS1 = DataSetExact(InformationGeometry.Dirac([1,2]), MvNormal([5,6], Diagonal([0.1, 0.2].^2)))
DS2 = DataSetExact([1,2], [5,6], [0.1, 0.2])
DS3 = DataSet([1,2], [5,6], [0.1, 0.2])

where DS1 == DS2 == DS3 will evaluate to true.


The CompositeDataSet type is a more elaborate (and typically less performant) container for storing data. Essentially, it splits observed data which has multiple y-components into separate data containers (e.g. of type DataSet), each of which corresponds to one of the components of the y-data. Crucially, each of the smaller data containers still shares the same "kind" of x-data, that is, the same xdim, units and so on, although they do not need to share the exact same particular x-data.

The main advantage of this approach is that it can be applied when some observations have missing y-components. A typical use case for CompositeDataSets is a time series where multiple quantities are tracked but not every quantity is necessarily recorded at each time step. Example:

using InformationGeometry, DataFrames
t = [1,2,3,4]
y₁ = [2.5, 6, missing, 9];      y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4);               σ₂ = [missing, 0.2, 0.1, 0.5]
df = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)

xdim = 1;   ydim = 2
CompositeDataSet(df, xdim, ydim; xerrs=false, stripedYs=true)

The boolean-valued keywords stripedXs and stripedYs tell the constructor whether the values and their corresponding $1\sigma$ uncertainties are given in alternating columns, or whether the first block of ydim columns holds the values and the second block of ydim columns holds the corresponding uncertainties. In addition, xerrs=true indicates that the x-values also carry uncertainties. Nearly all functions which can be called on other data containers such as DataSet have been specialized to also work with CompositeDataSets.
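The two column layouts can be sketched in plain Julia. The helper names value_columns and sigma_columns below are hypothetical and not part of InformationGeometry.jl; they only compute which of the 2*ydim y-related columns would hold values and which would hold uncertainties under each layout, with indices counted from the first y-related column.

```julia
# Hypothetical helpers (not part of InformationGeometry.jl) that list the
# column positions of values and uncertainties for ydim y-components.
value_columns(ydim::Int; striped::Bool) =
    striped ? collect(1:2:(2 * ydim)) : collect(1:ydim)
sigma_columns(ydim::Int; striped::Bool) =
    striped ? collect(2:2:(2 * ydim)) : collect((ydim + 1):(2 * ydim))

ydim = 2
# Striped layout, columns ordered as y₁ σ₁ y₂ σ₂
striped_layout = (value_columns(ydim; striped=true), sigma_columns(ydim; striped=true))
# Block layout, columns ordered as y₁ y₂ σ₁ σ₂
block_layout = (value_columns(ydim; striped=false), sigma_columns(ydim; striped=false))
```

In the time series example above, the DataFrame columns follow the striped ordering after the leading t column, which is why stripedYs=true is passed.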

GeneralizedDataSet(dist::ContinuousMultivariateDistribution, dims::Tuple{Int,Int,Int}=(length(dist), 1, 1))

Data structure which can take general x-y-covariance into account, where dims=(Npoints, xdim, ydim) indicates the dimensionality of the data. dist should constitute a smooth distribution over the space $\mathcal{X}^N \times \mathcal{Y}^N$, where mean(dist) is interpreted as the concatenation of the (most likely values for the) observations $(x_1, \ldots, x_N, y_1, \ldots, y_N)$ and the width of dist specifies the uncertainty in the signal. Typically, dist is a multivariate Gaussian, but other distributions such as Cauchy or Student's t-distributions are also possible. Thus, arbitrary correlations between the dependent $y$ and independent $x$ variables can be encoded.


If there is no correlation between the $x$ and $y$ variables (i.e. if the off-diagonal blocks of cov(dist) are zero), it can be more performant to use the type DataSetExact to encode the given data instead.
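For the case of a Gaussian dist, this condition can be written out in block form. The symbols $\Sigma_x$, $\Sigma_y$, $\Sigma_{xy}$, $\mu_x$ and $\mu_y$ are notation introduced here for illustration only:

```latex
\operatorname{cov}(\mathrm{dist}) =
\begin{pmatrix}
  \Sigma_x & \Sigma_{xy} \\
  \Sigma_{xy}^{\top} & \Sigma_y
\end{pmatrix},
\qquad
\Sigma_{xy} = 0
\;\Longrightarrow\;
\mathcal{N}\!\left( (x, y);\, (\mu_x, \mu_y),\, \operatorname{cov}(\mathrm{dist}) \right)
= \mathcal{N}(x;\, \mu_x, \Sigma_x)\; \mathcal{N}(y;\, \mu_y, \Sigma_y)
```

The joint Gaussian then factorizes into one distribution over $\mathcal{X}^N$ and one over $\mathcal{Y}^N$, which matches the pair of distributions that DataSetExact stores. Note that this factorization argument is specific to the Gaussian case; for heavy-tailed families, vanishing off-diagonal blocks alone do not imply independence.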

DataSetUncertain(x::AbstractVector, y::AbstractVector, σ⁻¹::Function, c::AbstractVector)

The DataSetUncertain type encodes data for which the size of the variance is unknown a priori but whose error is specified via an error model of the form σ(x, y_pred, c), where c is a vector of error parameters. This parametrized error model is subsequently used to estimate the standard deviations of the observations y.


To enhance performance, the implementation actually requires the specification of a reciprocal error model, i.e. a function σ⁻¹(x, y_pred, c).

To construct a DataSetUncertain, one has to specify a vector of independent variables x, a vector of dependent variables y, a reciprocal error model σ⁻¹(x, y_pred, c) and an initial guess for the vector of error parameters c.


In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a DataSetUncertain consisting of four points can be constructed via

DS = DataSetUncertain([1,2,3,4], [4,5,6.5,7.8], (x,y,c)->1/exp10(c[1]), [0.5])
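Error models with more structure can be written in the same style. As a hedged sketch, the plain-Julia function below defines a reciprocal error model with a constant and a prediction-dependent contribution. The name σ⁻¹_rel and the specific functional form are illustrative assumptions, not something prescribed by InformationGeometry.jl.

```julia
# Hypothetical reciprocal error model: σ = 10^c[1] + 10^c[2] * |y_pred|,
# i.e. a constant noise floor plus a part that scales with the prediction.
σ⁻¹_rel(x, y_pred, c) = 1 / (exp10(c[1]) + exp10(c[2]) * abs(y_pred))

# Standard deviation implied at a single point for a trial parameter vector
c_guess = [-1.0, -2.0]
σ_at_point = 1 / σ⁻¹_rel(2.0, 5.0, c_guess)
```

Following the constructor signature given above, such a function would take the place of the constant model as the third argument, together with an initial guess of length two for c.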

It is generally advisable to exponentiate error parameters, since they are penalized proportional to log(c) in the normalization term of Gaussian likelihoods.
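To make this concrete, assume for simplicity a constant error model $\sigma = c$ and $N$ independent Gaussian observations. The model parameters $\theta$ and the predictions $y_{\text{pred},i}(\theta)$ are notation introduced here for illustration. The log-likelihood then reads:

```latex
\begin{align}
\ell(\theta, c) &= -\frac{1}{2} \sum_{i=1}^{N} \frac{\big( y_i - y_{\text{pred},i}(\theta) \big)^2}{c^2}
  \;-\; N \log c \;-\; \frac{N}{2} \log(2\pi), \\
\sigma = 10^{c_1} \quad &\Longrightarrow \quad -N \log \sigma = -N\, c_1 \log 10 .
\end{align}
```

With the exponentiated parametrization, the normalization term is linear in $c_1$, and $\sigma > 0$ holds for every real $c_1$, so no positivity constraint is needed on the error parameter.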


Since the maximum likelihood estimator for the variance is biased, a Bessel correction sqrt((length(ydata(DS))-length(params))/length(ydata(DS))) can be applied to the reciprocal error via the kwarg BesselCorrection.
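Written out, with the shorthand $N$ = length(ydata(DS)) and $p$ = length(params) introduced here for readability, the correction amounts to:

```latex
\sigma^{-1}_{\text{corrected}} = \sigma^{-1} \sqrt{\frac{N - p}{N}}
\quad\Longleftrightarrow\quad
\sigma^{2}_{\text{corrected}} = \sigma^{2}\, \frac{N}{N - p}
```

The estimated variance is thus inflated by the factor $N / (N - p)$, which reduces to the familiar $N / (N - 1)$ when a single parameter is fitted.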
