Advanced Datasets

Depending on the nature of the dataset that is to be analyzed, InformationGeometry.jl implements multiple data types in which it can be stored. Mainly, these data types provide a trade-off between speed / simplicity and flexibility / generality, as illustrated by the following table:

| Container | allows non-Gaussian y-uncertainty | allows missing values | allows x-uncertainty | allows mixed x-y uncertainty | allows y-uncertainty estimation | allows x-uncertainty estimation |
| --- | --- | --- | --- | --- | --- | --- |
| DataSet | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| DataSetExact | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| CompositeDataSet | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| GeneralizedDataSet | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| DataSetUncertain | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| UnknownVarianceDataSet | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ |
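
For comparison with the more specialized containers discussed below, the simplest container DataSet can be constructed directly from vectors of x-values, y-values and their $1\sigma$ uncertainties; a minimal sketch with purely illustrative numbers:

using InformationGeometry
# Four data points with xdim = 1, ydim = 1 and independent Gaussian 1σ uncertainties of 0.5
DS = DataSet([1,2,3,4], [4,5,6.5,7.8], 0.5*ones(4))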
InformationGeometry.DataSetExact (Type)
DataSetExact(x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(x::AbstractArray, Σ_x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(xd::Distribution, yd::Distribution, dims::Tuple{Int,Int,Int}=(length(xd),1,1))

A data container which allows for uncertainties in the independent variables, i.e. the $x$-variables. Moreover, the observed data is stored in terms of two probability distributions over the spaces $\mathcal{X}^N$ and $\mathcal{Y}^N$ respectively, which also allows for non-Gaussian uncertainties in the observations. For instance, the uncertainties associated with a given observation might follow a Cauchy, Student's t, log-normal or some other smooth distribution.

Examples:

using InformationGeometry, Distributions
# Distribution over 𝒳² for the two scalar x-values: one Gaussian, one Cauchy
X = product_distribution([Normal(0, 1), Cauchy(2, 0.5)])
# Correlated multivariate t-distribution (2 degrees of freedom) over 𝒴² for the two scalar y-values
Y = MvTDist(2, [3, 8.], [1 0.5; 0.5 3])
DataSetExact(X, Y, (2,1,1))
Note

Uncertainties in the independent $x$-variables are optional for DataSetExact and can be set to zero by wrapping the x-data in an InformationGeometry.Dirac "distribution". The following illustrates numerically equivalent ways of encoding a dataset whose uncertainties in the $x$-variables are zero:

using InformationGeometry, Distributions, LinearAlgebra
# x-data wrapped in a Dirac "distribution", y-data as an MvNormal with covariance Diagonal([0.1, 0.2].^2)
DS1 = DataSetExact(InformationGeometry.Dirac([1,2]), MvNormal([5,6], Diagonal([0.1, 0.2].^2)))
# Equivalent constructions which pass the 1σ y-uncertainties directly as a vector
DS2 = DataSetExact([1,2], [5,6], [0.1, 0.2])
DS3 = DataSet([1,2], [5,6], [0.1, 0.2])

where DS1 == DS2 == DS3 will evaluate to true.
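Nonzero x-uncertainties can be supplied through the four-argument constructor listed above; a minimal sketch, assuming that (as for DS2) plain vectors are interpreted as $1\sigma$ uncertainties:

using InformationGeometry
# Same data as above, but now with 1σ uncertainties of 0.05 on both x-values
DS4 = DataSetExact([1,2], [0.05, 0.05], [5,6], [0.1, 0.2])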

InformationGeometry.CompositeDataSet (Type)

The CompositeDataSet type is a more elaborate (and typically less performant) container for storing data. Essentially, it splits observed data which has multiple y-components into separate data containers (e.g. of type DataSet), each of which corresponds to one of the components of the y-data. Crucially, each of the smaller data containers still shares the same "kind" of x-data, that is, the same xdim, units and so on, although they do not need to contain exactly the same x-values.

The main advantage of this approach is that it can be applied when some observations have missing y-components. A typical use case for CompositeDataSets is time series in which multiple quantities are tracked but not every quantity is necessarily recorded at each time step. Example:

using InformationGeometry, DataFrames
t = [1,2,3,4]
y₁ = [2.5, 6, missing, 9];      y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4);               σ₂ = [missing, 0.2, 0.1, 0.5]
# Columns in alternating order: x, then y₁, σ₁, y₂, σ₂
df = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)

xdim = 1;   ydim = 2
CompositeDataSet(df, xdim, ydim; xerrs=false, stripedYs=true)

The boolean-valued keywords stripedXs and stripedYs can be used to indicate to the constructor whether the values and corresponding $1\sigma$ uncertainties are given in alternating order, or whether the first block of ydim columns contains the values and the second block of ydim columns contains the corresponding uncertainties. Additionally, xerrs=true can be used to indicate that the x-values also carry uncertainties. Essentially all functions which can be called on other data containers such as DataSet have been specialized to also work with CompositeDataSets.
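For instance, the same observations as above could also be passed with the y-values grouped in one block followed by a block of their uncertainties; a minimal sketch, assuming that stripedYs=false selects this block layout and reusing the vectors t, y₁, σ₁, y₂, σ₂ from the example above:

using InformationGeometry, DataFrames
# Block layout: x-column first, then both y-columns, then both uncertainty columns
df_block = DataFrame([t y₁ y₂ σ₁ σ₂], :auto)
CompositeDataSet(df_block, 1, 2; xerrs=false, stripedYs=false)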

InformationGeometry.GeneralizedDataSet (Type)
GeneralizedDataSet(dist::ContinuousMultivariateDistribution, dims::Tuple{Int,Int,Int}=(length(dist), 1, 1))

Data structure which can take general x-y-covariance into account, where dims=(Npoints, xdim, ydim) indicates the dimensionality of the data. dist should constitute a smooth distribution over the space $\mathcal{X}^N \times \mathcal{Y}^N$, where mean(dist) is interpreted as the concatenation of the (most likely values for the) observations $(x_1, ..., x_N, y_1, ..., y_N)$ and the width of dist specifies the uncertainty in the observations. Typically, dist is a multivariate Gaussian, but other distributions such as Cauchy or Student's t-distributions are also possible. Thus, arbitrary correlations between the dependent $y$-variables and the independent $x$-variables can be encoded.
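A minimal sketch for N = 2 data points with xdim = ydim = 1, where dist is a 4-dimensional Gaussian over $(x_1, x_2, y_1, y_2)$ and the off-diagonal blocks of its covariance matrix introduce correlations between the x- and y-values (all numbers chosen purely for illustration):

using InformationGeometry, Distributions
# mean(dist) = (x₁, x₂, y₁, y₂); the entries Σ[1,3] and Σ[2,4] correlate each xᵢ with its yᵢ
Σ = [0.01   0.0    0.004  0.0;
     0.0    0.01   0.0    0.004;
     0.004  0.0    0.04   0.0;
     0.0    0.004  0.0    0.04]
GeneralizedDataSet(MvNormal([1, 2, 5., 6.], Σ), (2, 1, 1))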

Note

If there is no correlation between the $x$- and $y$-variables (i.e. if the off-diagonal blocks of cov(dist) are zero), it can be more performant to use the type DataSetExact to encode the given data instead.

InformationGeometry.DataSetUncertain (Type)
DataSetUncertain(x::AbstractVector, y::AbstractVector, σ⁻¹::Function, c::AbstractVector)

The DataSetUncertain type encodes data for which the size of the variance is not known a priori, but is instead specified via an error model of the form σ(x, y_pred, c), where c is a vector of error parameters. This parametrized error model is subsequently used to estimate the standard deviations of the observations y.

Note

To enhance performance, the implementation actually requires the specification of a reciprocal error model, i.e. a function σ⁻¹(x, y_pred, c).

To construct a DataSetUncertain, one has to specify a vector of independent variables x, a vector of dependent variables y, a reciprocal error model σ⁻¹(x, y_pred, c) and an initial guess for the vector of error parameters c.

Examples:

In the simplest case, where all data points are mutually independent and each has a single $x$-component and a single $y$-component, a DataSetUncertain consisting of four points can be constructed via

# Constant uncertainty σ = 10^c[1] for every point, supplied as its reciprocal
DS = DataSetUncertain([1,2,3,4], [4,5,6.5,7.8], (x,y,c)->1/exp10(c[1]), [0.5])
Note

It is generally advisable to exponentiate error parameters, since they are penalized proportionally to log(c) in the normalization term of Gaussian likelihoods.
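As a hypothetical illustration of this convention, an error model combining an absolute and a y-proportional contribution, both parametrized on a log₁₀ scale, might be passed in reciprocal form as follows (the function name and parameter values are made up for this sketch):

using InformationGeometry
# σ(x, y, c) = 10^c[1] + 10^c[2] * |y|, supplied as its reciprocal
inverrmodel(x, y, c) = 1 / (exp10(c[1]) + exp10(c[2]) * abs(y))
DSrel = DataSetUncertain([1,2,3,4], [4,5,6.5,7.8], inverrmodel, [-1.0, -1.0])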

Note

Via the keyword argument BesselCorrection, a Bessel correction sqrt((length(ydata(DS))-length(params))/length(ydata(DS))) can be applied to the reciprocal error to account for the fact that the maximum likelihood estimator for the variance is biased.
