Advanced Datasets
Depending on the nature of the dataset being analyzed, InformationGeometry.jl provides several data types in which to store it. These types mainly trade speed and simplicity against flexibility and generality, as summarized in the following table:
Container | allows non-Gaussian y-uncertainty | allows missing values | allows x-uncertainty | allows mixed x-y uncertainty | allows y-uncertainty estimation | allows x-uncertainty estimation |
---|---|---|---|---|---|---|
DataSet | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
DataSetExact | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
CompositeDataSet | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
GeneralizedDataSet | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
DataSetUncertain | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
UnknownVarianceDataSet | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
InformationGeometry.DataSetExact - Type
DataSetExact(x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(x::AbstractArray, Σ_x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(xd::Distribution, yd::Distribution, dims::Tuple{Int,Int,Int}=(length(xd),1,1))
A data container which allows for uncertainties in the independent variables, i.e. the $x$-variables. Moreover, the observed data is stored in terms of two probability distributions over the spaces $\mathcal{X}^N$ and $\mathcal{Y}^N$ respectively, which also allows for non-Gaussian uncertainties in the observations. For instance, the uncertainties associated with a given observation might follow a Cauchy, Student's t, log-normal or some other smooth distribution.
Examples:
using InformationGeometry, Distributions
X = product_distribution([Normal(0, 1), Cauchy(2, 0.5)])
Y = MvTDist(2, [3, 8.], [1 0.5; 0.5 3])
DataSetExact(X, Y, (2,1,1))
Uncertainties in the independent $x$-variables are optional for `DataSetExact` and can be set to zero by wrapping the `x`-data in an `InformationGeometry.Dirac` "distribution". The following illustrates numerically equivalent ways of encoding a dataset whose uncertainty in the $x$-variables is zero:
using InformationGeometry, Distributions, LinearAlgebra
DS1 = DataSetExact(InformationGeometry.Dirac([1,2]), MvNormal([5,6], Diagonal([0.1, 0.2].^2)))
DS2 = DataSetExact([1,2], [5,6], [0.1, 0.2])
DS3 = DataSet([1,2], [5,6], [0.1, 0.2])
where `DS1 == DS2 == DS3` will evaluate to `true`.
InformationGeometry.CompositeDataSet - Type
The `CompositeDataSet` type is a more elaborate (and typically less performant) container for storing data. Essentially, it splits observed data with multiple `y`-components into separate data containers (e.g. of type `DataSet`), each of which corresponds to one of the components of the `y`-data. Crucially, all of the smaller data containers share the same "kind" of `x`-data, that is, the same `xdim`, units and so on, although they do not need to share exactly the same `x`-values.
The main advantage of this approach is that it can be applied when there are missing `y`-components in some observations. A typical use case for `CompositeDataSet`s is time series data where multiple quantities are tracked but not every quantity is recorded at every time step. Example:
using InformationGeometry, DataFrames
t = [1,2,3,4]
y₁ = [2.5, 6, missing, 9]; y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4); σ₂ = [missing, 0.2, 0.1, 0.5]
df = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)
xdim = 1; ydim = 2
CompositeDataSet(df, xdim, ydim; xerrs=false, stripedYs=true)
The boolean-valued keywords `stripedXs` and `stripedYs` indicate to the constructor whether the values and corresponding $1\sigma$ uncertainties are given in alternating order, or whether the first block of `ydim` columns holds the values and the second block of `ydim` columns holds the corresponding uncertainties. Also, `xerrs=true` indicates that the `x`-values carry uncertainties as well. Essentially all functions which can be called on other data containers such as `DataSet` have been specialized to also work with `CompositeDataSet`s.
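For comparison, the same measurements can also be arranged in the block ordering described above. The following is a minimal sketch, assuming that `stripedYs=false` selects this block layout (all `ydim` value columns first, then all `ydim` uncertainty columns), as the description of the keyword suggests. The names `df_striped`, `df_block`, `CDS_striped` and `CDS_block` are illustrative only.

```julia
using InformationGeometry, DataFrames

t  = [1, 2, 3, 4]
y₁ = [2.5, 6, missing, 9];   y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3 * ones(4);          σ₂ = [missing, 0.2, 0.1, 0.5]

# Striped layout: each value column is directly followed by its uncertainty column.
df_striped = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)
# Block layout: both value columns first, then both uncertainty columns.
df_block   = DataFrame([t y₁ y₂ σ₁ σ₂], :auto)

xdim = 1; ydim = 2
CDS_striped = CompositeDataSet(df_striped, xdim, ydim; xerrs=false, stripedYs=true)
# Assumption: stripedYs=false tells the constructor to expect the block layout.
CDS_block   = CompositeDataSet(df_block, xdim, ydim; xerrs=false, stripedYs=false)
```

Both data frames hold the same numbers; only the column order differs, so the keyword has to match the layout actually used.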
InformationGeometry.GeneralizedDataSet - Type
GeneralizedDataSet(dist::ContinuousMultivariateDistribution, dims::Tuple{Int,Int,Int}=(length(dist), 1, 1))
Data structure which can take general x-y-covariance into account, where `dims=(Npoints, xdim, ydim)` indicates the dimensionality of the data. `dist` should constitute a smooth distribution over the space $\mathcal{X}^N \times \mathcal{Y}^N$, where `mean(dist)` is interpreted as the concatenation of the (most likely values for the) observations $(x_1, \ldots, x_N, y_1, \ldots, y_N)$ and the width of `dist` specifies the uncertainty in the signal. Typically, `dist` is a multivariate Gaussian, but other distributions such as Cauchy or Student's t-distributions are also possible. Thus, arbitrary correlations between the dependent $y$ and independent $x$ variables can be encoded.
If there is no correlation between the $x$ and $y$ variables (i.e. if the off-diagonal blocks of `cov(dist)` are zero), it can be more performant to use the type `DataSetExact` to encode the given data instead.
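As an illustration of the constructor signature above, the following minimal sketch encodes two observations with `xdim = ydim = 1` whose $x$- and $y$-values are correlated. The numerical values of `μ` and `Σ` are made up for this example, and the ordering of the mean vector follows the concatenation $(x_1, x_2, y_1, y_2)$ described above.

```julia
using InformationGeometry, Distributions

# Two observations with xdim = ydim = 1, so the joint mean is (x₁, x₂, y₁, y₂).
μ = [1.0, 2.0, 5.0, 6.0]

# Joint covariance: the diagonal blocks hold the x- and y-variances,
# the off-diagonal blocks correlate each xᵢ with its own yᵢ.
Σ = [0.04 0.00 0.01 0.00;
     0.00 0.04 0.00 0.01;
     0.01 0.00 0.09 0.00;
     0.00 0.01 0.00 0.09]

# dims = (Npoints, xdim, ydim)
GDS = GeneralizedDataSet(MvNormal(μ, Σ), (2, 1, 1))
```

If the `0.01` entries were set to zero, the off-diagonal blocks would vanish and the same data could be stored in a `DataSetExact` instead.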
InformationGeometry.DataSetUncertain - Type
DataSetUncertain(x::AbstractVector, y::AbstractVector, σ⁻¹::Function, c::AbstractVector; BesselCorrection::Bool=false)
DataSetUncertain(x::AbstractVector, y::AbstractVector, σ⁻¹::Function, errorparamsplitter::Function, c::AbstractVector, dims::Tuple{Int,Int,Int}; BesselCorrection::Bool=false)
The `DataSetUncertain` type encodes data for which the size of the variance is unknown a priori but whose error is specified via an error model of the form `σ(x, y_pred, c)`, where `c` is a vector of error parameters. This parametrized error model is subsequently used to estimate the standard deviations in the observations `y`.
To enhance performance, the implementation actually requires the specification of a reciprocal error model, i.e. a function `σ⁻¹(x, y_pred, c)`. If `ydim` is larger than one, the reciprocal error model should output a matrix, i.e. the Cholesky decomposition `S` of the covariance `Σ` such that `Σ == S' * S`.
To construct a `DataSetUncertain`, one has to specify a vector of independent variables `x`, a vector of dependent variables `y`, a reciprocal error model `σ⁻¹(x, y_pred, c)` and an initial guess for the vector of error parameters `c`. Optionally, an explicit `errorparamsplitter` function of the form `θ -> (modelparams, errorparams)` may be specified, which splits the parameters into a tuple of model parameters, which are subsequently forwarded into the model, and error parameters `c`, which are only passed to the reciprocal error model `σ⁻¹`.
The parameters which are visible to the outside are processed by `errorparamsplitter` first, before being forwarded into the model, where `modelparams` might be further modified by embedding transformations.
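The splitting step can be sketched in plain Julia. The function `splitter` below is a hypothetical example, not part of the package: it assumes that the last entry of the parameter vector is the single error parameter and that all preceding entries belong to the model.

```julia
# Hypothetical splitter: the last entry of θ is the single error parameter,
# everything before it is passed on to the model.
splitter(θ::AbstractVector) = (θ[1:end-1], θ[end:end])

# Reciprocal error model with a constant, exponentiated error parameter.
σ⁻¹(x, y_pred, c) = 1 / exp10(c[1])

θ = [1.2, 0.7, 0.5]                      # two model parameters, one error parameter
modelparams, errorparams = splitter(θ)
# modelparams is what the model receives, errorparams is passed to σ⁻¹ as c.
σ⁻¹(1.0, 4.0, errorparams)
```

A function of this shape could then be supplied as the `errorparamsplitter` argument of the six-argument constructor listed above.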
Examples:
In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a `DataSetUncertain` consisting of four points can be constructed via
DS = DataSetUncertain([1,2,3,4], [4,5,6.5,7.8], (x,y,c)->1/exp10(c[1]), [0.5])
It is generally advisable to exponentiate error parameters, since they are penalized proportional to `log(c)` in the normalization term of Gaussian likelihoods. Via the keyword argument `BesselCorrection`, a Bessel correction `sqrt((length(ydata(DS))-length(params))/length(ydata(DS)))` can be applied to the reciprocal error to account for the fact that the maximum likelihood estimator for the variance is biased.
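To make the role of the normalization term explicit, consider the standard Gaussian log-likelihood for $N$ independent observations with a single $y$-component each. This is a textbook expression written in the notation of this page, not a transcription of the package's implementation; $y_{\text{pred},i}$ denotes the model prediction at $x_i$ and $\sigma_i = \sigma(x_i, y_{\text{pred},i}, c)$.

$$\ell(\theta) = -\sum_{i=1}^{N} \left[ \frac{(y_i - y_{\text{pred},i})^2}{2\sigma_i^2} + \log \sigma_i + \frac{1}{2}\log(2\pi) \right]$$

With a direct parametrization $\sigma_i = c_1$, the normalization term in $\ell$ equals $-N \log c_1$ and is only defined for $c_1 > 0$. With the exponentiated parametrization $\sigma_i = 10^{c_1}$ used in the example above, it becomes $-N c_1 \log 10$, which is linear in $c_1$ and defined for every real $c_1$. Writing $N$ for `length(ydata(DS))` and $p$ for `length(params)`, the Bessel correction quoted above is the factor $\sqrt{(N-p)/N}$.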