Providing Datasets
Typically, one of the most difficult parts of any data science problem is to bring the data into a form which lends itself to the subsequent analysis. This section aims to describe the containers used by InformationGeometry.jl to store datasets and models in detail.
The data itself is stored using the DataSet
container.
InformationGeometry.DataSet
— TypeThe DataSet
type is a versatile container for storing data. Typically, it is constructed by passing it three vectors x
, y
, sigma
where the components of sigma
quantify the standard deviation associated with each y-value. Alternatively, a full covariance matrix can be supplied for the ydata
instead of a vector of standard deviations. The contents of a DataSet
DS
can later be accessed via xdata(DS)
, ydata(DS)
, ysigma(DS)
.
Examples:
In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a DataSet
consisting of four points can be constructed via
DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
or alternatively by
using LinearAlgebra
DataSet([1,2,3,4], [4,5,6.5,7.8], Diagonal([0.5,0.45,0.6,0.8].^2))
where the diagonal covariance matrix in the second line is equivalent to the vector of standard deviations supplied in the first line.
For measurements with multiple components, it is also possible to enter them as a Matrix
where the columns correspond to the respective components.
DataSet([0, 0.5, 1], [1 100; 2 103; 3 108], [0.5 8; 0.4 5; 0.6 10])
Note that if the uncertainty matrix is square, it may be falsely interpreted as a covariance matrix instead of as the columnwise specification of standard deviations.
More generally, if a dataset consists of $N$ points where each $x$-value has $n$ many components and each $y$-value has $m$ many components, this can be specified to the DataSet
constructor via a tuple $(N,n,m)$ in addition to the vectors x
, y
and the covariance matrix. For example:
X = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1]
Y = [1.0, 5.0, 4.0, 8.0, 9.0, 13.0, 16.0, 20.0]
Cov = Diagonal([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 4.0])
dims = (4, 3, 2)
DS = DataSet(X, Y, Cov, dims)
In this case, X
is a vector consisting of the concatenated x-values (with 3 components each) for 4 different data points. The values of Y
are the corresponding concatenated y-values (with 2 components each) of said 4 data points. Clearly, the covariance matrix must therefore be a positive-definite $(m \cdot N) \times (m \cdot N)$ matrix.
To complete the specification of an inference problem, a model function which is assumed to be able to capture the relationship which is inherent in the data must be added.
InformationGeometry.DataModel
— TypeIn addition to storing a DataSet
, a DataModel
also contains a function model(x,θ)
and its derivative dmodel(x,θ)
where x
denotes the x-value of the data and θ
is a vector of parameters on which the model depends. Crucially, dmodel
contains the derivatives of the model with respect to the parameters θ
, not the x-values. For example
DS = DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
model(x::Number, θ::AbstractVector{<:Number}) = θ[1] * x + θ[2]
DM = DataModel(DS, model)
In cases where the output of the model has more than one component (i.e. ydim > 1
), it is advisable to define the model function in such a way that it outputs static vectors using StaticArrays.jl for increased performance. For ydim = 1
, InformationGeometry.jl expects the model to output a number instead of a vector with one component. In contrast, the parameter configuration θ
must always be supplied as a vector (even if it only has a single component).
An initial guess for the maximum likelihood parameters can optionally be passed to the DataModel
as a vector via
DM = DataModel(DS, model, [1.0,2.5])
During the construction of a DataModel
process which includes the search for the maximum likelihood estimate $\theta_\text{MLE}$, multiple tests are run. If necessary, the maximum likelihood estimation and subsequent tests are both skipped by appending true
as the last argument in the constructor or by using the respective kwargs SkipTests=false
or SkipOptim=false
:
DM = DataModel(DS, model, [-Inf,π,1], true)
If a DataModel
is constructed as shown in the above examples, the gradient of the model with respect to the parameters θ
(i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:
using StaticArrays
function dmodel(x::Number, θ::AbstractVector{<:Number})
@SMatrix [x 1.] # ∂(model)/∂θ₁ and ∂(model)/∂θ₂
end
DM = DataModel(DS, model, dmodel)
The output of the Jacobian must be a matrix whose columns correspond to the partial derivatives with respect to different components of θ
and whose rows correspond to evaluations at different components of x
. Again, although it is not strictly required, outputting the Jacobian in form of a static matrix is typically beneficial for the overall performance.
It is also possible to specify a (logarithmized) prior distribution on the parameter space to the DataModel
constructor after the initial guess for the MLE. For example:
using Distributions
Dist = MvNormal(ones(2), [1 0; 0 3.])
LogPriorFn(θ) = logpdf(Dist, θ)
DM = DataModel(DS, model, [1.0,2.5], LogPriorFn)
The DataSet
contained in a DataModel
named DM
can be accessed via Data(DM)
, whereas the model and its Jacobian can be used via Predictor(DM)
and dPredictor(DM)
respectively. The MLE and the value of the log-likelihood at the MLE are accessible via MLE(DM)
and LogLikeMLE(DM)
. The logarithmized prior can be accessed via LogPrior(DM)
.
"Simple" DataSet
s and DataModel
s can be visualized directly via plot(DM)
using pre-written recipes for the Plots.jl package.
A more structured container for models is ModelMap
, which additionally stores information about custom parameter bounds, parameter names, arbitrary nonlinear constraints, etc.:
InformationGeometry.ModelMap
— TypeModelMap(Map::Function, InDomain::Union{Nothing,Function}, Domain::HyperCube; startp::AbstractVector)
ModelMap(Map::Function, InDomain::Function, xyp::Tuple{Int,Int,Int})
A container which stores additional information about a model map, in particular its domain of validity. Map
is the actual map (x,θ) -> model(x,θ)
. Domain
is a HyperCube
which allows one to roughly specify the ranges of the various parameters. For more complicated boundary constraints, a function InDomain(θ)
can be specified, for which all outputted components should be .≥ 0 on the valid parameter domain. Alternatively, InDomain
may also be a bool-valued function, evaluating to true
in admissible parts of the parameter domain.
The kwarg startp
may be used to pass a suitable parameter vector for the ModelMap.
A Bool
-valued function which returns true
in the valid domain also fits this description, which allows one to easily combine multiple constraints. Providing this information about the domain can be advantageous in the optimization process for complicated models.