Serialization, loading and saving of zfit objects#

The long-term goal is to be able to save and load zfit objects, such as models, spaces, parameters, etc. This is not yet fully implemented, but some parts are already available, some stable, some more experimental.

Overview:

  • Binary (pickle) loading and dumping of (frozen) FitResult is fully available

  • Human-readable serialization (also summarized under HS3):

      • parameters and models are available, but not yet stable

      • losses and datasets are not yet available

import os
import pathlib
import pickle
from pprint import pprint

import mplhep
import numpy as np
import zfit
import zfit.z.numpy as znp
from matplotlib import pyplot as plt
from zfit import z

mu = zfit.Parameter("mu", 1.2, -4, 5)
sigma = zfit.Parameter("sigma", 3, 0, 10)
obs = zfit.Space("obs1", limits=(-10, 20))
model = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)
data = model.sample(10000)
loss = zfit.loss.UnbinnedNLL(model=model, data=data)
minimizer = zfit.minimize.Minuit()
x = np.linspace(*obs.limit1d, 1000)
mu.set_value(1.5)
sigma.set_value(2)
mplhep.histplot(data.to_binned(50).to_hist(), density=True, label="data")
plt.plot(x, model.pdf(x), label="model pre fit")
result = minimizer.minimize(loss)
plt.plot(x, model.pdf(x), label="model post fit")
plt.legend()
[Figure: binned data histogram overlaid with the model pdf before and after the fit]
result.freeze()
dumped_result = pickle.dumps(result)
loaded_result = pickle.loads(dumped_result)
mu.set_value(0.42)
print(f"mu before: {mu.value()}")
zfit.param.set_values(params=model.get_params(), values=loaded_result)
print(f"mu after: {mu.value()}, set to result value: {loaded_result.params[mu]['value']}")
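Beyond in-memory dumps, a frozen result can also be written to disk like any picklable Python object. A minimal stdlib-only sketch (using a plain dict as a stand-in for the frozen FitResult; with zfit, call result.freeze() first and pickle the result object itself):

```python
import pathlib
import pickle

# Stand-in for a frozen FitResult: any picklable object is handled the same way.
frozen_result = {"params": {"mu": {"value": 1.2}, "sigma": {"value": 3.0}}}

path = pathlib.Path("result.pickle")
path.write_bytes(pickle.dumps(frozen_result))  # dump to disk
loaded = pickle.loads(path.read_bytes())       # load it back

assert loaded == frozen_result
path.unlink()  # cleanup
```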

Human-readable serialization (HS3)#

WARNING: this section is unstable. Apart from dumping for publication on an “if it works, great” basis, everything else is recommended for power users only and will surely break in the future.

HS3 is the “hep-statistics-serialization-standard”, which is currently under development and aims to provide a human-readable serialization format for loading and dumping likelihoods. Neither the standard nor its implementation in zfit is stable (and zfit currently deviates from the standard in some respects).

We can either dump objects in the library directly, or create a complete dump to an HS3-like format.

model.to_dict()
mu.to_dict()
obs.to_dict()

Recreate the object#

We can also recreate the object from the dictionary. As a simple example, let’s do this for the model.

gauss2 = model.from_dict(model.to_dict())  # effectively creates a copy (parameters are shared!)
gauss2

This is a bit of a cheat, since we used the model itself to call from_dict (or, more generally, the from_* methods). In general, we currently need to know the class of the object in order to convert it back (this is not the case for the HS3 dumping below).

gauss3 = zfit.pdf.Gauss.from_dict(model.to_dict())
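The pattern behind to_dict/from_dict can be sketched with a toy class (this is not zfit's actual implementation): the dict stores the constructor arguments, and from_dict is a classmethod, which is why the target class must be known to reconstruct the object.

```python
class ToyGauss:
    """Toy stand-in illustrating the to_dict/from_dict round trip."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def to_dict(self):
        # zfit additionally stores type metadata and nested objects
        return {"mu": self.mu, "sigma": self.sigma}

    @classmethod
    def from_dict(cls, dct):
        return cls(mu=dct["mu"], sigma=dct["sigma"])


toy = ToyGauss(mu=1.2, sigma=3.0)
toy2 = ToyGauss.from_dict(toy.to_dict())  # effectively a copy
assert (toy2.mu, toy2.sigma) == (1.2, 3.0)
```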

Dumping and loading#

These representations can be converted to anything JSON/YAML-like. In fact, the objects already offer some conversion methods out of the box.

sigma.to_json()
sigma.to_yaml()

Serializing large datasets#

We can also serialize data objects. However, datasets can be large (e.g. millions of events) and are therefore not suitable for plain-text storage (which typically requires about a factor of 10 more space). Instead, we can use the to_asdf method to store the data in a binary format: any numpy array is stored as binary, while only a reference is kept in the human-readable part.

data.to_dict()

As we can see, naturally the whole data array is saved. Trying to convert this to JSON or YAML will fail as these dumpers by default cannot handle numpy arrays (one could convert the numpy arrays to lists, but the problem with space will remain).

try:
    data.to_json()
except TypeError as error:
    print(error)
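To illustrate the space overhead of plain text mentioned above, a small stdlib-plus-numpy comparison (the exact ratio depends on the values and the dumper):

```python
import json

import numpy as np

arr = np.random.default_rng(0).normal(size=10_000)

binary_size = arr.nbytes                   # 8 bytes per float64
text_size = len(json.dumps(arr.tolist()))  # decimal text representation

print(f"binary: {binary_size} B, JSON text: {text_size} B, "
      f"ratio: {text_size / binary_size:.1f}x")
assert text_size > binary_size
```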

Let’s follow the advice!

data_asdf = data.to_asdf()
data_asdf

ASDF format#

The ASDF format stands for Advanced Scientific Data Format. It is a mixture of YAML and binary blocks and can store arbitrary data, including numpy arrays, pandas DataFrames, astropy tables, etc.

Two attributes are convenient to know:

  • tree: returns the dict representation of the data

  • write_to(path): writes the data to a file in path

data_asdf.tree
data_asdf.write_to("data.asdf")  # Will create a file in the current directory

We can inspect the file using the head command to print out the first 25 lines (out of about 471 in total). As we can see, the file begins with a YAML representation of the data, while the rest is a binary block (which shows up as garbled characters). The file is therefore only partially human-readable, but can be loaded by any ASDF library.

!head -25 data.asdf
!wc -l data.asdf  # the file is about 471 lines long, filled with binary

Loading can be done using the asdf library directly too.

import asdf

with asdf.open("data.asdf") as f:
    tree = f.tree
    data = zfit.Data.from_asdf(f)
data.value()
# cleanup of the file
pathlib.Path("data.asdf").unlink()

HS3 serialization#

To convert our objects into a HS3-like format, we can use the following functions. The format is not yet stable and will change in the future.

It is therefore recommended on a try-it-out basis: if it works, great; if it errors, fine. Don't expect it to be loadable again in the future, but if it works, it's nice for publication.

Objects#

We can serialize the objects themselves: PDFs, spaces, etc. The difference from the to_dict serialization mentioned above is that the HS3 serialization is more verbose and contains more information, such as metadata and fields for other objects (e.g. the parameters of a PDF). It also fills in some of the fields by extracting the information from the object.

zfit.hs3.dumps(model)
hs3obj = zfit.hs3.loads(zfit.hs3.dumps(model))
list(hs3obj['distributions'].values())

Publishing#

While the format is constantly being improved, a likelihood created with it can in principle be published, perhaps alongside the paper. If we want to omit the data and only publish the model, we can create an HS3 object from the pdf instead of the likelihood.

hs3dumped = zfit.hs3.dumps(model)
pprint(hs3dumped)
hs3dumped = zfit.hs3.dumps(loss)
pprint(hs3dumped)
hs3dumped
zfit.hs3.loads(hs3dumped)