OpenPathSampling Storage
from __future__ import print_function
# if our large test file is available, use it. Otherwise, use file generated from toy_mstis_2_run.ipynb
import os
test_file = "../toy_mstis_1k_OPS1.nc"
filename = test_file if os.path.isfile(test_file) else "mstis.nc"
print("Using file `"+ filename + "` for analysis")
An introduction to Storage¶
Introduction¶
All we need is contained in the openpathsampling package.
import openpathsampling as paths
The storage itself is mainly a netCDF file and can also be used as such. Technically it is a subclass of netCDF4.Dataset, so all of its functions are available in case we want to add additional tables to the file besides what we store using stores. You can of course also add new stores to the storage. Using Storage() will automatically create the set of needed stores when a new file is created.
netCDF files are very generic, while our Storage is tuned to our needs. It has native support for simtk.units and can recursively store nested objects using JSON pickling. But we will get to that.
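Since the file is plain netCDF underneath, we can also peek at the raw layer directly. This is purely a sanity check and assumes the netCDF4 Python package is available:
import netCDF4
# open the raw file read-only and look at the low-level dimensions behind the stores
with netCDF4.Dataset(filename, mode='r') as raw:
    print(list(raw.dimensions.keys())[:10])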
Open the output file (selected above) to have something to work with
storage = paths.Storage(filename, mode='r')
storage
save_storage = paths.Storage("tmp_storage.nc", mode='w')  # a second, writable file used for the saving examples below
and have a look at what stores are available
print(storage.list_stores())
cv = storage.cvs['opA']  # load a stored collective variable by name
cv(storage.trajectories[0])[0:10]  # evaluate it on the first stored trajectory and show the first 10 values
and we can access all of these using
snapshot_store = storage.snapshots
Stores are lists¶
In general it is useful to think about the storage as a set of lists. Each of these lists contains objects of the same type, e.g. Sample, Trajectory, Ensemble, Volume, ... The class instances used to access elements from the storage are called stores. Imagine you go into a store to buy and sell objects (luckily our stores are free). All the stores share the same storage space, which is a netCDF file on disc.
Still, a store is not really a list or subclassed from a list, but it almost acts like one.
print('We have %d snapshots in our storage' % len(storage.snapshots))
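The same list-like behaviour holds for the other stores as well; as a small illustration, using only stores that appear elsewhere in this notebook:
# every store reports its length just like a list
for store_name in ['trajectories', 'samples', 'ensembles', 'volumes']:
    print('%-14s : %d entries' % (store_name, len(getattr(storage, store_name))))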
Loading objects¶
In the same way we access lists, we can access stores using indexing and slicing, and even with lists of indices.
Load by slicing
print(storage.samples[2:4])
Load by list of indices
print(storage.ensembles[[0,1,3]])
Saving objects¶
Saving is somewhat special, since we try to deal exclusively with immutable objects. That means that once an object is saved, it cannot be changed. This is not completely true, since the netCDF file allows changes, but we try not to make them. The only exception are collective variables: these can store their cached values, and since we want to keep intermediate results, we add new values once they have been computed. This should remain the only exception, and we use the .sync command to update the stored values of an already saved collective variable.
Saving is easy. Just use .save on the store
# storage.samples.save(my_sample)
and it will add the object to the end of our store list, or do nothing if the object has already been stored. It is important to note that each object knows whether it has been stored already. This allows us to write nice recursive saving without worrying about saving the same object several times.
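A small commented sketch of this behaviour (my_sample is hypothetical here, just like in the surrounding cells):
# save_storage.samples.save(my_sample)
# save_storage.samples.save(my_sample)   # second call is a no-op: the object is already stored
# print(len(save_storage.samples))       # length unchanged by the repeated save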
You can also store directly using the storage. Both are fine, and the storage just delegates the task to the appropriate store.
# storage.save(my_sample)
For completeness, you can also use __setitem__ to save, but since you cannot explicitly set the number, you have to use None as the key; Ellipsis, i.e. ..., works as well.
# storage.samples[None] = my_sample
# storage.samples[Ellipsis] = my_sample
# storage.samples[...] = my_sample
I mentioned recursive saving. This does the following: imagine a snapshot, which itself has a Configuration and a Momentum object. If you store the snapshot, it also stores its contents using the appropriate stores. This can be arbitrarily complex. Most objects can either be stored in a special way or get converted into a JSON string that we can turn back into an object. Python has something like this built in (pickle), which works similarly, but we needed something that adds the recursive storage connection and uses JSON. If you are curious, the JSON string can be accessed for some objects using .json, but it is only available for loaded or saved objects. It will not be computed unless it is used.
volume = storage.volumes[1]
print(storage.repr_json(volume))
Naming¶
Load by name
print(storage.volumes['A'].name)
Equivalent to using .find(<name>)
print(storage.volumes.find('A').name)
Names can exist multiple times. To find all of them, use .find_all(), which returns a list of objects.
print(storage.volumes.find_all('A'))
To get indices only use .find_indices(<name>)
print(storage.volumes.find_indices('A'))
A look at the internal list of names
storage.volumes.name_idx
Objects can also be saved by name. To do this we need new objects that we can actually save; objects that were loaded cannot be saved again, of course.
empty_ensemble = paths.EmptyEnsemble()
full_ensemble = paths.EmptyEnsemble()
Now store it as you would set an entry in a dictionary
len(storage.ensembles)
save_storage.ensembles['empty'] = empty_ensemble
print(save_storage.ensembles.index[empty_ensemble.__uuid__])
storage.variables['ensembles_json'][70:80]  # peek at the raw netCDF variable backing the ensembles store
Alternatively you can pass the name directly to .save.
save_storage.ensembles.save(full_ensemble, 'full')
And the ensemble now has its .name property set
print(empty_ensemble.name)
print(full_ensemble.name)
tags¶
There is a special store named tag in the storage. It can be used to reference any object, and mostly serves to name things for easy access later.
print(len(storage.tag))
for t in storage.tag:
print(t)
storage.tag.keys()
for name, obj in storage.tag.iteritems():
print('{:20s} : {:20s}'.format(name, obj.__class__.__name__))
dict(storage.tag)
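As a hedged sketch (the tag name is made up, and the lines are commented out because our analysis storage was opened read-only), tagging an object on a writable storage works like a dictionary assignment, analogous to the named ensemble assignment shown earlier:
# save_storage.tag['my_favourite_ensemble'] = empty_ensemble
# print(save_storage.tag['my_favourite_ensemble'])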
Indexing¶
Each loaded object can be mapped back to its index in a specific storage via idx. This is necessary since we can - in theory - store an object in several different stores at once, and these might have different indices. As the cells below show, idx is called with the storage (or a specific store) as its argument.
samp = storage.samples[0]
print(storage.idx(samp))
print(samp.idx(storage))
print(samp.idx(storage.samples))
Iterators¶
A list is iterable and so is a store. Let's iterate over all ensembles and list a few of their names
[ens.name for ens in storage.ensembles][2:4]
Maybe you have noticed that some commands run slower the first time. This is because we use caching: once an object is loaded it stays in memory and can be accessed much faster.
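A rough way to see the cache at work (purely illustrative; timings depend on your machine, and the snapshot index is chosen arbitrarily):
import time
idx = len(storage.snapshots) // 2   # pick a snapshot that has probably not been loaded yet
t0 = time.time()
snap = storage.snapshots[idx]       # first access: read from disk
t1 = time.time()
snap = storage.snapshots[idx]       # second access: served from the in-memory cache
t2 = time.time()
print('first load : %.6f s' % (t1 - t0))
print('second load: %.6f s' % (t2 - t1))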
Searching for objects¶
One way to find objects is by name, as mentioned before, but in general there are no dedicated search functions. Instead, we use ordinary Python to load what we need; list comprehensions are the magic words. Say we want all snapshots that lie inside a given state (a Volume). We could load all snapshots and filter them afterwards, but there is a more elegant way to write this in Python, because the underlying code does just that anyway.
stA = storage.volumes['A']
first_5000_snaps = storage.snapshots[0:5000]
snapshots_in_A = [snapshot for snapshot in first_5000_snaps if stA(snapshot)]
print('We found %d snapshots in state A among %d total snapshots' % (len(snapshots_in_A), len(first_5000_snaps)))
Let's do something more useful: for a TIS ensemble we want statistics on the path lengths of the sampled trajectories, i.e. of the Sample objects that were sampled for a specific ensemble. And we only want samples that were generated in our production runs and are present in a SampleSet.
TODO: add a way to select only specific SampleSets
print(storage.samplesets[0])
my_network = storage.networks[0]
my_ensemble = my_network.sampling_ensembles[0]
relevant_samples = [
    sample
    for sample_set in storage.samplesets
    for sample in sample_set
    if sample.ensemble is my_ensemble
]
print(len(relevant_samples))
and finally compute the average length
list_of_path_lengths = [
    len(sample.trajectory)
    for sample_set in storage.samplesets
    for sample in sample_set
    if sample.ensemble is my_ensemble
]
print(list_of_path_lengths)
if len(list_of_path_lengths) > 0:
    mean = float(sum(list_of_path_lengths)) / len(list_of_path_lengths)
else:
    mean = 0.0  # actually, it is not defined, so we just set it to zero
print(mean)
All right, we loaded from a bootstrapping sampling run, so the analysis itself is not very meaningful, but it is still remarkably short considering what we just did.
Generator expression¶
There is another very cool Python feature worth noting: generator expressions. Before, we used list comprehensions to generate the full list of what we need, but what if we don't want the whole list at once? Maybe that is impossible because it would use too much memory, or simply not desirable. We can do the same thing as above using a generator (although it would only really pay off if we had to average over billions of samples). So assume the list of lengths is too large for memory. The summing does not mind receiving one piece at a time, so we construct an object that always gives us the next element. Such objects are called iterators, and there is a syntactic shortcut to create them: instead of the square brackets of a list comprehension, use round brackets. The example above then looks like this
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets
    for sample in sample_set
    if sample.ensemble is my_ensemble
)
total = float(sum(iterator_over_path_lengths))
print(total)
Note that we now have a generator and no computed values yet. If we iterate over it, it yields one value at a time, so we can pass it to sum as we did before. One important thing to note: once a generator is consumed, it is exhausted and cannot simply be run again, so we need to recreate it before using it a second time.
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets
    for sample in sample_set
    if sample.ensemble is my_ensemble
)
total = 0
count = 0
for length in iterator_over_path_lengths:
    total += length
    count += 1
if count > 0:
    mean = float(total) / count
else:
    mean = 0.0  # actually, it is not defined, so we just set it to zero
print(mean)
Voilà, this time without computing all lengths beforehand!
A last example that is interesting: statistics on acceptance. Each sample knows which mover was involved in its creation; this is stored as .details.mover in the sample's .details attribute. Let us look at forward shooting moves only
ff_movers = [mover for mover in storage.pathmovers if type(mover) == paths.ForwardShootMover]
ff_movers
if len(ff_movers) > 2:
    mover = ff_movers[2]
    print("Use a '%s' for ensemble(s) '%s'" % (mover.cls, mover.ensemble.name))
A little trick worth noting: when a generator expression is written directly inside a function call, the resulting iterator is passed straight to the function, for example to str.join, without building an intermediate list.
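As a small illustration of that trick, using the ensembles store we already iterated over above:
# join consumes the generator expression one name at a time; no intermediate list is built
print(', '.join(str(ens.name) for ens in storage.ensembles[0:3]))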
Now to get statistics on acceptances
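The original code for this step is not included here; as a hedged sketch of the idea (assuming the relevant samples carry the .details.mover attribute mentioned above, and skipping samples that do not), we could count how often each mover type appears among the stored samples:
from collections import Counter
# count stored samples per mover class name, guarding against missing details
mover_type_counts = Counter(
    sample.details.mover.__class__.__name__
    for sample in storage.samples
    if getattr(sample, 'details', None) is not None
    and getattr(sample.details, 'mover', None) is not None
)
print(mover_type_counts)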