OpenPathSampling Storage

In [1]:
from __future__ import print_function
# if our large test file is available, use it. Otherwise, use file generated from toy_mstis_2_run.ipynb
import os
test_file = "../toy_mstis_1k_OPS1.nc"
filename = test_file if os.path.isfile(test_file) else "mstis.nc"
print("Using file `"+ filename + "` for analysis")
Using file `../toy_mstis_1k_OPS1.nc` for analysis

An introduction to Storage

Introduction

All we need is contained in the openpathsampling package

In [2]:
import openpathsampling as paths

The storage itself is mainly a netCDF file and can also be used as such. Technically it is a subclass of netCDF4.Dataset, so all of its functions are available in case we want to add additional tables to the file besides what we store using stores. You can of course also add new stores to the storage. Creating a new file with Storage() automatically sets up the full set of required stores.

netCDF files are very generic, while our Storage is tuned to our needs. It has, for example, native support for simtk.unit quantities, and it can recursively store nested objects using JSON pickling. But we will get to that.
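As a small illustration of the netCDF side: once a storage is open, the raw netCDF variables behind the stores remain accessible. A minimal sketch, reusing the filename from the first cell (the exact variable names depend on what the file contains):

In [ ]:
# peek at the raw netCDF layer behind a storage (sketch)
raw = paths.Storage(filename, mode='r')
print(list(raw.variables.keys())[:5])  # raw netCDF variables backing the stores
raw.close()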

Open the output from the toy_mstis_2_run.ipynb notebook (or the larger test file, if present) to have something to work with

In [3]:
storage = paths.Storage(filename, mode='r')
storage
Out[3]:
Storage @ '../toy_mstis_1k_OPS1.nc'
In [4]:
save_storage = paths.Storage("tmp_storage.nc", mode='w')

and have a look at what stores are available

In [5]:
print(storage.list_stores())
['stores', 'trajectories', 'topologies', 'snapshots', 'samples', 'samplesets', 'movechanges', 'steps', 'details', 'pathmovers', 'shootingpointselectors', 'engines', 'pathsimulators', 'transitions', 'networks', 'schemes', 'interfacesets', 'msouters', 'volumes', 'ensembles', 'tag', 'attributes', 'snapshot0', 'cv0', 'cv1', 'cv2']
In [6]:
cv = storage.cvs['opA']
In [7]:
cv(storage.trajectories[0])[0:10]
Out[7]:
[0.18995551764965057,
 0.2078997790813446,
 0.22507938742637634,
 0.24157929420471191,
 0.25697261095046997,
 0.27420175075531006,
 0.2925931513309479,
 0.3085809350013733,
 0.3231964707374573,
 0.3367229104042053]

and we can access all of these using

In [8]:
snapshot_store = storage.snapshots

Stores are lists

In general it is useful to think about the storage as a set of lists. Each of these lists contains objects of the same type, e.g. Sample, Trajectory, Ensemble, Volume, ... The class instances used to access elements from the storage are called stores. Imagine you go into a store to buy and sell objects (luckily our stores are free). All the stores share the same storage space, which is a netCDF file on disk.

Still, a store is not really a list or subclassed from a list, but it almost acts like one.

In [9]:
print('We have %d snapshots in our storage' % len(storage.snapshots))
We have 51332 snapshots in our storage

Loading objects

Just as with lists, we can access a store using slicing, and even using lists of indices.

Load by slicing

In [10]:
print(storage.samples[2:4])
[<Sample @ 0x11c124c18>, <Sample @ 0x11c154278>]

Load by list of indices

In [11]:
print(storage.ensembles[[0,1,3]])
[<openpathsampling.ensemble.TISEnsemble object at 0x11c1328d0>, <openpathsampling.ensemble.IntersectionEnsemble object at 0x11caa25c0>, <openpathsampling.ensemble.LengthEnsemble object at 0x11caa2630>]

Saving objects

Saving is somewhat special, since we try to deal exclusively with immutable objects: once an object is saved, it cannot be changed. (This is not strictly true, since the netCDF file itself allows changes, but we try not to make them.) The only exception are collective variables: these store their cached values, and since we want to keep intermediate results, we add new values once they have been computed. This should remain the only exception, and we use the .sync command to update the stored state of an already saved collective variable.
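As a sketch of that exception (kept commented out, like the saving examples below; the CV name, the trajectory, and the store-level sync call are illustrative assumptions):

In [ ]:
# sketch: update the stored cache of an already saved collective variable
# cv = save_storage.cvs['opA']   # assumes such a CV was saved to save_storage
# cv(some_trajectory)            # computing values fills the in-memory cache
# save_storage.cvs.sync(cv)      # write the cached values back to disk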

Saving is easy. Just use .save on the store

In [12]:
# storage.samples.save(my_sample)

and it will add the object to the end of our store list, or do nothing if the object has already been stored. It is important to note that each object knows whether it has been stored already. This allows us to write nice recursive saving routines without worrying that we might save the same object several times.
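A minimal sketch of this idempotence, using a throwaway file so the tutorial's storages stay untouched (the filename is made up):

In [ ]:
tmp = paths.Storage("tmp_idempotent.nc", mode='w')
ens = paths.EmptyEnsemble()
uuid_first = tmp.ensembles.save(ens)
uuid_second = tmp.ensembles.save(ens)  # already stored: nothing new is written
assert uuid_first == uuid_second       # save returns the object's uuid
tmp.close()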

You can also store directly using the storage. Both are fine; the storage just delegates the task to the appropriate store.

In [13]:
# storage.save(my_sample)

For completeness, you can also use __setitem__ to save, but since you cannot explicitly set the index, you have to use None as the key; Ellipsis, written ..., also works.

In [14]:
# storage.samples[None] = my_sample
# storage.samples[Ellipsis] = my_sample
# storage.samples[...] = my_sample

I mentioned recursive saving. This works as follows: imagine a snapshot which itself contains a Configuration and a Momentum object. If you store the snapshot, its contents are also stored, using the appropriate stores. This can be arbitrarily complex. Most objects can either be stored in a special way or be converted into a JSON string from which we can reconstruct the object. Python's pickle module works similarly, but we needed something that adds the recursive storage connection and uses JSON. If you are curious, the JSON string can be accessed for some objects using .json, but only for loaded or saved objects; it will not be computed unless it is used.

In [15]:
volume = storage.volumes[1]
print(storage.repr_json(volume))
{"_cls":"UnionVolume","_dict":{"volume1":{"_hex_uuid":"0x833357f8dd4b11e7b495000000000022","_store":"volumes"},"volume2":{"_hex_uuid":"0x833357f8dd4b11e7b495000000000024","_store":"volumes"}}}

Naming

Load by name

In [16]:
print(storage.volumes['A'].name)
A

Equivalent to using .find(<name>)

In [17]:
print(storage.volumes.find('A').name)
A

Names can exist multiple times. To find all matches, use .find_all(), which returns a list of objects.

In [18]:
print(storage.volumes.find_all('A'))
[<openpathsampling.volume.CVDefinedVolume object at 0x11c132dd8>]

To get only the indices, use .find_indices(<name>)

In [19]:
print(storage.volumes.find_indices('A'))
[0]

A look at the internal list of names

In [20]:
storage.volumes.name_idx
Out[20]:
{'A': {0},
 'all states except A': {1},
 'B': {2},
 'C': {3},
 'all states except B': {13},
 'all states except C': {23}}

Objects can be saved by name. To do this we need new objects that we can actually save; everything we loaded above has, of course, been stored already.

In [21]:
empty_ensemble = paths.EmptyEnsemble()
full_ensemble = paths.EmptyEnsemble()

Now store it as you would set a dictionary entry

In [22]:
len(storage.ensembles)
Out[22]:
157
In [23]:
save_storage.ensembles['empty'] = empty_ensemble
In [24]:
print(save_storage.ensembles.index[empty_ensemble.__uuid__])
0
In [25]:
storage.variables['ensembles_json'][70:80]
Out[25]:
array(['{"_cls":"TISEnsemble","_dict":{"min_overlap":{"_tuple":[0,0]},"initial_states":[{"_hex_uuid":"0x833357f8dd4b11e7b495000000000024","_store":"volumes"}],"final_states":[{"_hex_uuid":"0x9ad8903add4b11e7a11600000000042e","_store":"volumes"}],"max_overlap":{"_tuple":[0,0]},"interface":{"_hex_uuid":"0x9ad8903add4b11e7a1160000000002ba","_store":"volumes"},"orderparameter":{"_hex_uuid":"0x833357f8dd4b11e7b49500000000001e","_store":"attributes"},"lambda_i":null,"ensembles":[{"_hex_uuid":"0x9ad8903add4b11e7a116000000000458","_store":"ensembles"},{"_hex_uuid":"0x9ad8903add4b11e7a116000000000460","_store":"ensembles"},{"_hex_uuid":"0x9ad8903add4b11e7a116000000000468","_store":"ensembles"}],"greedy":false}}',
       '{"_cls":"IntersectionEnsemble","_dict":{"ensemble1":{"_hex_uuid":"0x9ad8903add4b11e7a116000000000454","_store":"ensembles"},"ensemble2":{"_hex_uuid":"0x9ad8903add4b11e7a116000000000456","_store":"ensembles"}}}',
       '{"_cls":"AllInXEnsemble","_dict":{"trusted":true,"volume":{"_hex_uuid":"0x833357f8dd4b11e7b495000000000024","_store":"volumes"}}}',
       '{"_cls":"LengthEnsemble","_dict":{"length":1}}',
       '{"_cls":"IntersectionEnsemble","_dict":{"ensemble1":{"_hex_uuid":"0x9ad8903add4b11e7a11600000000045c","_store":"ensembles"},"ensemble2":{"_hex_uuid":"0x9ad8903add4b11e7a11600000000045e","_store":"ensembles"}}}',
       '{"_cls":"AllOutXEnsemble","_dict":{"trusted":true,"volume":{"_hex_uuid":"0x9ad8903add4b11e7a11600000000045a","_store":"volumes"}}}',
       '{"_cls":"PartOutXEnsemble","_dict":{"trusted":true,"volume":{"_hex_uuid":"0x9ad8903add4b11e7a1160000000002ba","_store":"volumes"}}}',
       '{"_cls":"IntersectionEnsemble","_dict":{"ensemble1":{"_hex_uuid":"0x9ad8903add4b11e7a116000000000464","_store":"ensembles"},"ensemble2":{"_hex_uuid":"0x9ad8903add4b11e7a116000000000466","_store":"ensembles"}}}',
       '{"_cls":"AllInXEnsemble","_dict":{"trusted":true,"volume":{"_hex_uuid":"0x9ad8903add4b11e7a116000000000462","_store":"volumes"}}}',
       '{"_cls":"LengthEnsemble","_dict":{"length":1}}'], dtype=object)

Alternatively you can use save.

In [26]:
save_storage.ensembles.save(full_ensemble, 'full')
Out[26]:
224870977374347836148056640146643091776

And the ensemble now has the .name property set

In [27]:
print(empty_ensemble.name)
print(full_ensemble.name)
empty
full

tags

The storage has a special store named tag. It can reference arbitrary objects and is mostly used to name things for easy access later.

In [28]:
print(len(storage.tag))
0
In [29]:
for t in storage.tag:
    print(t)
In [30]:
storage.tag.keys()
Out[30]:
dict_keys([])
In [31]:
for name, obj in storage.tag.iteritems():
    print('{:20s} : {:20s}'.format(name, obj.__class__.__name__))
In [32]:
dict(storage.tag)
Out[32]:
{}
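Writing a tag works like setting a dictionary entry. A small sketch on the writable storage from above (the tag name is made up):

In [ ]:
save_storage.tag['my_empty_ensemble'] = empty_ensemble
print(list(save_storage.tag.keys()))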

Indexing

Each loaded object can report its index within a specific storage via .idx. This is necessary since we can, in theory, store an object in several different storages at once, and these might assign different indices.

In [33]:
samp = storage.samples[0]
print(storage.idx(samp))
print(samp.idx(storage))
print(samp.idx(storage.samples))
0
0
None

Iterators

A list is iterable, and so is a store. Let's iterate over all ensembles and list some of their names

In [34]:
[ens.name for ens in storage.ensembles][2:4]
Out[34]:
['[AllInXEnsemble]', '[LengthEnsemble]']

Maybe you have noticed that some commands run slower the first time. This is because we use caching: once an object is loaded, it stays in memory and can be accessed much faster.
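A quick, unscientific way to watch the cache at work (timings will vary):

In [ ]:
import time
t0 = time.time()
_ = storage.trajectories[1]  # first access reads from disk
t1 = time.time()
_ = storage.trajectories[1]  # second access is served from the in-memory cache
t2 = time.time()
print('first: %.4f s, cached: %.4f s' % (t1 - t0, t2 - t1))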

Searching for objects

One way to find objects is by name, as mentioned before, but beyond that there are no special search functions. Instead we can use ordinary Python idioms to load what we need: list comprehensions are the magic words. Say we want all snapshots that lie in a given state. We could load all snapshots and filter them afterwards, and the underlying code does exactly that, but a list comprehension is the more elegant way to write it.

In [35]:
stA = storage.volumes['A']
first_5000_snaps = storage.snapshots[0:5000]
snapshots_in_A = [snapshot for snapshot in first_5000_snaps if stA(snapshot)]
print('We found %d snapshots in StateA among %d total snapshots' % (len(snapshots_in_A), len(first_5000_snaps)))
We found 572 snapshots in StateA among 5000 total snapshots

Let's do something more useful: for a TIS ensemble, we want statistics on the path lengths of the trajectories in the Sample objects sampled for that ensemble. And we only want samples that were generated in our production runs and are present in a SampleSet.

TODO: add a way to select only specific SampleSets

In [36]:
print(storage.samplesets[0])
<openpathsampling.sample.SampleSet object at 0x11cac5ef0>
In [37]:
my_network = storage.networks[0]
my_ensemble = my_network.sampling_ensembles[0]
relevant_samples = [
    sample 
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print(len(relevant_samples))
1001

and finally compute the average length

In [38]:
list_of_path_lengths = [
    len(sample.trajectory)
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print(list_of_path_lengths)
if len(list_of_path_lengths) > 0:
    mean = float(sum(list_of_path_lengths))/len(list_of_path_lengths)
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print(mean)
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 44, 31, 31, 31, 31, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 46, 46, 46, 46, 46, 46, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 75, 75, 69, 69, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 26, 26, 26, 26, 26, 26, 26, 26, 26, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
20.195804195804197

Alright, we loaded data from a bootstrapping run, so the analysis itself is pointless, but the code is rather short considering what we just did.

Generator expression

There is another very cool Python feature worth noting: generator expressions. Above we used list comprehensions to build the full list of values, but what if we don't want the whole list at once, perhaps because it would not fit in memory? We can do the same thing with a generator (although here this would only pay off if we had to average over billions of samples). So assume the list of lengths is too large for memory. sum does not mind consuming its input piece by piece, so we construct an object that hands us one element at a time. Such objects are called iterators, and there is a syntactic shortcut for creating them: use round brackets instead of the square brackets of a list comprehension. The example above then looks like this

In [39]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
total = float(sum(iterator_over_path_lengths))
print(total)
20216.0

Note that we now have a generator, and no values have been computed yet. When we iterate over it, it yields one value at a time, and sum consumes these just as before. One important caveat: once an iterator has been consumed, it is exhausted and cannot simply be run again, so for a second pass we have to recreate it.

In [40]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
total = 0
count = 0
for length in iterator_over_path_lengths:
    total += length
    count += 1
    
if count > 0:
    mean = float(total)/count
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print(mean)
20.195804195804197

Voilà, this time without computing all the lengths beforehand!
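Since a generator is consumed after one pass, a common pattern is to wrap it in a small factory function so that a fresh iterator can be produced whenever needed. A sketch (the function name is made up):

In [ ]:
def iter_path_lengths():
    # each call returns a fresh generator over the same selection
    return (
        len(sample.trajectory)
        for sample_set in storage.samplesets
        for sample in sample_set
        if sample.ensemble is my_ensemble
    )

total = sum(iter_path_lengths())             # first pass
count = sum(1 for _ in iter_path_lengths())  # second pass, fresh iterator
print(float(total) / count if count else 0.0)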

A last interesting example is statistics on acceptance. Each sample knows which mover was involved in its creation; this is stored as .details.mover in its .details attribute. Let us look only at forward shooting moves

In [41]:
ff_movers = [m for m in storage.pathmovers if type(m) is paths.ForwardShootMover]
In [42]:
ff_movers
Out[42]:
[<openpathsampling.pathmover.ForwardShootMover at 0x11dba4898>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d312c88>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11db53080>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d493588>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d493cc0>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11dcb2898>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d493ba8>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d477d68>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11db53128>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11d493860>]
In [43]:
if len(ff_movers) > 2:
    mover = ff_movers[2]
    print("Use a '%s' for ensemble(s) '%s'" % ( mover.cls, mover.ensemble.name ))
Use a 'ForwardShootMover' for ensemble(s) 'Out A 2'

One little trick is worth noticing: when a generator expression is written directly inside a function call (for example str.join), the resulting iterator is passed straight to the function without building an intermediate list.

Now to get statistics on acceptances
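A hedged sketch of one way to do this, tallying accepted versus trial moves per mover type from the stored steps (this assumes the usual MoveChange attributes .canonical, .mover, and .accepted; it is not the original notebook's cell):

In [ ]:
from collections import Counter

trials, accepted = Counter(), Counter()
for step in storage.steps:
    change = step.change.canonical  # the canonical sub-change of this step
    name = change.mover.__class__.__name__ if change.mover else 'unknown'
    trials[name] += 1
    if change.accepted:
        accepted[name] += 1

for name in sorted(trials):
    print('%-25s %5d / %5d accepted' % (name, accepted[name], trials[name]))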
