Splitting Files

OpenPathSampling saves all the information about the simulation, including the coordinates and velocities of every snapshot. This makes it possible to perform many different analyses later, even analyses that hadn’t been expected before the sampling.

However, this also means that the files can be very large, and frequently we don’t need all the coordinate and velocity data. This example will show how to split the file into two: a large file with the coordinates and velocities, and a smaller file with only the information needed to run the main analysis. This allows you to copy the smaller file to a local drive and perform the analysis interactively.

This particular example extends the toy MSTIS example. It shows how to split the file, and then shows that the analysis still works.


toy_mstis_A1_split

Splitting a simulation

Included in this notebook:

  • Split a full simulation file into trajectories and the rest
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import openpathsampling as paths
import numpy as np

The optimal way to use storage depends on whether you are running production or analysis. For analysis, you should open the file as an AnalysisStorage object, which makes the analysis much faster.

In [2]:
%%time
storage = paths.AnalysisStorage("mstis.nc")
CPU times: user 8.86 s, sys: 359 ms, total: 9.22 s
Wall time: 9.53 s
In [3]:
st_split = paths.Storage('mstis_strip.nc', 'w')
In [4]:
st_traj = paths.Storage('mstis_traj.nc', 'w')
st_data = paths.Storage('mstis_data.nc', 'w')
In [5]:
st_split.fallback = storage
In [5]:
st_data.fallback = storage
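Setting `fallback` means that any lookup that misses in the new file is resolved from the original storage, while new saves go only to the new file. Conceptually this is a chained lookup; a minimal stdlib analogue (illustrative only, not the OPS implementation) using `collections.ChainMap`:

```python
from collections import ChainMap

# Hypothetical stand-ins for the two files: the new (mostly empty) store
# and the full original storage it falls back to.
split_store = {}                       # new file: starts empty
full_store = {"snap-0": "coords+vel"}  # original file: has everything

# Lookups try split_store first, then fall back to full_store.
chained = ChainMap(split_store, full_store)
print(chained["snap-0"])   # "coords+vel", found via the fallback

# New saves go to the front store only; the original stays untouched.
chained["snap-1"] = "new data"
print("snap-1" in split_store, "snap-1" in full_store)  # True False
```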

Store all trajectories completely in the data file

In [6]:
st_data.snapshots.save(storage.snapshots[0])
st_traj.snapshots.save(storage.snapshots[0])
Out[6]:
UUID('cb90664f-80cc-11e6-90fa-0000000098cb')

Add a single snapshot as a reference and create the appropriate stores

In [6]:
st_split.snapshots.save(storage.snapshots[0])
Out[6]:
UUID('cb90664f-80cc-11e6-90fa-0000000098cb')

Store only shallow trajectories (empty snapshots) in the main file

Fix the CVs first; the rest is fine.

In [7]:
cvs = storage.cvs
In [15]:
q = storage.snapshots.all()

Fill the weak cache from the stored cache. This should be fast, and we can later use the weak cache (as long as q exists) to fill the cache of the data file.

In [16]:
%%time
_ = [cv(q) for cv in cvs]
CPU times: user 1.94 s, sys: 41.1 ms, total: 1.98 s
Wall time: 1.96 s
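The weak cache described above behaves like a `weakref.WeakValueDictionary`: its entries stay alive only while something else (here, `q`) holds a strong reference to the snapshots. A sketch of that behavior (illustrative only, not the OPS cache code; immediate emptying relies on CPython's reference counting):

```python
import weakref

class Snapshot:
    """Minimal stand-in for an OPS snapshot."""
    def __init__(self, idx):
        self.idx = idx

cache = weakref.WeakValueDictionary()

# Strong references, analogous to q = storage.snapshots.all()
q = [Snapshot(i) for i in range(3)]
cache.update({s.idx: s for s in q})

print(len(cache))   # 3: entries live while q exists

del q               # drop the strong references...
print(len(cache))   # 0: the weak cache empties itself (CPython)
```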

Now that we have cached the CV values, we can save the CVs in the new store. This will also set the disk cache to the new file, and since the file is new, that cache starts out empty.

In [17]:
%%time
# this will also switch the storage cache to the new file
_ = map(st_split.cvs.save, storage.cvs)
CPU times: user 83.1 ms, sys: 289 ms, total: 372 ms
Wall time: 729 ms
In [9]:
%%time
# this will also switch the storage cache to the new file
_ = map(st_data.cvs.save, storage.cvs)
CPU times: user 79.5 ms, sys: 57.7 ms, total: 137 ms
Wall time: 136 ms

If all CVs are fully cached, we can store the snapshots now; the auto-complete will fill the CV disk store automatically as snapshots are saved. This takes a little while.

In [18]:
len(st_split.snapshots)
Out[18]:
2
In [20]:
%%time
_ = map(st_split.trajectories.mention, storage.trajectories)
CPU times: user 37.2 s, sys: 688 ms, total: 37.9 s
Wall time: 38 s
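`mention` registers a trajectory in the store without writing the heavy snapshot payload, whereas `save` writes everything. A toy illustration of that distinction (hypothetical store, not the OPS API internals):

```python
class ToyStore:
    """Toy key/value store distinguishing shallow and full saves."""
    def __init__(self):
        self.records = {}

    def save(self, name, payload):
        # Full save: identity plus the heavy data (e.g. coordinates).
        self.records[name] = {"payload": payload}

    def mention(self, name):
        # Shallow save: register the identity only, no payload.
        self.records.setdefault(name, {"payload": None})

store = ToyStore()
store.mention("traj-0")           # known to the store, but empty
store.save("traj-1", [0.1, 0.2])  # fully stored
print(store.records["traj-0"]["payload"] is None)  # True
print(store.records["traj-1"]["payload"])          # [0.1, 0.2]
```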
In [ ]:
print len(st_split.snapshots)
In [10]:
%%time
_ = map(st_data.trajectories.mention, storage.trajectories)
CPU times: user 32.8 s, sys: 410 ms, total: 33.3 s
Wall time: 33.3 s

Fill the trajectory store with only the trajectories and their snapshots. We are using lots of small snapshots, which are slow to store compared with large ones, so this will also take a minute or so.

In [11]:
%%time
_ = map(st_traj.trajectories.save, storage.trajectories)
CPU times: user 1min 58s, sys: 923 ms, total: 1min 59s
Wall time: 1min 59s

Finally, store all steps from the simulation. This file should contain everything you need.

In [12]:
%%time
_ = map(st_data.steps.save, storage.steps)
CPU times: user 7.08 s, sys: 104 ms, total: 7.19 s
Wall time: 7.19 s

And compare the file sizes.

In [13]:
print 'Original file:', storage.file_size_str
print 'Data file:', st_data.file_size_str
print 'Traj file:', st_traj.file_size_str
Original file: 61.51MB
Data file: 49.50MB
Traj file: 25.74MB
In [25]:
print 'So we saved about %2.0f %%' % ((1.0 - st_data.file_size / float(storage.file_size)) * 100.0)
So we saved about 20 %
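The quoted figure is just the relative size reduction of the data file against the original; using the sizes printed above:

```python
def percent_saved(original_mb, data_mb):
    """Relative size reduction of the data file vs. the original."""
    return (1.0 - data_mb / float(original_mb)) * 100.0

# File sizes from the output above, in MB.
print('So we saved about %2.0f %%' % percent_saved(61.51, 49.50))
# -> So we saved about 20 %
```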

Now we do the trick: use the small data file instead of the full simulation file and see whether the analysis still works.

In [26]:
st_data.close()
st_traj.close()
storage.close()
In [ ]:
st_data.snapshots.only_mention = True

(toy_mstis_A1_split.ipynb; toy_mstis_A1_split.py)


toy_mstis_A2_split_analysis

Analyzing a split MSTIS simulation

Included in this notebook:

  • Open split files and look at the data
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import openpathsampling as paths
import numpy as np
In [2]:
%%time
storage = paths.AnalysisStorage('mstis_data.nc')
CPU times: user 7.65 s, sys: 127 ms, total: 7.78 s
Wall time: 7.78 s

Analyze the rate, even though no snapshots are present in the analyzed file.

In [3]:
mstis = storage.networks.load(0)
In [4]:
mstis.hist_args['max_lambda'] = { 'bin_width' : 0.02, 'bin_range' : (0.0, 0.5) }
mstis.hist_args['pathlength'] = { 'bin_width' : 5, 'bin_range' : (0, 150) }
In [5]:
%%time
mstis.rate_matrix(storage.steps, force=True)
CPU times: user 4.87 s, sys: 245 ms, total: 5.12 s
Wall time: 4.96 s
Out[5]:
                          {x|opA(x) in [0.0, 0.2]}  {x|opB(x) in [0.0, 0.2]}  {x|opC(x) in [0.0, 0.2]}
{x|opA(x) in [0.0, 0.2]}  NaN                       0.00139595                0
{x|opB(x) in [0.0, 0.2]}  0.00229702                NaN                       0.0128833
{x|opC(x) in [0.0, 0.2]}  0.000852395               0.00485865                NaN

Move scheme analysis

In [6]:
scheme = storage.schemes[0]
In [7]:
scheme.move_summary(storage.steps)
Null moves for 1 cycles. Excluding null moves:
ms_outer_shooting ran 4.500% (expected 4.98%) of the cycles with acceptance 21/27 (77.78%)
repex ran 20.667% (expected 22.39%) of the cycles with acceptance 49/124 (39.52%)
shooting ran 47.333% (expected 44.78%) of the cycles with acceptance 207/284 (72.89%)
minus ran 2.500% (expected 2.99%) of the cycles with acceptance 11/15 (73.33%)
pathreversal ran 25.000% (expected 24.88%) of the cycles with acceptance 99/150 (66.00%)
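Each line of the summary reports acceptance as accepted/trials; the percentages are just that ratio. A quick check of the numbers above:

```python
def acceptance(accepted, trials):
    """Acceptance percentage as printed by move_summary."""
    return 100.0 * accepted / trials

# (mover, accepted, trials) taken from the summary output above
for mover, acc, tot in [("ms_outer_shooting", 21, 27),
                        ("repex", 49, 124),
                        ("shooting", 207, 284),
                        ("minus", 11, 15),
                        ("pathreversal", 99, 150)]:
    print("%s: %d/%d (%.2f%%)" % (mover, acc, tot, acceptance(acc, tot)))
# shooting: 207/284 (72.89%), matching the summary
```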

Replica move history tree

In [8]:
import openpathsampling.visualize as vis
reload(vis)
from IPython.display import SVG
In [9]:
tree = vis.PathTree(
    storage.steps[0:200],
    vis.ReplicaEvolution(replica=2, accepted=False)
)

SVG(tree.svg())
Out[9]:
[SVG output: replica move history tree for replica 2 over steps 0–200]
In [10]:
decorrelated = tree.generator.decorrelated
print "We have " + str(len(decorrelated)) + " decorrelated trajectories."
We have 3 decorrelated trajectories.

Visualizing trajectories

In [11]:
from toy_plot_helpers import ToyPlot
background = ToyPlot()
background.contour_range = np.arange(-1.5, 1.0, 0.1)
background.add_pes(storage.engines[0].pes)
In [12]:
xval = paths.FunctionCV("xval", lambda snap : snap.xyz[0][0])
yval = paths.FunctionCV("yval", lambda snap : snap.xyz[0][1])
live_vis = paths.StepVisualizer2D(mstis, xval, yval, [-1.0, 1.0], [-1.0, 1.0])
live_vis.background = background.plot()

To make this work, we need the actual snapshot coordinates. These are no longer present in the data file, so we attach the trajectory file as a fallback. We do not use AnalysisStorage here, since we are not caching anything.

In [14]:
storage.cvs
Out[14]:
store.cvs[CollectiveVariable]
In [13]:
fallback = paths.Storage('mstis_traj.nc', 'r')
In [14]:
storage.fallback = fallback
In [15]:
live_vis.draw_samples(list(tree.samples))
Out[15]:
[matplotlib figure: samples drawn over the PES background]

(toy_mstis_A2_split_analysis.ipynb; toy_mstis_A2_split_analysis.py)