signac
https://signac.io/feed.xml
2024-01-16T16:34:46+00:00
The signac framework aids in the management of large and heterogeneous data spaces. It provides a simple and robust data model to create a well-defined indexable storage layout for data and metadata. This makes it easier to operate on large data spaces, streamlines post-processing and analysis and makes data collectively accessible.
Carl Simon Adorf
Release of signac version 2.0.0
2023-03-30T00:00:00+00:00
https://signac.io/development/2023/03/30/release-signac-2.0.0
<p>The signac developers are proud to release signac version <strong>2.0.0</strong>! Over two years in the making, this major release streamlines the package, cutting over 10,000 lines of unused code.</p>
<p>Some highlights of this release include:</p>
<ul>
<li>a cleaner property-based API for the <code class="language-plaintext highlighter-rouge">Project</code> and <code class="language-plaintext highlighter-rouge">Job</code> classes,</li>
<li>the ability to access the project of a job with <code class="language-plaintext highlighter-rouge">job.project</code>,</li>
<li>a new internal signac schema that simplifies and declutters project layouts.</li>
</ul>
<p>We’ve also released signac-flow <strong>0.25.0</strong> and signac-dashboard <strong>0.5.0</strong> that support the changes in the signac package. Update these packages using your <a href="https://docs.signac.io/en/latest/installation.html">preferred installation channels</a>.</p>
<p>The removal of various old code paths makes the code easier than ever to work with, so it’s a great time to <a href="https://docs.signac.io/en/latest/community.html#contributions">get involved</a>!</p>
<p>For a full list of changes, see the <a href="https://github.com/glotzerlab/signac/releases/tag/v2.0.0">changelog</a>.</p>
Corwin Kerr
cbkerr@umich.edu
Summarizing Aggregation over the Summer
2020-08-22T00:00:00+00:00
https://signac.io/gsoc/2020/08/22/gsoc-hardik-blog-6
<p>Hello, welcome to the final blog in the <em>Series of Blogs by Hardik</em>.
If you haven’t read my previous blogs then feel free to have a look at them.
The links to the blogs are provided below.</p>
<ul>
<li>
<a href="/gsoc/2020/08/08/gsoc-hardik-blog-5.html">The Last Phase, GSoC 2020!</a>
</li>
<li>
<a href="/gsoc/2020/07/23/gsoc-hardik-blog-4.html">Aggregates, a user problem?</a>
</li>
<li>
<a href="/development/2020/06/26/local-SLURM-environment.html">Local SLURM cluster setup</a>
</li>
<li>
<a href="/gsoc/2020/06/26/gsoc-hardik-blog-2.html">Coding Aggregation Begins...</a>
</li>
<li>
<a href="/gsoc/2020/06/11/gsoc-hardik-blog-1.html">Introducing Aggregation</a>
</li>
</ul>
<p>So finally an amazing journey of Google Summer of Code (GSoC) has come to an end.
Over the summer I learned about a lot of things, and the slope of the learning curve is definitely going to increase in the future.
I’d like to thank <strong>Bradley, Brandon, Vyas, Alyssa, Mike, Simon</strong> and the rest of the <strong>signac</strong> team for their constant help and support throughout the summer.
This blog post documents the work I did for GSoC 2020 on introducing a feature of aggregate operations in <strong>signac-flow</strong>.</p>
<h1 id="project-description">Project Description</h1>
<p>The signac data and workflow model is primarily designed around the concept of operations acting on jobs, where the management of the job’s data is handled by the <strong>signac</strong> package and the workflow definition and execution is handled by <strong>signac-flow</strong>.
The current workflow model treats operations as always acting on single jobs.
This project allows users to execute operations that accept multiple jobs as their arguments.</p>
<p>A practical example of using an aggregate operation is described in <a href="https://github.com/glotzerlab/signac-examples/pull/15" target="_blank">this pull request</a>.
In the example we aim to generate a plot of temperatures (in °C) versus days of a month having 31 days.
After that, we compare that plot with the average temperature of that month.</p>
<p>I raised a pull request (now closed) which gave the team an overview of aggregation.
This pull request helped me track my project.
Because it was very large, merging it in one piece would not have been a good decision, so the team suggested that I break my work into several small steps.</p>
<p>I’ll now describe my approach to the project in several points.</p>
<h3 id="make-flowcondition-class-private-315">Make <code class="language-plaintext highlighter-rouge">FlowCondition</code> class private (<a href="https://github.com/glotzerlab/signac-flow/pull/315" target="_blank">#315</a>)</h3>
<p>No user-facing method currently requires direct access to this class or returns instances of it.
Moreover, this class should never be instantiated by users directly.
Also, the condition functions are evaluated using this class, and we’ll see in the upcoming points that internally every method is passed a list of jobs as a single positional argument rather than as variable arguments.
Handling such a class could be confusing for users.
Hence, this class doesn’t need to be in the public API.</p>
<h3 id="make-flowoperation-callable-326">Make <code class="language-plaintext highlighter-rouge">FlowOperation</code> callable (<a href="https://github.com/glotzerlab/signac-flow/pull/326" target="_blank">#326</a>)</h3>
<p>The classes <code class="language-plaintext highlighter-rouge">FlowCmdOperation</code> and <code class="language-plaintext highlighter-rouge">FlowOperation</code> are responsible for handling the logic associated with signac operations with or without the <code class="language-plaintext highlighter-rouge">@cmd</code> decorator, respectively.
Previously, the logic for calling these operation functions differed internally, but with aggregation being introduced we need to maintain consistency throughout the code base in order to avoid confusion.</p>
<h3 id="make-joboperation-private-325">Make <code class="language-plaintext highlighter-rouge">JobOperation</code> private (<a href="https://github.com/glotzerlab/signac-flow/pull/325" target="_blank">#325</a>)</h3>
<p>The <code class="language-plaintext highlighter-rouge">JobOperation</code> class was exposed to users primarily for use with <code class="language-plaintext highlighter-rouge">submit_operations</code> and <code class="language-plaintext highlighter-rouge">run_operations</code>.
The use case for this class was small, and the structure of the class was expected to change after aggregation.
Hence, <code class="language-plaintext highlighter-rouge">JobOperation</code> and the methods which return its instances were deprecated and are scheduled to be removed in <strong>signac-flow</strong> version 0.13.</p>
<h3 id="deprecate-eligible-and-complete-methods-from-the-user-api-337">Deprecate <code class="language-plaintext highlighter-rouge">eligible</code> and <code class="language-plaintext highlighter-rouge">complete</code> methods from the user API (<a href="https://github.com/glotzerlab/signac-flow/pull/337" target="_blank">#337</a>)</h3>
<p>The <code class="language-plaintext highlighter-rouge">eligible</code> and <code class="language-plaintext highlighter-rouge">complete</code> methods were originally used to check whether a job-operation pair was eligible to run or be submitted, or was complete, respectively.
The use case for these methods is small, and they may confuse users when checking the eligibility of an aggregate-operation pair.
Hence, <code class="language-plaintext highlighter-rouge">eligible</code> and <code class="language-plaintext highlighter-rouge">complete</code> methods of <code class="language-plaintext highlighter-rouge">FlowGroup</code> and <code class="language-plaintext highlighter-rouge">BaseFlowOperation</code> were deprecated and are scheduled to be removed in <strong>signac-flow</strong> version 0.13.</p>
<h3 id="enable-aggregate-logic-in-flow-324">Enable aggregate logic in flow (<a href="https://github.com/glotzerlab/signac-flow/pull/324" target="_blank">#324</a>)</h3>
<p>Reviewing actual aggregation becomes much easier if <strong>signac-flow</strong> first supports the logic of aggregates of one.
This pull request internally converts all jobs into aggregates of one.
Hence we now pass a tuple containing a single job to every method internally.
This was the first major pull request to be merged into the master branch.</p>
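<p>The idea of treating every job as an aggregate of one can be sketched in a few lines. This is only an illustration, not the code from the pull request: the helper names <code class="language-plaintext highlighter-rouge">as_aggregates_of_one</code>, <code class="language-plaintext highlighter-rouge">run_operation</code> and <code class="language-plaintext highlighter-rouge">describe</code> are hypothetical, jobs are represented by plain id strings, and unpacking the tuple when calling the operation is an assumption.</p>

```python
from typing import Callable, Iterable, List, Tuple

Job = str  # stand-in for a signac job; only the id matters in this sketch


def as_aggregates_of_one(jobs: Iterable[Job]) -> List[Tuple[Job, ...]]:
    """Wrap every job in a 1-tuple so all internal code receives an aggregate."""
    return [(job,) for job in jobs]


def run_operation(operation: Callable[..., str], aggregate: Tuple[Job, ...]) -> str:
    """Internal code handles the tuple; the operation gets the jobs unpacked (assumption)."""
    return operation(*aggregate)


def describe(*jobs: Job) -> str:
    """Hypothetical operation that works for one job or for many."""
    return f"{len(jobs)} job(s): {', '.join(jobs)}"


aggregates = as_aggregates_of_one(["26021048", "42b7b4f2"])
results = [run_operation(describe, aggregate) for aggregate in aggregates]
```

<p>Because single jobs and real aggregates share one code path, the later pull requests only have to change how the tuples are built, not how they are consumed.</p>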
<h3 id="change-submission-id-to-support-aggregation-334">Change submission ID to support aggregation (<a href="https://github.com/glotzerlab/signac-flow/pull/334" target="_blank">#334</a>)</h3>
<p>Previously, every <code class="language-plaintext highlighter-rouge">JobOperation</code> instance that was submitted held a submission id, a unique id identifying the job associated with a group of operations.
This PR settled decisions such as how an aggregate-operation pair should be represented in a submission script and how to make the id of an aggregate associated with a group unique.
The id will now contain details like the group name, the length of the aggregate, and the concatenated job ids of the jobs in the aggregate.
The representation of an aggregate-operation in a script will be as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my-op[#1](26021048)
my-op[#2](26021048, 42b7b4f2)
my-op[#3](26021048, 42b7b4f2, 44550aef)
my-op[#4](26021048, ..., 4b893796)
</code></pre></div></div>
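<p>The display rule suggested by these four lines can be reproduced with a small function. This is a sketch inferred from the sample output only: the function name <code class="language-plaintext highlighter-rouge">format_aggregate</code>, the 8-character truncation and the threshold of three listed ids are assumptions, and the real implementation may differ.</p>

```python
from typing import Sequence


def format_aggregate(op_name: str, job_ids: Sequence[str], max_shown: int = 3) -> str:
    """Render an operation name, the aggregate length, and abbreviated job ids."""
    short = [job_id[:8] for job_id in job_ids]  # assumed: ids are cut to 8 characters
    if len(short) > max_shown:
        # Long aggregates show only the first and the last job id.
        shown = f"{short[0]}, ..., {short[-1]}"
    else:
        shown = ", ".join(short)
    return f"{op_name}[#{len(short)}]({shown})"


ids = ["26021048", "42b7b4f2", "44550aef", "4b893796"]
lines = [format_aggregate("my-op", ids[:n]) for n in range(1, 5)]
```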
<h3 id="add-aggregator-classes-to-flow-348">Add aggregator classes to flow (<a href="https://github.com/glotzerlab/signac-flow/pull/348" target="_blank">#348</a>)</h3>
<p>This pull request is currently the latest one that I’ve filed.
A new and more efficient way of storing aggregates, suggested by my mentors, is being implemented in this PR.
This PR introduces aggregator classes to flow, which are responsible for registering, storing, and generating aggregates whenever required.
These are <code class="language-plaintext highlighter-rouge">aggregator</code>, <code class="language-plaintext highlighter-rouge">_AggregatesStore</code>, <code class="language-plaintext highlighter-rouge">_DefaultAggregateStore</code>, and <code class="language-plaintext highlighter-rouge">_MakeAggregate</code>.
Users will apply the <code class="language-plaintext highlighter-rouge">aggregator</code> class as a decorator on operation functions.
It supports aggregating jobs into groups of a given size or grouping them by one or more statepoint parameters, sorting them by a statepoint parameter (optionally in reverse order), and selecting only some jobs from the project using a <code class="language-plaintext highlighter-rouge">select</code> argument.</p>
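<p>To make these options concrete, here is a plain-Python sketch of the same ideas. It does not use the <code class="language-plaintext highlighter-rouge">aggregator</code> class itself; the functions <code class="language-plaintext highlighter-rouge">make_aggregates</code> and <code class="language-plaintext highlighter-rouge">group_by</code> and their parameters are hypothetical, and jobs are represented by their statepoint dictionaries.</p>

```python
from itertools import groupby


def make_aggregates(jobs, size, sort_by=None, reverse=False, select=None):
    """Filter, optionally sort, then split the jobs into tuples of at most `size`."""
    chosen = [job for job in jobs if select is None or select(job)]
    if sort_by is not None:
        chosen.sort(key=lambda job: job[sort_by], reverse=reverse)
    return [tuple(chosen[i:i + size]) for i in range(0, len(chosen), size)]


def group_by(jobs, key):
    """Put jobs that share the same value of a statepoint parameter together."""
    ordered = sorted(jobs, key=lambda job: job[key])  # groupby needs sorted input
    return [tuple(group) for _, group in groupby(ordered, key=lambda job: job[key])]


# Six hypothetical jobs, each described only by its statepoint.
jobs = [{"i": i, "parity": i % 2} for i in range(6)]

in_fours = make_aggregates(jobs, size=4)
by_parity = group_by(jobs, "parity")
top_pairs = make_aggregates(
    jobs, size=2, sort_by="i", reverse=True, select=lambda job: job["i"] > 1
)
```

<p>With six jobs and groups of four, the last aggregate holds only the two remaining jobs.</p>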
<h3 id="enable-aggregate-status-check-335">Enable aggregate status check (<a href="https://github.com/glotzerlab/signac-flow/pull/335" target="_blank">#335</a>)</h3>
<p>The changes made in this pull request handle the issue with the status check described in one of my <a href="https://signac.io/gsoc/2020/07/23/gsoc-hardik-blog-4.html" target="_blank">blog posts</a>.
Aggregates will now be registered on initialization of a <code class="language-plaintext highlighter-rouge">FlowProject</code>. If a user changes the aggregator associated with an operation function, the user has to register aggregates again using the <code class="language-plaintext highlighter-rouge">register_aggregates()</code> method (or initialize the project once again); otherwise the previously registered aggregates will be used.
Every aggregate will now have an id associated with it.
So, if a user wants to run an operation for that particular aggregate then the user can get the id of an aggregate using the <code class="language-plaintext highlighter-rouge">get_aggregate_id()</code> method.
After that the command line option <code class="language-plaintext highlighter-rouge">-j</code> can be used to specify the id.
An example command to run the operation which either accepts a single job having id <code class="language-plaintext highlighter-rouge">job_id1</code> or an aggregate having id <code class="language-plaintext highlighter-rouge">aggregate_id1</code> would be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python project.py run -j job_id1 aggregate_id1
</code></pre></div></div>
<p><strong>Work left to do in this PR:</strong>
Since <a href="https://github.com/glotzerlab/signac-flow/pull/348" target="_blank">#348</a> is the latest PR for the project, this PR still needs to add support for all the new features introduced there.</p>
<h3 id="add-aggregation-feature-to-flow-336">Add aggregation feature to flow (<a href="https://github.com/glotzerlab/signac-flow/pull/336" target="_blank">#336</a>)</h3>
<p>This pull request, when merged, will enable the users to perform actual aggregation for their workflow.
In this PR I have refactored all the templates used for status printing in order to support aggregation.
I also wrote all the necessary tests for testing the aggregation feature.
<strong>signac-flow</strong> will now provide a detailed per-aggregate status overview which shows all the jobs in the aggregates associated with every aggregated operation.
Users can also use the <code class="language-plaintext highlighter-rouge">--orphan</code> command line option with the status check to fetch the details of “orphaned” aggregates which were submitted previously but are no longer considered for execution because of modifications in the data space (e.g. the deletion of a job in the aggregate or creation of new jobs that belong in that aggregate).
A sample aggregate-status view while using the status query <code class="language-plaintext highlighter-rouge">python project.py status --detailed</code> is given below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Detailed Aggregate View:
operation jobs_in_aggregate length_of_aggregate status
----------- -------------------------------- --------------------- --------
compute_sum ee5ca9ab62e9dbb7b6abbaaac6443d49 4 [U]
compute_sum 03086200817396c6083c34ac025ec4d5 4 [U]
compute_sum 92f821919b4b0a2f15d0ef3f5d433550 4 [U]
compute_sum 5d63db8dc4821a190f690fd66e4dd0be 4 [U]
compute_sum as32asdga2e9dbb7b6abbaaac6443d49 2 [U]
compute_sum 2sgj2k00817396c6083c34ac025ec4d5 2 [U]
</code></pre></div></div>
<p><strong>Work left to do in this PR:</strong>
Add support for classes responsible for storing aggregates in <a href="https://github.com/glotzerlab/signac-flow/pull/348" target="_blank">#348</a>.</p>
<p>I’m really looking forward to seeing the aggregation feature used in the real world.
This was the major portion of my GSoC journey.
Now I’ll describe the work I did during the summer which was loosely related to the aggregation project.</p>
<h3 id="add-tests-for-directives-class-283">Add tests for Directives class (<a href="https://github.com/glotzerlab/signac-flow/pull/283" target="_blank">#283</a>)</h3>
<p>I wrote tests for two classes: <code class="language-plaintext highlighter-rouge">Directives</code>, which serves as a smart mapping for the environment and user-specified directives, and <code class="language-plaintext highlighter-rouge">DirectivesItem</code>, which serves as a specification for environment directives.</p>
<h3 id="add-pre-commit-hooks-to-signac-and-signac-flow-358-333">Add pre-commit hooks to <strong>signac</strong> and <strong>signac-flow</strong> (<a href="https://github.com/glotzerlab/signac/pull/358" target="_blank">#358</a>, <a href="https://github.com/glotzerlab/signac-flow/pull/333" target="_blank">#333</a>)</h3>
<p>This ensures that the code and documentation written by developers are compliant before committing.
The documentation on how to set up a pre-commit hook can be found <a href="https://github.com/glotzerlab/signac-docs/pull/92" target="_blank">here</a>.</p>
<h2 id="for-prospective-gsoc-2021-students">For prospective GSoC 2021 students</h2>
<p>Students planning to take part in GSoC 2021 should start contributing to the open source community early, in order to pick up basic programming concepts and the version control system used by the organization (mostly Git).
Communicating with the team is the most important part: it strengthens your bond with the community and will help you later in life.</p>
Hardik Ojha
hojha@ee.iitr.ac.in
End of GSoC journey
2020-08-21T00:00:00+00:00
https://signac.io/gsoc/2020/08/21/gsoc-vishav-blog-6
<p>Hi everyone, this is Vishav and I am here with the final iteration of my blog, “Journey to GSoC”.
If you haven’t read my previous blog, you can read it <a href="https://signac.io/gsoc/2020/08/06/gsoc-vishav-blog-5.html" target="_blank">here</a> to catch up.</p>
<p>The last week of the three-month period of <code class="language-plaintext highlighter-rouge">Google Summer of Code</code> is here.
It’s been a great learning experience and a fantastic journey.
Apart from the technical learning, this project also introduced me to an amazing community.
This blog describes my whole journey throughout the project in one space.
A single blog cannot describe all my learning and experiences but I am doing my best to pour all my accumulations into this blog.</p>
<h2 id="improve-synced-data-structures-336">Improve Synced Data Structures (<a href="https://github.com/glotzerlab/signac/pull/336" target="_blank">#336</a>)</h2>
<p>This PR marked the beginning of my GSoC project and implemented the basic synced data structures.
Earlier the JSON backend was implemented using the classes: <code class="language-plaintext highlighter-rouge">SyncedAttrDict</code>, <code class="language-plaintext highlighter-rouge">SyncedList</code> and <code class="language-plaintext highlighter-rouge">JSONDict</code>.
But these classes provided limited functionality, such as a single backend and limited support for nested structures.
So, in order to support different backends and different data structures, I refactored these classes.
In this PR, I have added the following classes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">SyncedCollection</code>: This class is intended for use as an abstract base class. It declares, as abstract methods, everything a subclass must implement to match the API.</li>
<li><code class="language-plaintext highlighter-rouge">SyncedAttrDict</code>: Implements the dict data structure of the API.</li>
<li><code class="language-plaintext highlighter-rouge">SyncedList</code>: Implements the list data structure of the API.</li>
<li><code class="language-plaintext highlighter-rouge">JSONCollection</code>: Implements synchronization functions for JSON backend.</li>
<li><code class="language-plaintext highlighter-rouge">JSONDict</code>: Implements dict data structure with JSON backend.</li>
<li><code class="language-plaintext highlighter-rouge">JSONList</code>: Implements list data structure with JSON backend.</li>
</ul>
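<p>The split between an abstract synced structure and a backend-specific subclass can be sketched as follows. This is a simplified illustration, not signac’s implementation: the class names <code class="language-plaintext highlighter-rouge">SyncedDictSketch</code> and <code class="language-plaintext highlighter-rouge">JSONDictSketch</code> and the method names <code class="language-plaintext highlighter-rouge">_load</code> and <code class="language-plaintext highlighter-rouge">_save</code> are assumptions chosen for the example.</p>

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class SyncedDictSketch(ABC):
    """Dict-like object that re-reads and re-writes its backend on every access."""

    @abstractmethod
    def _load(self):
        """Return the current data stored in the backend as a dict."""

    @abstractmethod
    def _save(self, data):
        """Write the given dict to the backend."""

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        data = self._load()
        data[key] = value
        self._save(data)


class JSONDictSketch(SyncedDictSketch):
    """Backend-specific part: only knows how to read and write a JSON file."""

    def __init__(self, filename):
        self._filename = filename

    def _load(self):
        if not os.path.exists(self._filename):
            return {}
        with open(self._filename) as file:
            return json.load(file)

    def _save(self, data):
        with open(self._filename, "w") as file:
            json.dump(data, file)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.json")
    JSONDictSketch(path)["temperature"] = 21.5
    # A second object pointing at the same file sees the synchronized value.
    value_on_disk = JSONDictSketch(path)["temperature"]
```

<p>Supporting another backend then only requires a new subclass that implements the two abstract methods.</p>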
<h2 id="drop-support-for-python-35-340">Drop Support for Python 3.5 (<a href="https://github.com/glotzerlab/signac/pull/340" target="_blank">#340</a>)</h2>
<p>This PR drops the support for Python 3.5.
This was necessary because we use <code class="language-plaintext highlighter-rouge">collections.abc.Collection</code> in the implementation, and it was introduced in Python 3.6.</p>
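<p>As a brief reminder of what <code class="language-plaintext highlighter-rouge">collections.abc.Collection</code> checks: any class that is sized, iterable, and supports membership tests counts as a collection. The <code class="language-plaintext highlighter-rouge">Bag</code> class below is a hypothetical example and is not part of signac.</p>

```python
from collections.abc import Collection


class Bag:
    """Minimal container: sized, iterable, and supports membership tests."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)

    def __contains__(self, item):
        return item in self._items


# Bag never inherits from Collection, yet it passes the check structurally.
bag_is_collection = isinstance(Bag([1, 2, 3]), Collection)
# A generator has no length, so it is not a Collection.
generator_is_collection = isinstance((x for x in range(3)), Collection)
```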
<h2 id="added-backends-to-syncedcollection-364">Added backends to SyncedCollection (<a href="https://github.com/glotzerlab/signac/pull/364" target="_blank">#364</a>)</h2>
<p>This PR adds <code class="language-plaintext highlighter-rouge">ZarrCollection</code>, <code class="language-plaintext highlighter-rouge">RedisCollection</code>, and <code class="language-plaintext highlighter-rouge">MongoDBCollection</code> to implement the <code class="language-plaintext highlighter-rouge">zarr</code>, <code class="language-plaintext highlighter-rouge">redis</code>, and <code class="language-plaintext highlighter-rouge">MongoDB</code> backend respectively to synced data structures.
Every backend also provides dict and list data structure implementations similar to those of the JSON backend.</p>
<h2 id="added-buffering-and-caching-to-syncedcollection-363">Added buffering and caching to SyncedCollection (<a href="https://github.com/glotzerlab/signac/pull/363" target="_blank">#363</a>)</h2>
<p>In buffering, we suspend synchronization with the backend and the data is temporarily stored in a buffer.
All write operations are written to the buffer, and read operations are performed from the buffer whenever possible.
In caching, we store a copy of the data in memory so that subsequent read operations fetch the data from memory instead of the underlying backend.
Both provide better performance because the data is fetched from memory.</p>
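<p>A minimal sketch of the buffering idea is shown below. It is not the code from the pull request: the class <code class="language-plaintext highlighter-rouge">BufferedStore</code> and its methods are hypothetical, and a plain dictionary stands in for a slow backend so that the number of backend writes can be counted.</p>

```python
from contextlib import contextmanager


class BufferedStore:
    def __init__(self):
        self.backend = {}        # stand-in for a slow backend such as a file
        self.backend_writes = 0  # counts how often the backend is touched
        self._buffer = None      # None means buffering is switched off

    def write(self, key, value):
        if self._buffer is not None:
            self._buffer[key] = value  # synchronization is suspended
        else:
            self.backend[key] = value
            self.backend_writes += 1

    def read(self, key):
        # Serve reads from the buffer whenever possible.
        if self._buffer is not None and key in self._buffer:
            return self._buffer[key]
        return self.backend[key]

    @contextmanager
    def buffered(self):
        self._buffer = {}
        try:
            yield self
        finally:
            # Flush everything to the backend in one step when the block ends.
            pending, self._buffer = self._buffer, None
            if pending:
                self.backend.update(pending)
                self.backend_writes += 1


store = BufferedStore()
with store.buffered():
    for step in range(100):
        store.write("step", step)
    seen_inside = store.read("step")
```

<p>One hundred writes inside the buffered block reach the backend as a single flush, which is where the performance gain comes from.</p>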
<h2 id="added-hypothesis-based-test-to-syncedcollection-373">Added hypothesis based test to SyncedCollection (<a href="https://github.com/glotzerlab/signac/pull/373" target="_blank">#373</a>)</h2>
<p>I worked on adding hypothesis-based testing to <code class="language-plaintext highlighter-rouge">SyncedCollection</code>.
There were a lot of problems with hypothesis in combination with pytest fixtures.
So we decided to close this PR and approach the problem at a later date.</p>
<h2 id="added-validation-layer-to-syncedcollection-378">Added validation layer to SyncedCollection (<a href="https://github.com/glotzerlab/signac/pull/378" target="_blank">#378</a>)</h2>
<p>This PR adds a validation layer to <code class="language-plaintext highlighter-rouge">SyncedCollection</code> by adding a validator (or list of validators) that is applied to inputs.
Previously, we only had a function that validated the keys of <code class="language-plaintext highlighter-rouge">SyncedAttrDict</code>.
Now, this behaviour is generalized to validate all input data, not just that of <code class="language-plaintext highlighter-rouge">SyncedAttrDict</code>.</p>
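<p>The validator list can be pictured as a sequence of functions that each inspect the incoming data and raise an error if something is wrong. The sketch below illustrates that pattern under assumed names (<code class="language-plaintext highlighter-rouge">require_str_keys</code>, <code class="language-plaintext highlighter-rouge">forbid_dots_in_keys</code>, <code class="language-plaintext highlighter-rouge">ValidatedDict</code>); the specific rules are examples and not necessarily the ones in the PR.</p>

```python
from collections.abc import Mapping, Sequence


def require_str_keys(data):
    """Reject mappings whose keys are not strings (checked recursively)."""
    if isinstance(data, Mapping):
        for key, value in data.items():
            if not isinstance(key, str):
                raise TypeError(f"keys must be str, got {type(key).__name__}")
            require_str_keys(value)
    elif isinstance(data, Sequence) and not isinstance(data, str):
        for item in data:
            require_str_keys(item)


def forbid_dots_in_keys(data):
    """Reject string keys that contain a dot (checked recursively)."""
    if isinstance(data, Mapping):
        for key, value in data.items():
            if isinstance(key, str) and "." in key:
                raise ValueError(f"key {key!r} must not contain dots")
            forbid_dots_in_keys(value)
    elif isinstance(data, Sequence) and not isinstance(data, str):
        for item in data:
            forbid_dots_in_keys(item)


class ValidatedDict:
    # Every validator runs on all input data before it is stored.
    _validators = (require_str_keys, forbid_dots_in_keys)

    def __init__(self):
        self._data = {}

    def update(self, new_data):
        for validate in self._validators:
            validate(new_data)
        self._data.update(new_data)

    def as_dict(self):
        return dict(self._data)


doc = ValidatedDict()
doc.update({"temperature": {"unit": "C", "values": [21.5, 22.0]}})

try:
    doc.update({"bad.key": 1})
    rejected = False
except ValueError:
    rejected = True
```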
<h2 id="whats-left-to-do">What’s left to do</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">Lazy statepoint loading</code>: This changes the behavior of <code class="language-plaintext highlighter-rouge">Job</code> to load its statepoint lazily when opened by <code class="language-plaintext highlighter-rouge">id</code>.</li>
</ul>
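<p>Lazy loading here means that opening a job by id does not touch the disk until the statepoint is actually requested. The following sketch shows the general pattern with a hypothetical <code class="language-plaintext highlighter-rouge">LazyJob</code> class; the workspace layout used here (one directory per job id containing a statepoint JSON file) is an assumption for the example, and this is not the planned implementation.</p>

```python
import json
import os
import tempfile


class LazyJob:
    def __init__(self, workspace, job_id):
        self.id = job_id
        self._statepoint_file = os.path.join(
            workspace, job_id, "signac_statepoint.json"
        )
        self._statepoint = None  # nothing is read when the job is opened
        self.file_reads = 0      # bookkeeping for the demonstration only

    @property
    def statepoint(self):
        # Read the file on first access, then reuse the loaded data.
        if self._statepoint is None:
            with open(self._statepoint_file) as file:
                self._statepoint = json.load(file)
            self.file_reads += 1
        return self._statepoint


with tempfile.TemporaryDirectory() as workspace:
    job_id = "26021048"
    os.makedirs(os.path.join(workspace, job_id))
    with open(os.path.join(workspace, job_id, "signac_statepoint.json"), "w") as file:
        json.dump({"day": 1}, file)

    job = LazyJob(workspace, job_id)
    reads_after_open = job.file_reads
    first = job.statepoint
    second = job.statepoint
    reads_after_access = job.file_reads
```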
<p>I believe this GSoC journey will shape my career path and direct my attitude in the right way.
Over the summer I learned about a lot of things and have come out as a better developer and a better person as a whole.
I’d like to extend my heartfelt thanks towards my mentors and the whole <strong>signac</strong> community for their constant help and support throughout the summer.</p>
Vishav Sharma
vishavsharma1771@gmail.com
The Last Phase, GSoC 2020!
2020-08-08T00:00:00+00:00
https://signac.io/gsoc/2020/08/08/gsoc-hardik-blog-5
<p>Hello, welcome to the 5th blog in the <em>Series of Blogs by Hardik</em>.</p>
<p>In this post, I will explain how you should test the aggregation feature and my strategy for the last phase of <strong>Google Summer of Code (GSoC) 2020</strong>.
So, let’s get started.</p>
<h2 id="using-the-aggregation-feature">Using the aggregation feature</h2>
<p>I have posted a <a href="https://github.com/glotzerlab/signac-flow/pull/336" target="_blank">draft pull request</a>.
This pull request introduces the aggregator classes to <strong>signac-flow</strong> which enables actual aggregation.
If you want to try out how aggregation will work, you should install <strong>signac-flow</strong> in your Python virtual environment in editable mode.
You can perform the steps mentioned below to set up a development environment for <strong>signac-flow</strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/glotzerlab/signac-flow.git
cd signac-flow/
git checkout feature/introduce-aggregator-classes
pip install -e .
</code></pre></div></div>
<p>There is an existing <a href="https://github.com/glotzerlab/signac-examples/pull/15" target="_blank">pull request</a> on the <strong>signac-examples</strong> repository which may help you understand the workflow.
This example generates a plot of temperatures (in °C) versus days of a month having 31 days.
We also visually compare the temperature of every day with the average temperature of that month.</p>
<p>Please make sure to clone the repository locally and switch to the branch <code class="language-plaintext highlighter-rouge">aggregate-example</code>.
You will find the examples in the <code class="language-plaintext highlighter-rouge">notebooks/</code> directory.
You can perform the steps mentioned below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/glotzerlab/signac-examples.git
cd signac-examples/
git checkout aggregation-example
</code></pre></div></div>
<h2 id="plan-for-the-last-phase-of-gsoc">Plan for the last phase of GSoC</h2>
<p>First, I will refactor status printing in order to make it more efficient.
Testing is one of the most important parts of development.
Since I have completed a fair amount of work, I will write tests for aggregation in this phase.
I plan to submit well-documented code for GSoC 2020.
I also plan to write a few examples for <strong>signac-examples</strong> so that users can learn how to use this feature with ease.</p>
Hardik Ojha
hojha@ee.iitr.ac.in
Into the final phase
2020-08-06T00:00:00+00:00
https://signac.io/gsoc/2020/08/06/gsoc-vishav-blog-5
<p>Hi everyone, this is Vishav and I am here with the 5th iteration of my blog, “Journey to GSoC”.
If you haven’t read my previous blog, you can read it <a href="https://signac.io/gsoc/2020/07/23/gsoc-vishav-blog-4.html" target="_blank">here</a> to catch up.</p>
<p>The second phase of <strong>Google Summer of Code</strong> has been completed.
I am done with <code class="language-plaintext highlighter-rouge">Buffering</code> and almost done with <code class="language-plaintext highlighter-rouge">Caching</code>.
The work can be found <a href="https://github.com/glotzerlab/signac/pull/363" target="_blank">here</a>.</p>
<p>The final phase of <strong>Google Summer of Code</strong> has started, and it will mainly focus on documenting and testing the code written during the previous two phases.
Apart from these, the other goals are to implement different backends and <code class="language-plaintext highlighter-rouge">Lazy statepoints Loading</code>.</p>
<p>There were some issues that I faced during the implementation of <code class="language-plaintext highlighter-rouge">caching</code> that were discussed in the <a href="https://github.com/vishav1771/signac/pull/2" target="_blank">PR</a> and can also be found on my previous <a href="https://signac.io/gsoc/2020/07/23/gsoc-vishav-blog-4.html" target="_blank">blog</a>.
After discussing with my mentors, we concluded that, at the design level, the main distinction between the old implementation and the new one is that the cache needs to be stored on a per-object basis, rather than having all members access a global cache variable.
And instead of using a global buffer, we will use the cache for storing instances.</p>
<p>I worked on the hypothesis issue in this <a href="https://github.com/glotzerlab/signac/pull/373" target="_blank">PR</a>.
There were a lot of problems with hypothesis in combination with pytest fixtures.
So we decided to close this PR and approach the problem at a later date.</p>
<p>I also completed implementing the different backends (<code class="language-plaintext highlighter-rouge">redis</code>, <code class="language-plaintext highlighter-rouge">zarr</code>, <code class="language-plaintext highlighter-rouge">mongodb</code>) during this phase.
The work can be found in this <a href="https://github.com/glotzerlab/signac/pull/364" target="_blank">PR</a>.
While implementing them, I faced some issues when testing the mongo and redis backends.
The problem was that the server has to run in the continuous integration environment on CircleCI, but it is somewhat complicated to configure that server process.
We decided to have “interactive tests” that users can execute locally (with a local redis / mongo backend) but aren’t executed on CI yet.</p>
Vishav Sharma
vishavsharma1771@gmail.com
MIDAS Reproducibility Showcase
2020-08-05T00:00:00+00:00
https://signac.io/talks/2020/08/05/midas-reproducibility
<p>I was recently on a small team of Glotzer Lab members (peeps, as we say) that competed in the <a href="https://midas.umich.edu/reproducibility-challenge/">MIDAS Reproducibility Challenge</a>.
The purpose of this challenge was to “highlight quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields.”
We prepared and submitted a report highlighting our efforts in this arena.
Our submission was selected for presentation at the <a href="https://midas.umich.edu/data-reproducibility/">Reproducibility Showcase</a>, and we gave an approximately 45-minute <a href="https://youtu.be/snJxoAg6_Vw">talk on our group’s approach to reproducibility</a>.
While the group’s full software stack promotes reproducibility through our professional software engineering practices and integration with the scientific Python ecosystem, <strong>signac</strong> is the project that most directly addresses this issue.</p>
<iframe width="758" height="569" src="https://www.youtube.com/embed/snJxoAg6_Vw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>Our talk was organized as follows: Joshua Anderson gave an introduction to our software stack and software development practices, then I gave a short introduction to <strong>signac</strong>, followed by two case studies of research projects that represent our group’s efforts towards reproducible science.
My section was short, 7 minutes, so I basically had enough time to introduce the <strong>signac</strong> framework and discuss how <strong>signac-flow</strong> promotes reproducible computational research.
This was my first time giving a talk on <strong>signac</strong>, so despite the brevity of my talk, I found it difficult to put together exactly what I wanted to say.
But as we all know, insight generally lies behind difficulty, and that was certainly the case here.</p>
<p>I had two major revelations while putting together and giving this talk.
And by revelations, I mean things that I’ve known at a surface level but now fully appreciate.
Or, as they say, my knowledge grew into wisdom.
First, I more fully appreciate the ingenuity of <strong>signac</strong>’s workspace organization on the file system.
By hashing the state point into a unique id for each job, you can truly manage an extremely heterogeneous data space with no additional overhead from the complexity of that data space.
In principle, you could manage an entire Ph.D. worth of data in a single <strong>signac</strong> project (just because you <em>can</em> does not mean you <em>should</em>, but you certainly can).
As an added benefit of this organization, your project’s metadata is stored in a completely human readable format — no more trying to parse directory paths for metadata.</p>
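<p>To make the hashing idea concrete, here is a minimal sketch of turning a state point into an id. The function name <code class="language-plaintext highlighter-rouge">state_point_id</code> is hypothetical, and using MD5 over sorted-key JSON is an assumption made for illustration; the exact id calculation inside <strong>signac</strong> may differ in detail.</p>

```python
import hashlib
import json


def state_point_id(state_point):
    """Return a deterministic hex id for a JSON-serializable state point.

    Hypothetical helper for illustration, not signac's own function.
    """
    # Sorting the keys makes the id independent of insertion order.
    blob = json.dumps(state_point, sort_keys=True)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


a = state_point_id({"temperature": 1.5, "pressure": 2.0})
b = state_point_id({"pressure": 2.0, "temperature": 1.5})
# A state point with a different set of keys simply hashes to a different id.
c = state_point_id({"temperature": 1.5, "pressure": 2.0, "note": "extra key"})
```

<p>The id has the same length no matter how many keys the state point has, which is why a heterogeneous data space adds no extra organizational overhead.</p>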
<p>My second realization involves TRUE molecular simulations.
There is a <a href="https://www.tandfonline.com/doi/full/10.1080/00268976.2020.1742938">push within the molecular simulation community</a> for simulations that are TRUE, that is, Transparent, Reproducible, Usable by others, and Extensible.
I realized as I was talking (literally as I was giving the talk) that not only does <strong>signac</strong> make TRUE simulations easier to achieve, it actually makes it difficult to run simulations that aren’t TRUE.
Once a <strong>signac</strong>-managed computational research project is completed, the researcher can deposit the project and workspace in a data repository (e.g., <a href="https://deepblue.lib.umich.edu/">U-M’s Deep Blue</a>).
The indexing, searching, and filtering capabilities of <strong>signac</strong> then make the data both transparent and usable by others.
The computational workflow defined by the <strong>signac-flow</strong> project means the project is completely reproducible (given enough computational resources, if necessary).
And finally, since <strong>signac</strong> fits nicely into the scientific Python ecosystem, it is straightforward to extend the project.
Hence, by using <strong>signac</strong> and <strong>signac-flow</strong> (<a href="https://knowyourmeme.com/memes/name-a-more-iconic-duo">name a more iconic duo… I’ll wait</a>), you essentially get TRUE simulations for free.</p>
<p>All in all, this was a good experience and I am glad I participated.
Given the <a href="https://en.wikipedia.org/wiki/Replication_crisis">ongoing reproducibility crisis</a> and the über-relevant role that scientists play in the public’s response to crises, it is imperative that all researchers strive for reproducibility.
For computational work, <strong>signac</strong> minimizes the overhead of this challenge, and as a result I am proud to be a part of a research group that makes this pursuit a top priority.</p>Tim Mooremtimc@umich.eduI was recently on a small team of Glotzer Lab members (peeps, as we say) that competed in the MIDAS Reproducibility Challenge. The purpose of this challenge was to “highlight quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields.” We prepared and submitted a report highlighting our efforts in this arena. Our submission was selected to present at the Reproducibility Showcase, and we gave an approximately 45 minute talk on our group’s approach to reproducibility. While the group’s full software stack promotes reproducibility through our professional software engineering practices and integration with the scientific Python ecosystem, signac is the project that most directly addresses this issue.Aggregates, a user problem?2020-07-23T00:00:00+00:002020-07-23T00:00:00+00:00https://signac.io/gsoc/2020/07/23/gsoc-hardik-blog-4<p>Hey, welcome to the 4th blog in the <em>Series of Blogs by Hardik</em>.
In my <a href="https://signac.io/development/2020/06/26/local-SLURM-environment.html" target="_blank">last blog</a>,
I described how to set up a local SLURM cluster environment.
Please give it a read and, if possible, provide your valuable feedback.</p>
<p>In this post, I will discuss a problem that I am facing during the development of aggregation.
I will also discuss the approach I’m planning to adopt to resolve the issue.</p>
<p>So, let’s get started.</p>
<h2 id="problems-with-status-check">Problems with status check</h2>
<p>Suppose a user creates an aggregate operation which takes in 3 jobs and then performs a status check after submitting.
Currently, I have implemented the status check so that it recreates the aggregates every time we perform a status check.</p>
<p>So what’s the problem with that?</p>
<p>If the user changes the order of jobs in the project (sorted by some statepoint parameter), the aggregates will change.
Suppose the user previously had <code class="language-plaintext highlighter-rouge">Job_a</code> and <code class="language-plaintext highlighter-rouge">Job_b</code> aggregated in that order, and the order is now reversed. Since the two aggregates (<code class="language-plaintext highlighter-rouge">[Job_a, Job_b]</code> and <code class="language-plaintext highlighter-rouge">[Job_b, Job_a]</code>) are technically different, the user might want to see the status of both.
But because we recreate the aggregates every time we perform a status check, we cannot recreate the aggregates in the previous order: no information about the previous order is stored.
This means we cannot know whether any other aggregates, apart from the existing ones, were created for the operation or not.</p>
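<p>A small sketch of the effect, using plain Python rather than <strong>signac-flow</strong> itself. The job ids, the statepoint parameter <code class="language-plaintext highlighter-rouge">i</code>, and the helper <code class="language-plaintext highlighter-rouge">make_aggregates</code> are all made up for illustration:</p>

```python
def make_aggregates(job_ids, size):
    """Split an ordered list of job ids into consecutive groups of `size`."""
    return [tuple(job_ids[i:i + size]) for i in range(0, len(job_ids), size)]


# Hypothetical jobs, each with a statepoint parameter "i".
jobs = {"job_a": {"i": 1}, "job_b": {"i": 2}, "job_c": {"i": 3}, "job_d": {"i": 4}}

ascending = sorted(jobs, key=lambda job_id: jobs[job_id]["i"])
descending = sorted(jobs, key=lambda job_id: jobs[job_id]["i"], reverse=True)

# Same jobs, same group size, but none of the aggregates are the same.
before = make_aggregates(ascending, 2)
after = make_aggregates(descending, 2)
```

<p>Once the order has changed, nothing in <code class="language-plaintext highlighter-rouge">after</code> tells us that the aggregates in <code class="language-plaintext highlighter-rouge">before</code> ever existed.</p>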
<p>This problem shows up, for example, when jobs are added to the project dynamically (which can change the aggregates) or when jobs are deleted from the workspace.</p>
<h2 id="is-it-a-user-problem">Is it a user problem?</h2>
<p>Previously, I thought of this problem as strictly a user problem.
Many times, I proposed that this issue should be addressed as a user problem and that the user should be strictly warned about it.
But over time, thanks to my mentors, I came to understand the use case for aggregation, and that completely expanded my view.</p>
<p>As a researcher, I’d sometimes want to play with aggregation and see all the results I could obtain by using different aggregates for the same operation.
At the same time, if I get to know the status of every aggregate then that will be a treat for me.</p>
<p>I hope to provide users with a feature that prints a status overview of the aggregates that were previously formed for an operation, even if they are not currently generated.</p>
<h3 id="solution">Solution?</h3>
<p>To resolve this, one possible solution is to store the job-ids of the jobs in an aggregate which was <strong>queued for submission</strong>.</p>
<p>Example: <code class="language-plaintext highlighter-rouge">store_aggregates.json</code> will contain the job-ids of all the aggregates formed by every operation, in the format described below:
<code class="language-plaintext highlighter-rouge">{'operation_name': {'aggregate_id': [job_id1, job_id2, ...], ...}, ...}</code></p>
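<p>Here is a rough sketch of how such a file could be written and later used to find stored aggregates that the current job order no longer produces. The helper names and the aggregate id <code class="language-plaintext highlighter-rouge">agg_0</code> are hypothetical; this is an illustration of the idea, not the <strong>signac-flow</strong> implementation.</p>

```python
import json
import os
import tempfile


def save_aggregates(path, store):
    """Write {'operation_name': {'aggregate_id': [job_id, ...]}} to disk."""
    with open(path, "w") as file:
        json.dump(store, file, indent=2)


def missing_aggregates(path, operation_name, current):
    """Return stored aggregates of an operation that are not in `current`."""
    with open(path) as file:
        stored = json.load(file).get(operation_name, {})
    # Order matters: [a, b] and [b, a] are different aggregates.
    current_job_ids = {tuple(job_ids) for job_ids in current}
    return {
        aggregate_id: job_ids
        for aggregate_id, job_ids in stored.items()
        if tuple(job_ids) not in current_job_ids
    }


path = os.path.join(tempfile.mkdtemp(), "store_aggregates.json")
save_aggregates(path, {"op": {"agg_0": ["job_a", "job_b"]}})

# The job order has since been reversed, so the stored aggregate is missing.
missing = missing_aggregates(path, "op", [["job_b", "job_a"]])
unchanged = missing_aggregates(path, "op", [["job_a", "job_b"]])
```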
<p>Then, during status check, we fetch all the job-ids from that file and if those job-ids don’t match the job-ids of the jobs in the created aggregates then create those aggregates manually.</p>Hardik Ojhahojha@ee.iitr.ac.inHey, welcome to the 4th blog in the Series of Blogs by Hardik. In my last blog, I described how to setup a local SLURM cluster environment. Please give it a read and, if possible, provide your valuable feedback.Completion of Phase II2020-07-23T00:00:00+00:002020-07-23T00:00:00+00:00https://signac.io/gsoc/2020/07/23/gsoc-vishav-blog-4<p>Hi everyone, this is Vishav and I am here with the 4th iteration of my blog, “Journey to GSoC”.
If you haven’t read my previous blog, you can read it <a href="https://signac.io/gsoc/2020/07/10/gsoc-vishav-blog-3.html" target="_blank">here</a> and keep up.</p>
<p>The second phase of <strong>Google Summer of Code</strong> has been completed.
The goal for the second phase of the summer was to add features for buffering and caching and to rewrite the tests with hypothesis.
I am done with buffering; you can see my work in <a href="https://github.com/vishav1771/signac/pull/2" target="_blank">this PR</a>.
Caching is still in the discussion stage and will be done in the coming days.</p>
<p>During this phase, I faced many blockers but was able to solve most of them with the guidance of the mentors.
Some interesting ones are described below:</p>
<ul>
<li>
<p>I was rewriting tests with <a href="https://hypothesis.works/">hypothesis</a>.
I found that all the <code class="language-plaintext highlighter-rouge">int</code> keys are converted and saved as <code class="language-plaintext highlighter-rouge">str</code> in <code class="language-plaintext highlighter-rouge">JSONDict</code>. If we save an <code class="language-plaintext highlighter-rouge">int</code> key, we then get an error when trying to access it as an <code class="language-plaintext highlighter-rouge">int</code> key.
To solve this, we decided that every collection can have a validator (or a list of validators) that is applied to inputs, and that signac should apply the JSON validator when using synced collections, regardless of the backend.</p>
</li>
<li>
<p>While working on validation, we found that we currently only provide validation for the keys of a dictionary.
We decided to generalize this behaviour so that all input data is validated before insertion into the collection.</p>
</li>
</ul>
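<p>The key conversion in the first point is easy to reproduce with the standard <code class="language-plaintext highlighter-rouge">json</code> module. The validator below is only a hypothetical sketch of the kind of check we discussed; its name and behaviour are assumptions, not the code in the PR.</p>

```python
import json

data = {1: "one"}
# JSON object keys are always strings, so the int key comes back as "1".
round_tripped = json.loads(json.dumps(data))


def require_str_keys(value):
    """Hypothetical validator: recursively reject non-string dict keys."""
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError(
                    f"Keys must be str, got {type(key).__name__}: {key!r}"
                )
            require_str_keys(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            require_str_keys(item)
```

<p>Running a check like this before insertion turns a confusing lookup failure later on into an immediate, descriptive error.</p>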
<p>The buffering PR is still being reviewed, so I am focused on completing caching before the start of the third phase.
There is an ongoing discussion about caching in the <a href="https://github.com/vishav1771/signac/pull/2" target="_blank">PR</a>.
The main questions that need to be answered are:</p>
<ul>
<li>Can we handle the suspended synchronization by simply reading from/writing to an object-specific cache?</li>
<li>How should a cache be defined?
Are caches always in-memory objects?
If not, how do we ensure their synchronization, for instance with respect to the current project cache?</li>
<li>How do we handle multiple synced collections pointing to the same file?</li>
<li>Should we assume that all instances must point to the same cache at any given time?
Can there be multiple active caches?
If not, how do we prevent that?</li>
</ul>
<p>This marks the end of the second phase of <strong>Google Summer of Code</strong>.
For the third phase, I plan to implement different backends and <code class="language-plaintext highlighter-rouge">lazy statepoint loading</code>.</p>Vishav Sharmavishavsharma1771@gmail.comHi everyone, this is Vishav and I am here with the 4th iteration of my blog, “Journey to GSoC”. If you haven’t read my previous blog, you can read it here and keep up.Buffering2020-07-10T00:00:00+00:002020-07-10T00:00:00+00:00https://signac.io/gsoc/2020/07/10/gsoc-vishav-blog-3<p>Hi everyone, this is Vishav and I am here with the 3rd iteration of my blog, “Journey to GSoC.”
If you haven’t read my previous blog, you can read it <a href="https://signac.io/gsoc/2020/06/24/gsoc-vishav-blog-2.html" target="_blank">here</a> and keep up.</p>
<p>The first phase of <strong>Google Summer of Code</strong> has been completed.
I am done with JSON backend (<code class="language-plaintext highlighter-rouge">JSONCollection</code>, <code class="language-plaintext highlighter-rouge">JSONDict</code>, <code class="language-plaintext highlighter-rouge">JSONList</code>).
You can see the work in this <a href="https://github.com/glotzerlab/signac/pull/336" target="_blank">pull request</a>.</p>
<p>The goal for the second phase of the summer is to add features for buffering and caching.</p>
<p>In the first week of the second phase I worked to finalize the JSON backend PR from the first phase.
Some touch-ups were made and some minor issues were addressed with the help of the mentors.
Some issues/discussions that were addressed are:</p>
<ul>
<li>When I tried to split the classes (<code class="language-plaintext highlighter-rouge">SyncedCollection</code>, <code class="language-plaintext highlighter-rouge">SyncedDict</code> etc.) into different files,
I got a circular import error because I create instances of <code class="language-plaintext highlighter-rouge">JSONDict</code> and <code class="language-plaintext highlighter-rouge">JSONList</code> in <code class="language-plaintext highlighter-rouge">SyncedCollection.from_base</code>.
I discussed this problem with my mentors, and they suggested using a <code class="language-plaintext highlighter-rouge">metaclass</code> for <code class="language-plaintext highlighter-rouge">SyncedCollection</code> and registering every child class with it.</li>
<li>There were some problems with the tests. I updated them using <code class="language-plaintext highlighter-rouge">pytest.fixtures</code> and decided to use <a href="https://hypothesis.readthedocs.io/en/latest/">hypothesis</a>.</li>
<li>The code was fine but there were many discrepancies with the docstrings.
With the help of the mentors I resolved them to make the changes understandable even for new users.</li>
</ul>
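<p>The metaclass idea from the first point can be sketched as follows. All class names here are simplified stand-ins that I made up for this post (they are not the classes in the PR); the point is only that the base class finds its children through a registry instead of importing them.</p>

```python
class CollectionMeta(type):
    """Hypothetical metaclass that records every concrete subclass."""

    registry = []

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Only register classes that declare which base type they wrap.
        if "base_type" in namespace:
            CollectionMeta.registry.append(cls)


class Collection(metaclass=CollectionMeta):
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_base(cls, data):
        # Look up the right subclass at call time instead of importing it.
        for registered in CollectionMeta.registry:
            if isinstance(data, registered.base_type):
                return registered(data)
        raise TypeError(f"No registered collection for {type(data).__name__}")


class DictCollection(Collection):
    base_type = dict


class ListCollection(Collection):
    base_type = list
```

<p>Because the base class never names its subclasses, each subclass can live in its own file and import the base class without creating an import cycle.</p>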
<p>With all the above resolved, I finally started working on the buffering.</p>
<h2 id="buffering">Buffering</h2>
<p>In buffering, we suspend the synchronization with the backend and the data is temporarily stored in a buffer.
All write operations are written to the buffer, and read operations are performed from the buffer whenever possible.
When we exit the buffering mode, all the buffered data is written to the backend.
Buffering provides better performance because the read and write operations are done in memory.</p>
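<p>To illustrate the concept independently of signac, here is a toy key-value store that defers its writes while buffering. The class <code class="language-plaintext highlighter-rouge">BufferedJSONStore</code> is hypothetical and much simpler than the real synced collections; for instance, it ignores buffer overflow and concurrent access.</p>

```python
import json
import os
import tempfile
from contextlib import contextmanager


class BufferedJSONStore:
    """Toy key-value store that can defer writes to its JSON file."""

    def __init__(self, path):
        self.path = path
        self._buffer = None  # None means "not buffering"

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as file:
            return json.load(file)

    def _save(self, data):
        with open(self.path, "w") as file:
            json.dump(data, file)

    def __getitem__(self, key):
        data = self._load() if self._buffer is None else self._buffer
        return data[key]

    def __setitem__(self, key, value):
        if self._buffer is None:
            data = self._load()
            data[key] = value
            self._save(data)  # every write hits the disk
        else:
            self._buffer[key] = value  # in memory only

    @contextmanager
    def buffered(self):
        self._buffer = self._load()
        try:
            yield self
        finally:
            # Flush once on exit, then resume normal synchronization.
            self._save(self._buffer)
            self._buffer = None


store = BufferedJSONStore(os.path.join(tempfile.mkdtemp(), "test.json"))
with store.buffered():
    store["a"] = "buffered"
    written_during_buffering = os.path.exists(store.path)
```

<p>Inside the <code class="language-plaintext highlighter-rouge">with</code> block nothing touches the file; the single write happens on exit.</p>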
<h3 id="api-for-buffering">API for Buffering</h3>
<p>Buffering will be provided by <code class="language-plaintext highlighter-rouge">signac.buffered</code> and <code class="language-plaintext highlighter-rouge">SyncedCollection.buffered</code>.
These methods provide a context manager for buffering mode.</p>
<p><code class="language-plaintext highlighter-rouge">signac.buffered</code> provides a global buffered mode in which all instances of synced data structures, such as <code class="language-plaintext highlighter-rouge">JSONDict</code> and <code class="language-plaintext highlighter-rouge">JSONList</code>, are buffered.
All write operations are deferred until the <code class="language-plaintext highlighter-rouge">flush_all</code> function is called, the buffer overflows, or upon exiting the buffer mode.</p>
<p>This is a typical example of <code class="language-plaintext highlighter-rouge">signac.buffered</code>:</p>
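<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import signac

jsd = signac.JSONDict('test.json')
with signac.buffered():
    # Writes inside the block go to the buffer, not to test.json.
    jsd.a = 'buffered'
# The buffered data is written to the backend on exit.
assert jsd.a == 'buffered'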
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">SyncedCollection.buffered</code> is a context manager provided by individual instances of synced data structures.
All write operations are deferred until the <code class="language-plaintext highlighter-rouge">flush</code> function is called or upon exiting the buffer mode.</p>
<p>This is a typical example of <code class="language-plaintext highlighter-rouge">SyncedCollection.buffered</code>:</p>
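<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import signac

jsd = signac.JSONDict('test.json')
with jsd.buffered() as b:
    # Writes go through the buffered view b.
    b.a = 'buffered'
# Exiting the block writes the buffered data back.
assert jsd.a == 'buffered'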
</code></pre></div></div>Vishav Sharmavishavsharma1771@gmail.comHi everyone, this is Vishav and I am here with the 3rd iteration of my blog, “Journey to GSoC.” If you haven’t read my previous blog, you can read it here and keep up.Coding Aggregation Begins…2020-06-26T00:00:00+00:002020-06-26T00:00:00+00:00https://signac.io/gsoc/2020/06/26/gsoc-hardik-blog-2<p>Hey, welcome back.
If you’re new to the <em>Series of Blogs by Hardik</em>, you may enjoy reading my <a href="https://signac.io/gsoc/2020/06/11/gsoc-hardik-blog-1.html" target="_blank">first blog</a>.
In this post, I will briefly describe the proposed API, my approach to the project, and the progress I’ve made so far.</p>
<h2 id="proposed-api-for-aggregation">Proposed API for Aggregation</h2>
<p>The API proposed introduces two new classes, <code class="language-plaintext highlighter-rouge">aggregate</code> and <code class="language-plaintext highlighter-rouge">select</code>.
These are decorator classes, meaning you should use them as decorators for an operation.</p>
<p>The <code class="language-plaintext highlighter-rouge">aggregate</code> class will be responsible for aggregation of jobs.
Using a <code class="language-plaintext highlighter-rouge">pre</code> condition for filtering can at times get confusing when it comes to aggregates, so the <code class="language-plaintext highlighter-rouge">select</code> class filters the jobs for you.</p>
<p>This is a typical example of how to use aggregation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@aggregate.groupsof(4, sort='i', reverse=True)
@select(lambda job: job.sp.i > 4)
@FlowProject.operation
def op(*jobs):
    print(jobs)
</code></pre></div></div>
<p><em><strong>Please Note</strong>: The above API is extremely provisional. Hence users should not refer to this as a guide to use aggregation.</em></p>
<p>After aggregation, <strong>signac-flow</strong> will work on the concept that <em>every operation is an aggregate operation.</em>
This means that internally, all operations will be treated as aggregate operations, which enables <strong>signac-flow</strong> to have consistent logic for generating submission scripts and status outputs.</p>
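<p>A rough sketch of what this means, in plain Python. The helpers <code class="language-plaintext highlighter-rouge">as_aggregate_operation</code> and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical, and jobs are represented by plain integers; this is not how <strong>signac-flow</strong> is implemented, only an illustration of treating a per-job operation as an aggregate of size one.</p>

```python
def as_aggregate_operation(operation):
    """Wrap a single-job operation so it accepts an aggregate of one job."""
    def wrapper(*jobs):
        if len(jobs) != 1:
            raise ValueError("This operation expects an aggregate of one job.")
        return operation(jobs[0])
    return wrapper


def run(operation, aggregates):
    """Single code path: call the operation once per aggregate."""
    return [operation(*aggregate) for aggregate in aggregates]


def label(job):  # ordinary per-job operation
    return f"job-{job}"


def total(*jobs):  # operation written for aggregates
    return sum(jobs)


jobs = [1, 2, 3, 4]
# Each job becomes its own aggregate of size one.
single = run(as_aggregate_operation(label), [(job,) for job in jobs])
grouped = run(total, [(1, 2), (3, 4)])
```

<p>With this view, the code that runs, submits, or reports status only ever has to deal with aggregates.</p>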
<h2 id="approach-for-implementation">Approach for Implementation</h2>
<p><strong>signac-flow</strong> supports 6 commands for a <code class="language-plaintext highlighter-rouge">FlowProject</code>’s command line interface: <code class="language-plaintext highlighter-rouge">run</code>, <code class="language-plaintext highlighter-rouge">exec</code>, <code class="language-plaintext highlighter-rouge">next</code>, <code class="language-plaintext highlighter-rouge">submit</code>, <code class="language-plaintext highlighter-rouge">script</code>, and <code class="language-plaintext highlighter-rouge">status</code>.</p>
<p>For the community bonding period, the approach to this project was trying to figure out how these commands get executed.
For the coding period, the strategy will be to work on these commands separately in the order listed above: first ensure that the <code class="language-plaintext highlighter-rouge">run</code> command works properly, then move on to <code class="language-plaintext highlighter-rouge">exec</code>, and so on.</p>
<h2 id="progress-on-the-project">Progress on the Project</h2>
<p>For now, I have the first iteration of my project ready for review in <a href="https://github.com/glotzerlab/signac-flow/pull/289" target="_blank">this pull request</a>.
The <strong>signac</strong> team made this possible through their valuable help, especially during the community bonding period.</p>
<p>The code is not yet optimized, and I plan on optimizing the code as soon as I get reviews.
As far as the code quality is concerned, I have followed the <a href="https://github.com/glotzerlab/signac-flow/blob/master/CONTRIBUTING.md#code-style" target="_blank">community guidelines for code style</a>.</p>
<p>A huge thanks to the mentors for their valuable efforts in bringing out the best in me.
I also hope to provide users with some syntactic sugar.
Thank you for sticking with me again. In my next blog, I’ll explain in detail how <strong>signac-flow</strong> works with aggregates.</p>Hardik Ojhahojha@ee.iitr.ac.inHey, welcome back. If you’re new to the Series of Blogs by Hardik, you may enjoy reading my first blog. In this post, I will briefly describe the proposed API, my approach to the project, and the progress I’ve made so far.