DataSink S3 support #1316


Merged (55 commits, Feb 3, 2016)

Commits
4b3f926
add resource multiproc plugin
carolFrohlich Sep 29, 2015
6f4690b
callback functions write log
carolFrohlich Sep 29, 2015
52da583
fix multiproc tests. create lot 2 json converter
carolFrohlich Sep 30, 2015
ffb4756
fix comments and logs
carolFrohlich Sep 30, 2015
0890e81
fix tests
carolFrohlich Oct 1, 2015
b3c6afc
Modified the DataSink class and DataSinkInputSpec class to be able to…
pintohutch Oct 6, 2015
4b02558
Removed redundant imports
pintohutch Oct 6, 2015
42f0b1b
Quick cosmetic fix
pintohutch Oct 6, 2015
872e752
scheduler does not sleep
carolFrohlich Oct 7, 2015
e465c28
clean code
carolFrohlich Oct 8, 2015
e49965c
draw gant chart, small fixes
carolFrohlich Oct 8, 2015
34acdf8
add memory and thread to gantt chart, callback handles errors
carolFrohlich Oct 8, 2015
c9c92ef
Added handling of DataSink to save to a local directory if it cant ac…
pintohutch Oct 8, 2015
cb07b5a
add tests
carolFrohlich Oct 9, 2015
827d2c2
fix method name
carolFrohlich Oct 9, 2015
70897b2
Merge branch 'master' of https://github.com/carolFrohlich/nipype
carolFrohlich Oct 9, 2015
0856bca
Merge branch 'master' of https://github.com/dclark87/nipype
carolFrohlich Oct 9, 2015
a8f8006
fix typos
carolFrohlich Oct 9, 2015
300d20c
Update io.py
pintohutch Oct 15, 2015
0503c23
Added md5 checking for s3
pintohutch Oct 15, 2015
e3ad668
Merge pull request #1 from FCP-INDI/master
pintohutch Oct 15, 2015
f6cfad7
Added message about file already existing
pintohutch Oct 15, 2015
0529444
Merge pull request #2 from dclark87/master
carolFrohlich Oct 16, 2015
f107efd
Merge pull request #1 from dclark87/patch-1
carolFrohlich Oct 16, 2015
fdcab2a
Merge pull request #2 from FCP-INDI/master
pintohutch Oct 21, 2015
186d00a
Fixed divide by 0 bug
pintohutch Oct 21, 2015
f77371b
Added upper/lower case support for S3 prefix
pintohutch Oct 30, 2015
e2f51f6
Added support for both non-root and root AWS creds in DataSink
pintohutch Nov 3, 2015
f34b6d6
Merge pull request #3 from dclark87/master
pintohutch Nov 12, 2015
350fd4a
add attribute real_memory to interface, change attr memory to estimat…
carolFrohlich Nov 25, 2015
f74fe25
Added real memory recording to plugin
ccraddock Nov 25, 2015
1e66b86
Added initial code for getting used memory of node
ccraddock Nov 25, 2015
716f923
Fixed logging of real memory
ccraddock Dec 2, 2015
ff7959a
Added per node runtime logging
ccraddock Dec 2, 2015
d25afb5
Removed debugging print statements
pintohutch Dec 10, 2015
00a470b
sync with master
carolFrohlich Dec 30, 2015
89d7e9c
Added fakes3 integration with datasink and started adding a local_cop…
pintohutch Jan 7, 2016
613d8cb
Merge branch 'master' of https://github.com/fcp-indi/nipype
pintohutch Jan 7, 2016
a70c81e
Finished adding local_copy logic and passed all unit tests
pintohutch Jan 8, 2016
2af5c1d
Removed memory profiler stuff for now
pintohutch Jan 8, 2016
b7e9309
Removed the memory profiler code to just pull in s3 datasink code
pintohutch Jan 8, 2016
0e5e0e9
Removed unnecessary import
pintohutch Jan 8, 2016
0f78025
Removed unnecessary function argument
pintohutch Jan 8, 2016
15f3ced
Corrected Carol's in fsl interface code
pintohutch Jan 8, 2016
ca4bed5
Removed all of the ResourceMultiProc plugin so the S3 datasink
pintohutch Jan 11, 2016
0d7419e
Manually fixed conflicts
pintohutch Jan 12, 2016
0e6a42b
Merge branch 'nipy-master' into s3_datasink
pintohutch Jan 12, 2016
ecb05e2
Found merge HEAD comment and removed
pintohutch Jan 12, 2016
ee70359
Removed print statements from fakes3 checker and made it a check at t…
pintohutch Jan 12, 2016
7ecaefd
Changed fakes3_found to fakes3
pintohutch Jan 12, 2016
818da99
Fixed Python3 compatibility bug in exception raising
pintohutch Jan 13, 2016
49c14f8
Made exceptions more explicit
pintohutch Jan 13, 2016
a9dd168
Removed S3DataSink and changed dummy file writing to be Python2/3 com…
pintohutch Jan 14, 2016
c2eedc7
Added aws.rst file documenting use of new S3 capabilities in the Data…
pintohutch Feb 2, 2016
c0d148a
Removed bucket from being an attribute of the DataSink and just made …
pintohutch Feb 3, 2016
102 changes: 102 additions & 0 deletions doc/users/aws.rst
@@ -0,0 +1,102 @@
.. _aws:

============================================
Using Nipype with Amazon Web Services (AWS)
============================================
Several groups have been successfully using Nipype on AWS. This procedure
involves setting up a temporary cluster using StarCluster and potentially
transferring files to/from S3. The latter is supported by Nipype through the
DataSink and S3DataGrabber interfaces.


Using DataSink with S3
======================
The DataSink class now supports sending output data directly to an AWS S3
bucket. It does this by introducing several new input attributes to the
DataSink interface and by parsing the ``base_directory`` attribute. This class
uses the `boto3 <https://boto3.readthedocs.org/en/latest/>`_ and
`botocore <https://botocore.readthedocs.org/en/latest/>`_ Python packages to
interact with AWS. To configure the DataSink to write data to S3, the user must
set the ``base_directory`` property to an S3-style filepath. For example:

::

import nipype.interfaces.io as nio
ds = nio.DataSink()
ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'

With the "s3://" prefix in the path, the DataSink knows that the output
directory to send files is on S3 in the bucket "mybucket". "path/to/output/dir"
is the relative directory path within the bucket "mybucket" where output data
will be uploaded to (NOTE: if the relative path specified contains folders that
don’t exist in the bucket, the DataSink will create them). The DataSink treats
the S3 base directory exactly as it would a local directory, maintaining support
for containers, substitutions, subfolders, "." notation, etc to route output
data appropriately.
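
For example, a minimal sketch combining a container and substitutions with an
S3 base directory (the bucket name, subject id, and substitution pattern are
illustrative only):

::

import nipype.interfaces.io as nio
ds = nio.DataSink()
ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'
# route this subject's outputs into its own subfolder of the bucket path
ds.inputs.container = 'sub001'
# rewrite unwieldy interface-generated names before upload
ds.inputs.substitutions = [('_realign0', 'realign')]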

There are four new attributes introduced with S3-compatibility: ``creds_path``,
``encrypt_bucket_keys``, ``local_copy``, and ``bucket``.

::

ds.inputs.creds_path = '/home/user/aws_creds/credentials.csv'
ds.inputs.encrypt_bucket_keys = True
ds.inputs.local_copy = '/home/user/workflow_outputs/local_backup'

``creds_path`` is the path to the file where the user's AWS credentials
(typically a csv) are stored. The credentials file should contain the AWS
access key id and secret access key, formatted as one of the following (these
are the formats Amazon uses when the credentials file is first downloaded).

Root-account user:

::

AWSAccessKeyID=ABCDEFGHIJKLMNOP
AWSSecretKey=zyx123wvu456/ABC890+gHiJk

IAM-user:

::

User Name,Access Key Id,Secret Access Key
"username",ABCDEFGHIJKLMNOP,zyx123wvu456/ABC890+gHiJk

The ``creds_path`` is necessary when writing files to a bucket that has
restricted access (almost no buckets are publicly writable). If ``creds_path``
is not specified, the DataSink will check the ``AWS_ACCESS_KEY_ID`` and
``AWS_SECRET_ACCESS_KEY`` environment variables and use those values for bucket
access.
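
For example, a minimal sketch of supplying credentials through the environment
instead of a credentials file (the key values below are placeholders):

::

import os
import nipype.interfaces.io as nio

# export placeholder credentials before building the workflow; the DataSink
# falls back to these environment variables when creds_path is not set
os.environ['AWS_ACCESS_KEY_ID'] = 'ABCDEFGHIJKLMNOP'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'zyx123wvu456/ABC890+gHiJk'

ds = nio.DataSink()
ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'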

``encrypt_bucket_keys`` is a boolean flag that indicates whether to encrypt the
output data on S3, using server-side AES-256 encryption. This is useful if the
data being output is sensitive and one desires an extra layer of security on the
data. By default, this is turned off.

``local_copy`` is the filepath of a local directory where copies of the output
data are stored in addition to those sent to S3. This is useful if one wants to
keep a backup version of the data on a local machine. By default, this is
turned off.

``bucket`` is a boto3 Bucket object that the user can use to override the
bucket inferred from ``base_directory``. This can be useful if one has to
manually create a bucket instance with special credentials (or against a mock
server such as `fakes3 <https://github.com/jubos/fake-s3>`_). It is typically
used by developers unit-testing the DataSink class; most users do not need
this attribute for actual workflows. This is an optional argument, illustrated
in the sketch below.
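
As a rough sketch, assuming a fakes3 server is already listening on
localhost port 4567 (the endpoint, credentials, and bucket name here are
assumptions for illustration, not part of the DataSink itself):

::

import boto3
from botocore.client import Config
import nipype.interfaces.io as nio

# build a Bucket object pointed at a local fakes3 endpoint rather than AWS
resource = boto3.resource(
    's3',
    endpoint_url='http://localhost:4567',
    aws_access_key_id='fakekey',
    aws_secret_access_key='fakesecret',
    config=Config(signature_version='s3'))

ds = nio.DataSink()
ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'
# hand the pre-built Bucket to the DataSink instead of letting it connect itself
ds.inputs.bucket = resource.Bucket('mybucket')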

Finally, the user only needs to specify the input attributes for any incoming
data to the node, and the outputs will be written to the S3 bucket.

::

workflow.connect(inputnode, 'subject_id', ds, 'container')
workflow.connect(realigner, 'realigned_files', ds, 'motion')

So, for example, outputs for sub001's realigned_file1.nii.gz will be in:
``s3://mybucket/path/to/output/dir/sub001/motion/realigned_file1.nii.gz``
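
Putting these pieces together, a minimal end-to-end sketch (the identity node
stands in for a real processing node such as a realigner, and all paths, node
names, and bucket names are illustrative only):

::

import nipype.pipeline.engine as pe
import nipype.interfaces.utility as niu
import nipype.interfaces.io as nio

# stand-in upstream node; in a real workflow this would be e.g. a realigner
inputnode = pe.Node(niu.IdentityInterface(fields=['in_file']), name='inputnode')
inputnode.inputs.in_file = '/home/user/data/sub001/func.nii.gz'

ds = pe.Node(nio.DataSink(), name='sinker')
ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'
ds.inputs.creds_path = '/home/user/aws_creds/credentials.csv'
ds.inputs.encrypt_bucket_keys = True
ds.inputs.local_copy = '/home/user/workflow_outputs/local_backup'

wf = pe.Workflow(name='s3_datasink_example', base_dir='/tmp/nipype_work')
wf.connect(inputnode, 'in_file', ds, 'func')
wf.run()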


Using S3DataGrabber
======================
Coming soon...
1 change: 1 addition & 0 deletions doc/users/index.rst
@@ -38,6 +38,7 @@
spmmcr
mipav
nipypecmd
aws


