.. tutorial:: Loading From and Saving To S3 Buckets
    :tags: topic_load_save

    A lesson on using s3-fuse with Iris to load/save data from/to S3 buckets.

.. _s3_io:

Loading From and Saving To S3 Buckets
=====================================

For cloud computing, it is natural to want to access data storage based on URIs.
At the present time, by far the most widely used platform for this is
`Amazon S3 "buckets" `_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects. S3 access URLs can also contain a nested
`'prefix string' `_ structure, which naturally mirrors sub-directories in a
file-system.

While it would be possible for Iris to support S3 access directly, as it does the
"OPeNDAP" protocol for netCDF data, this approach has some serious limitations :
most notably, each supported file format would have to be separately extended to
support S3 URLs in place of file paths for loading and saving.

Instead, we have found that it is most practical to perform this access using a
virtual file system approach. However, one drawback is that this is best
controlled *outside* the Python code -- see details below.

TL;DR
-----

Install s3-fuse and use its ``s3fs`` command to create a file-system mount which
maps to an S3 bucket. S3 objects can then be accessed as regular files (read and
write).

Fsspec, S3-fs, fuse and s3-fuse
-------------------------------

This approach depends on a set of related code solutions, as follows:

`fsspec `_
is a general framework for implementing Python-file-like access to alternative
storage resources.

`s3fs `_
is a package based on fsspec, which enables Python to "open" S3 data objects as
Python file-like objects for reading and writing.

`fuse `_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privilege.

`s3-fuse `_
is a utility based on s3fs and fuse, which provides a POSIX-compatible "mount" so
that an S3 bucket can be accessed as a regular Unix file system.

Practical usage
---------------

Of the above, the only thing you actually need to know about is **s3-fuse**.

There is an initial one-time setup, and there are also actions to take each time
you want to access S3 from Python: one before launching Python, and one after it
exits.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~

The most reliable method is to install into your Linux O.S. See
`installation instructions `_ .
This presumes that you perform a system installation with ``apt``, ``yum`` or
similar.

If you do not have the necessary 'sudo' or root access permissions, we have found
that it is sufficient to install only **into your Python environment**, using
conda. Though this is not the suggested route, it appears to work on the Unix
systems where we have tried it -- e.g.

.. code-block:: bash

    $ conda install s3-fuse

( Or better, put it into a reusable 'spec file', with all other requirements, and
then use ``$ conda create --file ...`` ).

.. note::

    It is **not** possible to install s3fs-fuse into a Python environment with
    ``pip``, as it is not a Python package.

Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You need an empty directory in your existing filesystem tree, that you will map
your S3 bucket **onto** -- e.g.

.. code-block:: bash

    $ mkdir -p /home/self.me/s3_root/testbucket_mountpoint

Set up AWS credentials
~~~~~~~~~~~~~~~~~~~~~~

Provide S3 access credentials in an AWS credentials file, as described
`here in the s3-fuse documentation `_.

There is a general introduction to AWS credentials
`here in the AWS documentation `_
which should explain what you need here.

Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Activate your Python environment, which then gives access to the **s3-fuse** Linux
command "s3fs".

Map your S3 bucket "into" the chosen empty directory -- e.g.

.. code-block:: bash

    $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

    You can now freely list/access contents of your bucket at this path
    -- including updating or writing files.

.. note::

    This performs a Unix file-system "mount" operation, which temporarily
    modifies your system. This change is not part of the current environment, and
    is not limited to the scope of the current process.

    If you reboot, the mount will disappear. If you log out and log in again,
    there can be problems : ideally you should avoid this by always "unmounting"
    (see below).

.. note::

    The command for mounting an s3-fuse filesystem is ``s3fs`` - this should not
    be confused with the similarly named s3fs Python package.

Within Python code
^^^^^^^^^^^^^^^^^^

You can now access objects at the remote S3 URL via the mount point on your local
file system that you just created with ``s3fs``, e.g.

.. code-block:: python

    >>> import iris
    >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
    >>> cubes = iris.load(path)

After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When you have finished accessing the S3 objects in the mounted virtual filesystem,
it is a good idea to **unmount** it. Before doing this, make sure that all file
handles to the objects have been closed and there are no terminals open in that
directory.

.. code-block:: bash

    $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

    The ``umount`` command is a standard Unix command. It may not always succeed,
    in which case some kind of retry may be needed -- see detail notes below.

    The mount created will not survive a system reboot, nor does it function
    correctly if the user logs out and logs in again.

    Presumably, problems could occur if repeated operation were to create a very
    large number of mounts, so unmounting after use does seem advisable.

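
Because the mount is created and removed outside Python, a script cannot assume
that it is actually in place. If it is not, a load will typically fail with a
missing-file error, and a save may end up in the local directory rather than in
the bucket. The following is a minimal sketch of a guard against this, using the
standard library function ``os.path.ismount``. The names ``require_mounted`` and
``load_from_bucket`` are hypothetical, chosen for illustration, and are not part
of Iris or s3-fuse.

.. code-block:: python

    import os


    def require_mounted(mount_point):
        """Return ``mount_point`` unchanged, or raise if nothing is mounted there."""
        if not os.path.ismount(mount_point):
            raise RuntimeError(
                f"{mount_point!r} is not a mount point: run the 's3fs' command first"
            )
        return mount_point


    def load_from_bucket(mount_point, relative_path):
        """Load cubes from a file beneath a mounted bucket (hypothetical helper)."""
        import iris  # imported here so the guard above works without Iris installed

        # Check the mount before building the path, so failures are explicit.
        path = os.path.join(require_mounted(mount_point), relative_path)
        return iris.load(path)

With the example paths used above, this could be called as
``load_from_bucket("/home/self.me/s3_root/testbucket_mountpoint", "sub_dir/a_file.nc")``.
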

Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* **s3fs** supports random access to "parts" of a file, allowing efficient handling
  of datasets larger than memory without requiring the data to be explicitly
  sharded in storage.

* **s3-fuse** is transparent to file access within Python, including Iris
  load+save or other files accessed via a Python 'open' : the S3 data appears to
  be files in a regular file-system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris, or Python.

* "mounting" avoids the need for the Python code to dynamically connect to /
  disconnect from an S3 bucket.

* the "umount problem" (see below) is managed at the level of the operating
  system, where it occurs, instead of trying to allow for it in Python code. This
  means it could be managed differently in different operating systems, if needed.

* it also works with many other cloud object-storage platforms, though with extra
  required dependencies in some cases.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details.

CONs
^^^^

* only works on Unix-like O.S.

* requires the "fuse" kernel module to be supported in your O.S.
  This is usually installed by default, but may not always be.
  See `'fuse' kernel module `_
  for more detail.

* the file-system virtualisation may not be perfect : some file-system operations
  might not behave as expected, e.g. with regard to file permissions or system
  information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/umount context.

* some similar cloud object-storage platforms are *not* supported.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details of those which are.

Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the
  behaviours of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python "open" call -- still less when opening a
  file with an underlying file-format such as netCDF4 or HDF5 (since these are
  usually implemented in other languages such as C).

  Nor can you interrogate file paths or system metadata, e.g. permissions.

  So, the key benefit offered by **s3-fuse** is that all functions are mapped onto
  regular O.S. file-system calls -- so the file-format never needs to know that
  the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but
  the s3-fuse approach avoids the need for copying, and thus in a cloud
  environment also the cost and maintenance of a "local disk".

  s3fs also allows the software to access only *required* parts of a file, without
  copying the whole content. This is essential for efficient use of large
  datasets, e.g. when larger than available memory.

* It is also possible to use **s3-fuse** to establish the mounts *from within
  Python*. However, we have considered integrating this into Iris and rejected it
  because of unavoidable problems : namely, the "umount problem" (see below).
  For details, see : https://github.com/SciTools/iris/pull/6731

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy
  way to guarantee that this succeeds, since it can often get a "target is busy"
  error.

  This "umount problem" is a known problem in Unix generally : see
  `here `_ .

  It can only be resolved by a delay + retry.

.. _Non-Amazon S3: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Non-Amazon-S3
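
The delay + retry approach to the "umount problem" could be sketched in Python
roughly as follows. This is only an illustration under stated assumptions:
``unmount_with_retry`` and its parameters are hypothetical names, not part of Iris
or s3-fuse, and the default retry count and delay are arbitrary.

.. code-block:: python

    import subprocess
    import time


    def unmount_with_retry(mount_point, retries=5, delay=2.0, command=("umount",)):
        """Try to unmount ``mount_point``, pausing and retrying on failure.

        Returns True as soon as the command succeeds, or False if every
        attempt fails.
        """
        for attempt in range(1, retries + 1):
            result = subprocess.run(
                [*command, mount_point], capture_output=True, text=True
            )
            if result.returncode == 0:
                return True
            # A failure such as "target is busy" may clear once open handles
            # are released, so wait before the next attempt (not after the last).
            if attempt < retries:
                time.sleep(delay)
        return False

For example, ``unmount_with_retry("/home/self.me/s3_root/testbucket_mountpoint")``
returns ``True`` once an attempt succeeds. The ``command`` parameter is included
so that a different unmount command can be substituted where one is needed.
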