So you want to list all your S3 files

TL;DR Amazon S3's AWS4-HMAC-SHA256-only regions (e.g. eu-central-1 in Frankfurt) are a mess when combined with boto, the go-to Python library.

Your non-public S3 buckets are spread over several regions/locations, because that is how life goes. But no problem - S3 is available from everywhere, right?

OK - just write a script to enumerate them all. You like Python, so boto it is.

pa15(~) $ pip install boto
Downloading/unpacking boto
  Downloading boto-2.36.0-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: boto
Successfully installed boto
Cleaning up...

Nice. So we want to start by enumerating all buckets, then go through the buckets and enumerate all the objects (files in non-S3-land).

s3-1.py (Source)

import boto

# Placeholder credentials for the IAM user described below.
ACCESS_KEY_ID = '...'
SECRET_ACCESS_KEY = '...'

def list_all_buckets(s3):
    buckets = s3.get_all_buckets()
    return buckets

def main():
    s3 = boto.connect_s3(
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=SECRET_ACCESS_KEY)
    buckets = list_all_buckets(s3)

    print([bucket.name for bucket in buckets])

main()

To get that working, create a user in AWS IAM and attach a policy like this to it:

s3-minimal-policy.json (Source)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
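
If you would rather script that step too, boto's IAM API can attach the JSON above as an inline user policy. A quick sketch - the user name s3-lister and the policy name are made up for illustration:

import boto

# Attach the JSON above as an inline policy. The user and policy
# names ('s3-lister', 's3-minimal-policy') are placeholders.
# Credentials come from the environment or your boto config here.
iam = boto.connect_iam()
with open('s3-minimal-policy.json') as f:
    policy_json = f.read()
iam.put_user_policy('s3-lister', 's3-minimal-policy', policy_json)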

What happens when we run that?

pa15(~/s3) $ python s3-1.py
[u'pan-duplicity', u'invoices-padelt', u'ipa-backup', u'logs-fra', u'padelt-logs', u'pan-cloudtrail', u'pan-duplicity.s3.amazonaws.com', u'ws-www4']

Well, that was... easy. Nice!

So, listing the objects in each bucket should be easy, too.

s3-2.py (Source)

import boto

# Same placeholder credentials as before.
ACCESS_KEY_ID = '...'
SECRET_ACCESS_KEY = '...'

def list_all_buckets(s3):
    buckets = s3.get_all_buckets()
    return buckets

def main():
    s3 = boto.connect_s3(
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=SECRET_ACCESS_KEY)
    buckets = list_all_buckets(s3)

    for bucket in buckets:
        print("First objects from bucket '{}' in location {}:".
            format(bucket.name, bucket.get_location()))
        # Show at most the first four objects per bucket.
        for i, obj in enumerate(bucket.list()):
            if i > 3:
                break
            print("  {}".format(repr(obj)))

main()

pa15(~/s3) $ python s3-2.py
First objects from bucket 'pan-duplicity' in location EU:
  <Key: duplicity/duplicity-full-signatures.20150126T003601Z.sigtar.gpg>
  <Key: duplicity/duplicity-full.20141130T003414Z.manifest.gpg>
First objects from bucket 'invoices-padelt' in location eu-west-1:
  <Key: invoices-padelt,8950330-aws-billing-csv-2014-05.csv>
  <Key: invoices-padelt,8950330-aws-billing-csv-2014-06.csv>
  <Key: invoices-padelt,8950330-aws-billing-csv-2014-07.csv>
  <Key: invoices-padelt,8950330-aws-billing-csv-2014-08.csv>
Traceback (most recent call last):
  File "s3-2.py", line 26, in <module>
    main()
  File "s3-2.py", line 20, in main
    format(bucket.name, bucket.get_location()))
  File "/usr/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 1144, in get_location
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>...</RequestId><HostId>...</HostId></Error>

Well... what? Let's unpack what happened here:

  • First of all: What is location EU? There are two regions in the European Union, eu-west-1 (Ireland) and eu-central-1 (Frankfurt/Main). In fact both buckets in the listing are located in eu-west-1 (as displayed in the AWS console) - EU is just a legacy alias, as the sketch after this list shows.
  • Suddenly, on the third bucket, S3 is telling me I am not using the right mechanism. You know what? That's boto's job, not mine. In fact, abstracting that away is the sole reason not to roll everything yourself.
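
For reference, get_location() returns S3's location constraint, not a region name: the empty string stands for us-east-1 and the legacy alias EU stands for eu-west-1. A tiny helper (my own invention, not part of boto) normalizes that:

# Map boto's get_location() values to real region names. '' is the
# location constraint of us-east-1, 'EU' is a legacy alias for
# eu-west-1; everything else is already a proper region name.
LOCATION_ALIASES = {
    '': 'us-east-1',
    'EU': 'eu-west-1',
}

def bucket_region(bucket):
    location = bucket.get_location()
    return LOCATION_ALIASES.get(location, location)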

The version 4 HMAC auth situation

Now what does every self-respecting developer do first? Right, google it. Or bing it, if you are so inclined. Eventually you may come upon the code in boto that does the actual version 4 HMAC auth for S3. Just read the comments in what is the canonical Python library for S3 - I feel pity for those developers.

But I digress. We need this working - like - now, because ain't nobody got time for that. By now you should have guessed: the Frankfurt/Main region enforces version 4 signatures. So how do you make boto use them? Maybe Stackoverflow knows? That question is still unanswered - but it gives us another term to search for: use-sigv4

Hmm, something completely unrelated suggests using an environment variable to make boto do what you (seem to) want.

More digging leads into the boto source and that contains some gems:

s3-3.py (Source)

# Excerpt from boto's auth module (boto/auth.py):
SIGV4_DETECT = [
    '.cn-',
    # In eu-central we support both host styles for S3
    '.eu-central',
    '-eu-central',
]

def detect_potential_s3sigv4(func):
    def _wrapper(self):
        if os.environ.get('S3_USE_SIGV4', False):
            return ['hmac-v4-s3']

        if boto.config.get('s3', 'use-sigv4', False):
            return ['hmac-v4-s3']

        if hasattr(self, 'host'):
            # If you're making changes here, you should also check
            # ``boto/iam/connection.py``, as several things there are also
            # endpoint-related.
            for test in SIGV4_DETECT:
                if test in self.host:
                    return ['hmac-v4-s3']

        return func(self)
    return _wrapper
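
So the excerpt offers two switches: the S3_USE_SIGV4 environment variable, or a use-sigv4 entry in the [s3] section of boto's config file, which would go into ~/.boto like this:

[s3]
use-sigv4 = True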

So before thinking too much about it, let's try what the code suggests:

pa15(~/s3) $ S3_USE_SIGV4=1 python s3-2.py
Traceback (most recent call last):
  File "s3-2.py", line 26, in <module>
    main()
  File "s3-2.py", line 15, in main
    aws_secret_access_key=SECRET_ACCESS_KEY)
  File "/usr/local/lib/python2.7/site-packages/boto/__init__.py", line 141, in connect_s3
    return S3Connection(aws_access_key_id, aws_secret_access_key, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/boto/s3/connection.py", line 196, in __init__
    "When using SigV4, you must specify a 'host' parameter."
boto.s3.connection.HostRequiredError: BotoClientError: When using SigV4, you must specify a 'host' parameter.

You need to... WHAT? Host parameter? More digging in the code. You learn that S3Connection would like a host parameter but normally defaults to s3.amazonaws.com - with SigV4 enabled it refuses to guess and fails instead. So let's say you feed it the default explicitly. What happens?

s3-4.py (Source)

# Same as s3-2.py, but with an explicit host parameter.
def main():
    s3 = boto.connect_s3(
        host='s3.amazonaws.com',
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=SECRET_ACCESS_KEY)
    buckets = list_all_buckets(s3)

pa15(~/s3) $ S3_USE_SIGV4=1 python s3-4.py
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 1144, in get_location
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AuthorizationHeaderMalformed</Code><Message>
The authorization header is malformed;
the region 'us-east-1' is wrong;
expecting 'eu-west-1'</Message>
<Region>eu-west-1</Region>
<RequestId>BE16C98097C12FA6</RequestId><HostId>+neOBLsDWqNiCoQxZn+1yK4slOGmX6EvBBc5CuHtycFBzJiDBsbJLYK5QAkhkUhLAWUOngKCeCk=</HostId></Error>

Oh right, the bucket is in eu-west-1 (a bucket lives in exactly one region, after all) and you implicitly passed us-east-1. So you hunt down the Amazon S3 endpoint list and learn that s3.amazonaws.com means us-east-1, and that location EU means eu-west-1. To add a comical element to all this, S3 in Frankfurt is the only region where you can use one of two endpoints:

  • s3.eu-central-1.amazonaws.com
  • s3-eu-central-1.amazonaws.com
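
Both spellings trip the SIGV4_DETECT list from the boto excerpt above - that is what its "both host styles" comment is about. You can convince yourself in a few lines:

# Both Frankfurt endpoint spellings match boto's SIGV4_DETECT list.
SIGV4_DETECT = ['.cn-', '.eu-central', '-eu-central']
for host in ('s3.eu-central-1.amazonaws.com',
             's3-eu-central-1.amazonaws.com'):
    print(any(test in host for test in SIGV4_DETECT))  # True, True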

Hot coffee all day for whoever has a compelling argument why that was a great idea. Anyway, it won't help you, because all you learn is: you now need to specify the host depending on the bucket location. Otherwise:

s3-6.xml (Source)

<Error>
    <Code>PermanentRedirect</Code>
    <Message>The bucket you are attempting to access
        must be addressed using the specified endpoint.
        Please send all future requests to this endpoint.
    </Message>
    <Bucket>somebucket</Bucket>
    <Endpoint>somebucket.s3-us-west-2.amazonaws.com</Endpoint>
    ...
</Error>

Let's reflect for a moment: you got a list of bucket instances from s3.get_all_buckets(), but you can't actually access them through those instances, because... computers.
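
To give an idea of the direction this pushes you in (a sketch, not the promised write-up - connection_for is my own hypothetical helper, and the host scheme leans on the endpoint list above):

import os

import boto

# Ask for SigV4 everywhere; boto then insists on an explicit host.
os.environ['S3_USE_SIGV4'] = '1'

def connection_for(region, access_key, secret_key):
    # Build a connection against the endpoint the bucket's region
    # demands. us-east-1 keeps the classic host; the dash-style
    # host also matches boto's SIGV4_DETECT list for eu-central-1.
    if region == 'us-east-1':
        host = 's3.amazonaws.com'
    else:
        host = 's3-{}.amazonaws.com'.format(region)
    return boto.connect_s3(
        host=host,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key)

Combined with something like bucket_region() from earlier, every bucket gets listed through a connection built for its own region - one connection per region instead of one for the whole account.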

To be continued

I do have a solution and hope to write it up in the future. The point I am trying to make is:

The developer experience of trying to access a diverse set of buckets in an S3 account using boto is WTF-worthy. Boto sits at the center of it all, but let's be realistic here: Amazon obviously needs to address this in any way they can. So please, use some of the money I give you for storage and make this embarrassment a thing of the past!

[ AWS   S3 ]