
Allow exporting items data in CSV format #100

Open
@stav

Description

I am using the client to request items, but I would like to limit the fields returned because some of them are far too big. I only need a few fields, but I want all the items. For example, this works fine for CSV when I declare the fields parameter:

$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"

"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"

But when I try it with the client I get:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue

scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines

Ok, so let's try without msgpack:

>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
    yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)

So the problem is that the client assumes the response is JSON and tries to decode the string:

'"Ole Miss","Vaught Hemingway Stadium"'

Ok, let's try it with JSON now:

>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items

[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
   'venue_address': 'All-American Dr, University, MS 38677, EUA',
   'date': 1542857400000.0,...

Well, there's no error, but we get all the fields instead of just the two we requested; effectively, the fields parameter is ignored.
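Until the fields parameter is honored for JSON, the only client-side option seems to be fetching everything and trimming afterwards, which of course still transfers the oversized fields. A rough sketch, assuming list() without a format override yields plain dicts:

wanted = ('name', 'venue')
items = job.items.list()  # full items, every field included
slim = [{k: it[k] for k in wanted if k in it} for it in items]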

So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the CSV format and bypass the JSON decoding (see the sketch below), but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.
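I haven't traced the real code path, so this is only the hypothetical shape of such a patch, not the actual resourcetype.py code (_do_request is a made-up stand-in for however the request is actually issued):

def apirequest(self, _path=None, **kwargs):
    response = self._do_request(_path, **kwargs)  # hypothetical helper
    if kwargs.get('params', {}).get('format') == 'csv':
        # CSV is not line-delimited JSON: hand the body back verbatim
        # instead of feeding it to jldecode() in serialization.py.
        return response.text
    return jldecode(response.iter_lines())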

I see that the API already supports max_fields, and we know that CSV supports field limiting, so maybe it is not a big deal to get the API to support field limiting for JSON as well.
