
Allow exporting items data in CSV format #100

Open
@stav

Description

I am using the client to request items, but I would like to limit the fields returned because some of them are far too big. I only need a few fields, but I want all the items. For example, this works fine for CSV when I declare the fields parameter:

$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"

"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"

But when I try it with the client I get:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue

scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines

Ok, so let's try without msgpack:

>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
    yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)

So the problem is that the client assumes the response is JSON and tries to decode the string:

'"Ole Miss","Vaught Hemingway Stadium"'

Ok, let's try it with JSON now:

>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items

[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
   'venue_address': 'All-American Dr, University, MS 38677, EUA',
   'date': 1542857400000.0,...

Well, there's no error, but we get all the fields instead of just the two we requested; effectively, the fields parameter is ignored.
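Until the fields parameter is honored for JSON, the only client-side option seems to be fetching everything and trimming afterwards, which of course still transfers the oversized fields. A rough sketch, assuming list() without a format override yields plain dicts:

wanted = ('name', 'venue')
items = job.items.list()  # full items, every field included
slim = [{k: it[k] for k in wanted if k in it} for it in items]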

So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the CSV format and bypass the JSON decoding (see the sketch below), but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.
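I haven't traced the real code path, so this is only the hypothetical shape of such a patch, not the actual resourcetype.py code (_do_request is a made-up stand-in for however the request is actually issued):

def apirequest(self, _path=None, **kwargs):
    response = self._do_request(_path, **kwargs)  # hypothetical helper
    if kwargs.get('params', {}).get('format') == 'csv':
        # CSV is not line-delimited JSON: hand the body back verbatim
        # instead of feeding it to jldecode() in serialization.py.
        return response.text
    return jldecode(response.iter_lines())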

I see that the API already supports max_fields, and we know that CSV supports field limiting, so maybe it is not a big deal to get the API to support field limiting for JSON as well.
