Description
I am using the client to request items, but I would like to limit the fields returned because some of them are very large. So I only need a few fields, but I want all the items. For example, this works fine for CSV if I declare the fields parameter:
$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"
"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"
But when I try it with the client I get:
Python 3.6.3 (default, Oct 3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])
requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue
scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines
Ok, so let's try without msgpack:
>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])
File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)
So the problem is that the client assumes the response is JSON and tries to decode the string:
'"Ole Miss","Vaught Hemingway Stadium"'
Ok, let's try it with json now:
>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items
[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
'venue_address': 'All-American Dr, University, MS 38677, EUA',
'date': 1542857400000.0,...
Well, there's no error, but we get all the fields instead of just the two we requested; the fields parameter is effectively ignored.
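Until fields works for JSON, the only workaround on this path seems to be dropping the extra fields client-side, which of course doesn't reduce the transfer size at all (note the extra list nesting in the response above, hence items[0]):
>>> wanted = ('name', 'venue')
>>> [{k: item[k] for k in wanted if k in item} for item in items[0]]
[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium'}, ...]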
So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the csv format and bypass the JSON decoding, but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.
I see that the API supports max_fields, and we know that CSV supports field limiting, so maybe it's not a big deal to get the API to support field limiting for JSON as well.
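In the meantime, bypassing the client and calling the endpoint directly with requests does the job, mirroring the curl call above (a minimal sketch, assuming HTTP Basic auth with the API key as the username, which is what curl -u does):
>>> import csv, io, requests
>>> r = requests.get('https://storage.scrapinghub.com/items/244066/83/3',
...                  params={'format': 'csv', 'fields': 'name,venue'},
...                  auth=(APIKEY, ''))
>>> list(csv.reader(io.StringIO(r.text)))[:2]
[['Ole Miss', 'Vaught Hemingway Stadium'], ['Kansas State', 'Bill Snyder Family Stadium']]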