TST: Change test_html to use stored data + mark other #4009


Merged: 1 commit, Jul 3, 2013

Conversation

jtratner
Contributor

Changed tests that were previously double-decorated as both network and slow so they actually get run, and stored the usda_data locally so the tests don't fail on network access.

@cpcloud
Member

cpcloud commented Jun 24, 2013

thanks, must have missed that when i wrote them. i think slow should be removed since either slow or network tests (but not both) get run on travis iirc
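The double-decoration problem can be sketched abstractly (a toy model of the selection logic being described, not pandas' actual Travis configuration; all names here are made up):

```python
# Toy model: if CI runs a "slow" selection and a "network" selection,
# each excluding the other mark, a test carrying both marks never runs.
tests = {
    "test_spam_lxml": {"slow", "network"},  # double-decorated
    "test_banklist_local": {"slow"},
}

def runs_in(marks, selected, excluded):
    """True when a test is picked up by the given selection."""
    return selected in marks and excluded not in marks

slow_run = {t for t, m in tests.items() if runs_in(m, "slow", "network")}
network_run = {t for t, m in tests.items() if runs_in(m, "network", "slow")}
# "test_spam_lxml" appears in neither run, which is the bug the PR fixes
```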

@cpcloud
Member

cpcloud commented Jun 24, 2013

also there's already spam.html in pandas/io/tests/data. i explicitly want to connect to the network here just in case something has changed, although there's no reason to believe the composition of spam will change anytime soon. so probably good to use spam.html

@jtratner
Contributor Author

Okay, I'll make that change. Do you want to store data locally for these or leave with the network request? html file is 32K

@cpcloud
Member

cpcloud commented Jun 24, 2013

the banklist data set for example is updated every monday, so a local copy is basically frozen in time and you cannot compare the data obtained over the network with what is stored locally. i get 28k (for spam) but that's splitting hairs... probably ok to leave local

@cpcloud
Member

cpcloud commented Jun 24, 2013

i did `du -h $(find . -name 'spam.html')`

@jtratner
Contributor Author

Huh? Currently it's not local for the html file test. (most of the other test cases are). Plus these are basically just checks that lxml and BeautifulSoup actually work as expected.


@cpcloud
Member

cpcloud commented Jun 24, 2013

it's ok to change to local if u want that's what i meant

@jtratner
Contributor Author

Sorry for the double comment there - poor connection. The html file that the lxml and bs4 tests check for is 32k.

@jtratner
Contributor Author

Anyways I don't have a strong opinion. I'll edit this to remove slow soon.

@cpcloud
Member

cpcloud commented Jun 24, 2013

oh u mean the one u added? or the one that is received by urlopen? i'm confused....all i'm saying is that if you want to change to use the local files instead of calling the network you can...they are there, spam is called spam.html and the banklist data set is called banklist.html...make sense?

@jtratner
Contributor Author

@cpcloud yeah, that does. just wondering if you think it's worth it - sorry I wasn't clear earlier

@cpcloud
Member

cpcloud commented Jun 24, 2013

Since these are essentially smoke tests for the backend, you're right: best to move to slow decorators and remove the network decorators.

@jtratner
Contributor Author

one big reason why these take so long is that they are processing a huge file. Is that necessary or can I reduce it to just the relevant portion (i.e., `<html><head></head><body><table>TABLESTUFFHERE</table></body></html>`)?


@cpcloud
Member

cpcloud commented Jun 24, 2013

wanted to test on possibly invalid html from the wild. if you can find a smaller file that works that's great too

@jtratner
Contributor Author

@cpcloud hmm...well, just going to leave it as slow I guess. Sort of a bummer that it can't be both.

@cpcloud
Member

cpcloud commented Jun 25, 2013

if you come across another data set in the wild please put up a PR, would love to have both just don't have time to check right now for data sets from pages that are both small and rife with HTML errors. this scraping framework might be of use for finding all kinds of different HTML tables.

@jtratner
Contributor Author

@cpcloud Changed the double-decorated tests (previously both network and slow) so they actually get run, and stored the usda_data locally so the tests don't fail on network access. The html file is 32k, not sure whether that's too large.

@cpcloud
Member

cpcloud commented Jun 25, 2013

i'm confused..why not just use the spam data set that's already there?

@jtratner
Contributor Author

@cpcloud ohh, good point. Somehow I thought there was some decision to use a different file.

@jtratner
Contributor Author

changed it.

```python
url = ('http://ndb.nal.usda.gov/ndb/foods/show/1732?fg=&man=&'
       'lfacet=&format=&count=&max=25&offset=&sort=&qlookup=spam')
assert get_lxml_elements(url, 'tbody')
filepath = "file://" + os.path.join(DATA_PATH, "spam.html")
```
Member

is the explicit file protocol necessary here?

Contributor Author

yes...doesn't work otherwise. (because it gets passed directly to the functions)

Member

you know what, can u change `get_lxml_elements` to take a kwarg called `base_url='file://'`, then pass that to the parse call as base_url like so: `parse(url, base_url=base_url)`? then you don't have to concat with strings
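As a retrospective aside: on modern Python, one stdlib way to build the `file://` URL without manual string concatenation is `pathlib` (an illustrative alternative only; `pathlib` did not exist in the Python 2 of this thread, and the path below is made up):

```python
from pathlib import Path

# Build a file:// URL from an absolute path (illustrative path;
# the real test uses DATA_PATH/spam.html).
filepath = Path("/tmp/data/spam.html").as_uri()
# -> "file:///tmp/data/spam.html" (as_uri requires an absolute path)
```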

@cpcloud
Member

cpcloud commented Jun 27, 2013

@jreback look ok? i think this can be merged

@jreback
Contributor

jreback commented Jun 27, 2013

looks ok

@jtratner
Contributor Author

Does this need rebase?

@jtratner
Contributor Author

@cpcloud more fun with network failures! :D https://travis-ci.org/jtratner/pandas/jobs/8559725 [pretty sure this is off of master and not due to this commit]

@cpcloud
Member

cpcloud commented Jun 29, 2013

darn. that is off of master. i'm not even sure what the issue is there. i'll delve deeper. i really wish i had a way to reproduce this stuff. i may end up just setting up my own CI on my machines at home

@jtratner
Contributor Author

@cpcloud Maybe you could set up something that tracks all connections open at any one time (maybe mod urlopen to do it?) then, when it raises an error, have it print the entire set of open network connections.

@cpcloud
Member

cpcloud commented Jun 29, 2013

there is no hope. will just catch this exception...teh interwebz r tricky

@jtratner
Contributor Author

@cpcloud that seems like the best bet here. Given the confusing-ness of the error, definitely something that the datareaders should handle themselves.

@cpcloud
Member

cpcloud commented Jun 29, 2013

`BadStatusLine` exceptions are just like throwing `Exception`: completely uninformative if you want to use it to handle things gracefully

@jtratner
Contributor Author

I don't know what you are talking about. BadStatusLine exceptions indicate that there was a bad status line. Duh. ;P


@cpcloud
Member

cpcloud commented Jun 29, 2013

ha. they are so specific with their empty string error message too

@jtratner
Contributor Author

jtratner commented Jul 1, 2013

@cpcloud can this (or something like it) get merged soon? These tests failed loudly when no internet available...

@cpcloud
Member

cpcloud commented Jul 1, 2013

Sure maybe later tonight if not early tmrw morning

@cpcloud
Member

cpcloud commented Jul 2, 2013

@jtratner can u rebase? once travis passes i can merge. sorry this fell by the wayside a bit.

@jtratner
Contributor Author

jtratner commented Jul 2, 2013

@cpcloud I thought this was handled by your data.py edit, which is why I wasn't thinking about it - didn't even notice anything until I had the internet turned off :P

@cpcloud
Member

cpcloud commented Jul 2, 2013

No problem. I discovered yet another failure in data.py but I'm waiting to see if I can reproduce it before making an issue. I feel like there has to be a better (meaning more reliable) approach. I wonder if yahoo and/or Google have an api for this stuff, more so than just urlopening the CSV.

+ Change up test decorators so no tests marked as network and slow.
@jtratner
Contributor Author

jtratner commented Jul 2, 2013

rebased ftw

cpcloud added a commit that referenced this pull request Jul 3, 2013
TST: Change test_html to use stored data + mark other
@cpcloud cpcloud merged commit 0523dc6 into pandas-dev:master Jul 3, 2013
@cpcloud
Member

cpcloud commented Jul 3, 2013

thanks!

@jtratner jtratner deleted the add-network-label-to-test-html branch September 21, 2013 20:11