Description
Converting the floating-point value NaN to any integer data type is undefined behavior in C. However, it actually happens in the NumPy extension module, probably because pandas uses NumPy incorrectly. If NumPy is built with Clang+UBSan (http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html), there are error reports indicating the issue.
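To make the rule concrete, here is a minimal sketch of the check that decides whether a double-to-long conversion is well defined. It assumes a 64-bit C `long`, as on the x86_64 Linux build described in this report; `fits_in_c_long` is a hypothetical name used only for illustration and is not part of NumPy or pandas.

```python
import math

# Assumption: C `long` is 64 bits, as on the x86_64 Linux build in this report.
LONG_MIN = -2**63
LONG_MAX = 2**63 - 1


def fits_in_c_long(x):
    """Return True if converting the double x to a C long is well defined."""
    if not math.isfinite(x):  # rejects NaN, +inf and -inf
        return False
    # C truncates toward zero; the truncated value must be representable.
    return LONG_MIN <= math.trunc(x) <= LONG_MAX
```

NaN fails the first test, which matches the "value nan is outside the range of representable values of type 'long'" wording in the UBSan reports.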
This is somewhat tricky to reproduce, because it involves building the NumPy C code with UBSan. The following instructions should work on Ubuntu 14.04:
- Get a recent enough Clang (e.g. 3.8.0rc2 from http://llvm.org/pre-releases/3.8.0/rc2)
- Build NumPy with Clang and float-cast-overflow detection
git clone git://github.com/numpy/numpy.git
cd numpy
CC=clang CXX=clang++ LDSHARED=clang CFLAGS="-fsanitize=float-cast-overflow" python setup.py install
- Fetch latest pandas
- Export the ASan runtime library to provide the UBSan implementation, and set up runtime flags for the sanitizers:
export ASAN_OPTIONS=detect_leaks=0
export UBSAN_OPTIONS=print_stacktrace=1
export LD_PRELOAD=/lib/clang/3.9.0/lib/linux/libclang_rt.asan-x86_64.so
- Build pandas
cd pandas
python setup.py build_ext --inplace
python setup.py install
- Run tests from the test suite triggering the issue
nosetests pandas/tests/test_groupby.py:TestGroupBy.test_agg_nested_dicts
numpy/core/src/multiarray/lowlevel_strided_loops.c.src:865:17: runtime error: value nan is outside the range of representable values of type 'long'
#0 0x7fc581bffe41 in _cast_double_to_long /usr/local/google/numpy/numpy/core/src/multiarray/lowlevel_strided_loops.c.src:865:17
#1 0x7fc581b7faf7 in raw_array_assign_array /usr/local/google/numpy/numpy/core/src/multiarray/array_assign_array.c:96:9
#2 0x7fc581b80355 in PyArray_AssignArray /usr/local/google/numpy/numpy/core/src/multiarray/array_assign_array.c:351:13
#3 0x7fc581c19a80 in array_astype /usr/local/google/numpy/numpy/core/src/multiarray/methods.c:832:13
#4 0x49968c in PyEval_EvalFrameEx (/usr/bin/python2.7+0x49968c)
nosetests pandas/tseries/tests/test_resample.py:TestResample.test_custom_grouper
numpy/core/src/multiarray/lowlevel_strided_loops.c.src:867:22: runtime error: value nan is outside the range of representable values of type 'long'
#0 0x7f04ba194da1 in _aligned_cast_double_to_long /usr/local/google/numpy/numpy/core/src/multiarray/lowlevel_strided_loops.c.src:867:22
#1 0x7f04ba13d5d2 in PyArray_CastRawArrays /usr/local/google/numpy/numpy/core/src/multiarray/dtype_transfer.c:3843:5
#2 0x7f04ba11485f in PyArray_AssignRawScalar /usr/local/google/numpy/numpy/core/src/multiarray/array_assign_scalar.c:248:13
#3 0x7f04ba11f28a in PyArray_FillWithScalar /usr/local/google/numpy/numpy/core/src/multiarray/convert.c:464:19
#4 0x7f04ba1af2a0 in array_fill /usr/local/google/numpy/numpy/core/src/multiarray/methods.c:150:9
#5 0x49968c in PyEval_EvalFrameEx (/usr/bin/python2.7+0x49968c)
It seems that pandas uses np.nan too aggressively: it calls astype(<integer_type>) on arrays that can contain NaNs, and it calls fill(np.nan) on np.empty() arrays of integral types.
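The two patterns above could be guarded on the Python side roughly as follows. This is only a sketch of one possible approach, not what pandas actually does; `astype_int_checked` and `empty_with_missing` are hypothetical helper names introduced for illustration.

```python
import numpy as np


def astype_int_checked(values, dtype=np.int64):
    """Hypothetical helper: cast floats to an integer dtype, refusing NaN/inf."""
    values = np.asarray(values, dtype=np.float64)
    if not np.isfinite(values).all():
        raise ValueError("cannot cast non-finite values to an integer dtype")
    return values.astype(dtype)


def empty_with_missing(shape):
    """Hypothetical helper: use a float array when NaN is the fill value."""
    out = np.empty(shape, dtype=np.float64)
    out.fill(np.nan)  # well defined: NaN is representable in float64
    return out


clean = astype_int_checked([1.0, 2.0, 3.0])
missing = empty_with_missing(3)

try:
    astype_int_checked([1.0, np.nan])
    rejected = False
except ValueError:
    rejected = True
```

This sketch only rejects NaN and infinities; large finite values outside the target integer range would need a separate range check.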