Skip to content

Commit 4d00c59

Browse files
committed
Cleanup the memory handling for temporary buffers in
some of the collective modules. Added a new function opan_datatype_span, to compute the memory span of count number of datatype, excluding the gaps in the beginning and at the end. If a memory allocation is made using the returned value, the gap (also returned) should be removed from the allocated pointer.
1 parent 3def93a commit 4d00c59

23 files changed

+320
-342
lines changed
Lines changed: 124 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,124 @@
1+
/* This comment applies to all collectives (including the basic
2+
* module) where we allocate a temporary buffer. For the next few
3+
* lines of code, it's tremendously complicated how we decided that
4+
* this was the Right Thing to do. Sit back and enjoy. And prepare
5+
* to have your mind warped. :-)
6+
*
7+
* Recall some definitions (I always get these backwards, so I'm
8+
* going to put them here):
9+
*
10+
* extent: the length from the lower bound to the upper bound -- may
11+
* be considerably larger than the buffer required to hold the data
12+
* (or smaller! But it's easiest to think about when it's larger).
13+
*
14+
* true extent: the exact number of bytes required to hold the data
15+
* in the layout pattern in the datatype.
16+
*
17+
* For example, consider the following buffer (just talking about
18+
* true_lb, extent, and true extent -- extrapolate for true_ub:
19+
*
20+
* A B C
21+
* --------------------------------------------------------
22+
* | | |
23+
* --------------------------------------------------------
24+
*
25+
* There are multiple cases:
26+
*
27+
* 1. A is what we give to MPI_Send (and friends), and A is where
28+
* the data starts, and C is where the data ends. In this case:
29+
*
30+
* - extent: C-A
31+
* - true extent: C-A
32+
* - true_lb: 0
33+
*
34+
* A C
35+
* --------------------------------------------------------
36+
* | |
37+
* --------------------------------------------------------
38+
* <=======================extent=========================>
39+
* <======================true extent=====================>
40+
*
41+
* 2. A is what we give to MPI_Send (and friends), B is where the
42+
* data starts, and C is where the data ends. In this case:
43+
*
44+
* - extent: C-A
45+
* - true extent: C-B
46+
* - true_lb: positive
47+
*
48+
* A B C
49+
* --------------------------------------------------------
50+
* | | User buffer |
51+
* --------------------------------------------------------
52+
* <=======================extent=========================>
53+
* <===============true extent=============>
54+
*
55+
* 3. B is what we give to MPI_Send (and friends), A is where the
56+
* data starts, and C is where the data ends. In this case:
57+
*
58+
* - extent: C-A
59+
* - true extent: C-A
60+
* - true_lb: negative
61+
*
62+
* A B C
63+
* --------------------------------------------------------
64+
* | | User buffer |
65+
* --------------------------------------------------------
66+
* <=======================extent=========================>
67+
* <======================true extent=====================>
68+
*
69+
* 4. MPI_BOTTOM is what we give to MPI_Send (and friends), B is
70+
* where the data starts, and C is where the data ends. In this
71+
* case:
72+
*
73+
* - extent: C-MPI_BOTTOM
74+
* - true extent: C-B
75+
* - true_lb: [potentially very large] positive
76+
*
77+
* MPI_BOTTOM B C
78+
* --------------------------------------------------------
79+
* | | User buffer |
80+
* --------------------------------------------------------
81+
* <=======================extent=========================>
82+
* <===============true extent=============>
83+
*
84+
* So in all cases, for a temporary buffer, all we need to malloc()
85+
* is a buffer of size true_extent. We therefore need to know two
86+
* pointer values: what value to give to MPI_Send (and friends) and
87+
* what value to give to free(), because they might not be the same.
88+
*
89+
* Clearly, what we give to free() is exactly what was returned from
90+
* malloc(). That part is easy. :-)
91+
*
92+
* What we give to MPI_Send (and friends) is a bit more complicated.
93+
* Let's take the 4 cases from above:
94+
*
95+
* 1. If A is what we give to MPI_Send and A is where the data
96+
* starts, then clearly we give to MPI_Send what we got back from
97+
* malloc().
98+
*
99+
* 2. If B is what we get back from malloc, but we give A to
100+
* MPI_Send, then the buffer range [A,B) represents "dead space"
101+
* -- no data will be put there. So it's safe to give B-true_lb to
102+
* MPI_Send. More specifically, the true_lb is positive, so B-true_lb is
103+
* actually A.
104+
*
105+
* 3. If A is what we get back from malloc, and B is what we give to
106+
* MPI_Send, then the true_lb is negative, so A-true_lb will actually equal
107+
* B.
108+
*
109+
* 4. Although this seems like the weirdest case, it's actually
110+
* quite similar to case #2 -- the pointer we give to MPI_Send is
111+
* smaller than the pointer we got back from malloc().
112+
*
113+
* Hence, in all cases, we give (return_from_malloc - true_lb) to MPI_Send.
114+
*
115+
* This works fine and dandy if we only have (count==1), which we
116+
* rarely do. ;-) So we really need to allocate (true_extent +
117+
* ((count - 1) * extent)) to get enough space for the rest. This may
118+
* be more than is necessary, but it's ok.
119+
*
120+
* Simple, no? :-)
121+
*
122+
*/
123+
124+

ompi/mca/coll/base/coll_base_allgather.c

Lines changed: 4 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -167,19 +167,16 @@ int ompi_coll_base_allgather_intra_bruck(const void *sbuf, int scount,
167167
- copy blocks from shift buffer starting at block [rank] in rbuf.
168168
*/
169169
if (0 != rank) {
170-
ptrdiff_t true_extent, true_lb;
171170
char *free_buf = NULL, *shift_buf = NULL;
171+
ptrdiff_t span, gap;
172172

173-
err = ompi_datatype_get_true_extent(rdtype, &true_lb, &true_extent);
174-
if (MPI_SUCCESS != err) { line = __LINE__; goto err_hndl; }
173+
span = opal_datatype_span(&rdtype->super, (size - rank) * rcount, &gap);
175174

176-
free_buf = (char*) calloc(((true_extent +
177-
((ptrdiff_t)(size - rank) * (ptrdiff_t)rcount - 1) * rext)),
178-
sizeof(char));
175+
free_buf = (char*)calloc(span, sizeof(char));
179176
if (NULL == free_buf) {
180177
line = __LINE__; err = OMPI_ERR_OUT_OF_RESOURCE; goto err_hndl;
181178
}
182-
shift_buf = free_buf - true_lb;
179+
shift_buf = free_buf - gap;
183180

184181
/* 1. copy blocks [0 .. (size - rank - 1)] from rbuf to shift buffer */
185182
err = ompi_datatype_copy_content_same_ddt(rdtype, ((ptrdiff_t)(size - rank) * (ptrdiff_t)rcount),

ompi/mca/coll/base/coll_base_allreduce.c

Lines changed: 10 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -135,8 +135,8 @@ ompi_coll_base_allreduce_intra_recursivedoubling(const void *sbuf, void *rbuf,
135135
int ret, line, rank, size, adjsize, remote, distance;
136136
int newrank, newremote, extra_ranks;
137137
char *tmpsend = NULL, *tmprecv = NULL, *tmpswap = NULL, *inplacebuf = NULL;
138-
ptrdiff_t true_lb, true_extent, lb, extent;
139138
ompi_request_t *reqs[2] = {NULL, NULL};
139+
OPAL_PTRDIFF_TYPE span, gap;
140140

141141
size = ompi_comm_size(comm);
142142
rank = ompi_comm_rank(comm);
@@ -154,12 +154,8 @@ ompi_coll_base_allreduce_intra_recursivedoubling(const void *sbuf, void *rbuf,
154154
}
155155

156156
/* Allocate and initialize temporary send buffer */
157-
ret = ompi_datatype_get_extent(dtype, &lb, &extent);
158-
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
159-
ret = ompi_datatype_get_true_extent(dtype, &true_lb, &true_extent);
160-
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
161-
162-
inplacebuf = (char*) malloc(true_extent + (ptrdiff_t)(count - 1) * extent);
157+
span = opal_datatype_span(&dtype->super, count, &gap);
158+
inplacebuf = (char*) malloc(span);
163159
if (NULL == inplacebuf) { ret = -1; line = __LINE__; goto error_hndl; }
164160

165161
if (MPI_IN_PLACE == sbuf) {
@@ -629,9 +625,9 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
629625
int segcount, max_segcount, num_phases, phase, block_count, inbi;
630626
size_t typelng;
631627
char *tmpsend = NULL, *tmprecv = NULL, *inbuf[2] = {NULL, NULL};
632-
ptrdiff_t true_lb, true_extent, lb, extent;
633628
ptrdiff_t block_offset, max_real_segsize;
634629
ompi_request_t *reqs[2] = {NULL, NULL};
630+
OPAL_PTRDIFF_TYPE lb, extent, gap;
635631

636632
size = ompi_comm_size(comm);
637633
rank = ompi_comm_rank(comm);
@@ -649,10 +645,6 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
649645
}
650646

651647
/* Determine segment count based on the suggested segment size */
652-
ret = ompi_datatype_get_extent(dtype, &lb, &extent);
653-
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
654-
ret = ompi_datatype_get_true_extent(dtype, &true_lb, &true_extent);
655-
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
656648
ret = ompi_datatype_type_size( dtype, &typelng);
657649
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
658650
segcount = count;
@@ -685,7 +677,10 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
685677
early_blockcount, late_blockcount );
686678
COLL_BASE_COMPUTE_BLOCKCOUNT( early_blockcount, num_phases, inbi,
687679
max_segcount, k);
688-
max_real_segsize = true_extent + (ptrdiff_t)(max_segcount - 1) * extent;
680+
681+
ret = ompi_datatype_get_extent(dtype, &lb, &extent);
682+
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
683+
max_real_segsize = opal_datatype_span(&dtype->super, max_segcount, &gap);
689684

690685
/* Allocate and initialize temporary buffers */
691686
inbuf[0] = (char*)malloc(max_real_segsize);
@@ -740,8 +735,8 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
740735
block_count = ((rank < split_rank)? early_blockcount : late_blockcount);
741736
COLL_BASE_COMPUTE_BLOCKCOUNT(block_count, num_phases, split_phase,
742737
early_phase_segcount, late_phase_segcount)
743-
phase_count = ((phase < split_phase)?
744-
(early_phase_segcount) : (late_phase_segcount));
738+
phase_count = ((phase < split_phase)?
739+
(early_phase_segcount) : (late_phase_segcount));
745740
phase_offset = ((phase < split_phase)?
746741
((ptrdiff_t)phase * (ptrdiff_t)early_phase_segcount) :
747742
((ptrdiff_t)phase * (ptrdiff_t)late_phase_segcount + split_phase));

ompi/mca/coll/base/coll_base_alltoall.c

Lines changed: 9 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -43,10 +43,10 @@ mca_coll_base_alltoall_intra_basic_inplace(const void *rbuf, int rcount,
4343
{
4444
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
4545
int i, j, size, rank, err = MPI_SUCCESS, line;
46+
OPAL_PTRDIFF_TYPE ext, gap;
4647
MPI_Request *preq;
4748
char *tmp_buffer;
4849
size_t max_size;
49-
ptrdiff_t ext, true_lb, true_ext;
5050

5151
/* Initialize. */
5252

@@ -60,14 +60,14 @@ mca_coll_base_alltoall_intra_basic_inplace(const void *rbuf, int rcount,
6060

6161
/* Find the largest receive amount */
6262
ompi_datatype_type_extent (rdtype, &ext);
63-
ompi_datatype_get_true_extent ( rdtype, &true_lb, &true_ext);
64-
max_size = true_ext + ext * (rcount-1);
63+
max_size = opal_datatype_span(&rdtype->super, rcount, &gap);
6564

6665
/* Allocate a temporary buffer */
6766
tmp_buffer = calloc (max_size, 1);
6867
if (NULL == tmp_buffer) {
6968
return OMPI_ERR_OUT_OF_RESOURCE;
7069
}
70+
tmp_buffer -= gap;
7171
max_size = ext * rcount;
7272

7373
/* in-place alltoall slow algorithm (but works) */
@@ -199,7 +199,7 @@ int ompi_coll_base_alltoall_intra_bruck(const void *sbuf, int scount,
199199
int i, k, line = -1, rank, size, err = 0;
200200
int sendto, recvfrom, distance, *displs = NULL, *blen = NULL;
201201
char *tmpbuf = NULL, *tmpbuf_free = NULL;
202-
ptrdiff_t rlb, slb, tlb, sext, rext, tsext;
202+
OPAL_PTRDIFF_TYPE sext, rext, span, gap;
203203
struct ompi_datatype_t *new_ddt;
204204

205205
if (MPI_IN_PLACE == sbuf) {
@@ -213,25 +213,23 @@ int ompi_coll_base_alltoall_intra_bruck(const void *sbuf, int scount,
213213
OPAL_OUTPUT((ompi_coll_base_framework.framework_output,
214214
"coll:base:alltoall_intra_bruck rank %d", rank));
215215

216-
err = ompi_datatype_get_extent (sdtype, &slb, &sext);
216+
err = ompi_datatype_type_extent (sdtype, &sext);
217217
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
218218

219-
err = ompi_datatype_get_true_extent(sdtype, &tlb, &tsext);
220-
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
221-
222-
err = ompi_datatype_get_extent (rdtype, &rlb, &rext);
219+
err = ompi_datatype_type_extent (rdtype, &rext);
223220
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
224221

222+
span = opal_datatype_span(&sdtype->super, size * scount, &gap);
225223

226224
displs = (int *) malloc(size * sizeof(int));
227225
if (displs == NULL) { line = __LINE__; err = -1; goto err_hndl; }
228226
blen = (int *) malloc(size * sizeof(int));
229227
if (blen == NULL) { line = __LINE__; err = -1; goto err_hndl; }
230228

231229
/* tmp buffer allocation for message data */
232-
tmpbuf_free = (char *) malloc(tsext + ((ptrdiff_t)scount * (ptrdiff_t)size - 1) * sext);
230+
tmpbuf_free = (char *)malloc(span);
233231
if (tmpbuf_free == NULL) { line = __LINE__; err = -1; goto err_hndl; }
234-
tmpbuf = tmpbuf_free - slb;
232+
tmpbuf = tmpbuf_free - gap;
235233

236234
/* Step 1 - local rotation - shift up by rank */
237235
err = ompi_datatype_copy_content_same_ddt (sdtype,

ompi/mca/coll/base/coll_base_alltoallv.c

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -38,16 +38,16 @@
3838

3939
int
4040
mca_coll_base_alltoallv_intra_basic_inplace(const void *rbuf, const int *rcounts, const int *rdisps,
41-
struct ompi_datatype_t *rdtype,
42-
struct ompi_communicator_t *comm,
43-
mca_coll_base_module_t *module)
41+
struct ompi_datatype_t *rdtype,
42+
struct ompi_communicator_t *comm,
43+
mca_coll_base_module_t *module)
4444
{
4545
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
4646
int i, j, size, rank, err=MPI_SUCCESS;
4747
MPI_Request *preq;
4848
char *tmp_buffer;
4949
size_t max_size, rdtype_size;
50-
ptrdiff_t ext;
50+
OPAL_PTRDIFF_TYPE ext, gap;
5151

5252
/* Initialize. */
5353

@@ -63,16 +63,17 @@ mca_coll_base_alltoallv_intra_basic_inplace(const void *rbuf, const int *rcounts
6363
/* Find the largest receive amount */
6464
ompi_datatype_type_extent (rdtype, &ext);
6565
for (i = 0, max_size = 0 ; i < size ; ++i) {
66-
size_t size = ext * rcounts[i];
67-
66+
size_t size = opal_datatype_span(&rdtype->super, rcounts[i], &gap);
6867
max_size = size > max_size ? size : max_size;
6968
}
69+
/* The gap will always be the same as we are working on the same datatype */
7070

7171
/* Allocate a temporary buffer */
7272
tmp_buffer = calloc (max_size, 1);
7373
if (NULL == tmp_buffer) {
7474
return OMPI_ERR_OUT_OF_RESOURCE;
7575
}
76+
tmp_buffer += gap;
7677

7778
/* in-place alltoallv slow algorithm (but works) */
7879
for (i = 0 ; i < size ; ++i) {

ompi/mca/coll/base/coll_base_gather.c

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -49,8 +49,8 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
4949
char *ptmp = NULL, *tempbuf = NULL;
5050
ompi_coll_tree_t* bmtree;
5151
MPI_Status status;
52-
MPI_Aint sextent, slb, strue_lb, strue_extent;
53-
MPI_Aint rextent, rlb, rtrue_lb, rtrue_extent;
52+
MPI_Aint sextent, sgap, ssize;
53+
MPI_Aint rextent, rgap, rsize;
5454
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
5555
mca_coll_base_comm_t *data = base_module->base_data;
5656

@@ -64,14 +64,14 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
6464
COLL_BASE_UPDATE_IN_ORDER_BMTREE( comm, base_module, root );
6565
bmtree = data->cached_in_order_bmtree;
6666

67-
ompi_datatype_get_extent(sdtype, &slb, &sextent);
68-
ompi_datatype_get_true_extent(sdtype, &strue_lb, &strue_extent);
67+
ompi_datatype_type_extent(sdtype, &sextent);
68+
ompi_datatype_type_extent(rdtype, &rextent);
69+
ssize = opal_datatype_span(&sdtype->super, scount * size, &sgap);
70+
rsize = opal_datatype_span(&rdtype->super, rcount * size, &rgap);
6971

7072
vrank = (rank - root + size) % size;
7173

7274
if (rank == root) {
73-
ompi_datatype_get_extent(rdtype, &rlb, &rextent);
74-
ompi_datatype_get_true_extent(rdtype, &rtrue_lb, &rtrue_extent);
7575
if (0 == root){
7676
/* root on 0, just use the recv buffer */
7777
ptmp = (char *) rbuf;
@@ -83,12 +83,12 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
8383
} else {
8484
/* root is not on 0, allocate temp buffer for recv,
8585
* rotate data at the end */
86-
tempbuf = (char *) malloc(rtrue_extent + ((ptrdiff_t)rcount * (ptrdiff_t)size - 1) * rextent);
86+
tempbuf = (char *) malloc(rsize);
8787
if (NULL == tempbuf) {
8888
err= OMPI_ERR_OUT_OF_RESOURCE; line = __LINE__; goto err_hndl;
8989
}
9090

91-
ptmp = tempbuf - rtrue_lb;
91+
ptmp = tempbuf - rgap;
9292
if (sbuf != MPI_IN_PLACE) {
9393
/* copy from sbuf to temp buffer */
9494
err = ompi_datatype_sndrcv((void *)sbuf, scount, sdtype,
@@ -106,12 +106,12 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
106106
/* other non-leaf nodes, allocate temp buffer for data received from
107107
* children, the most we need is half of the total data elements due
108108
* to the property of binimoal tree */
109-
tempbuf = (char *) malloc(strue_extent + ((ptrdiff_t)scount * (ptrdiff_t)size - 1) * sextent);
109+
tempbuf = (char *) malloc(ssize);
110110
if (NULL == tempbuf) {
111111
err= OMPI_ERR_OUT_OF_RESOURCE; line = __LINE__; goto err_hndl;
112112
}
113113

114-
ptmp = tempbuf - strue_lb;
114+
ptmp = tempbuf - sgap;
115115
/* local copy to tempbuf */
116116
err = ompi_datatype_sndrcv((void *)sbuf, scount, sdtype,
117117
ptmp, scount, sdtype);

0 commit comments

Comments
 (0)