Each rank loads a full copy of the model

### Describe the issue

Hello. Currently, each rank loads the complete model weights and then slices out its own tensor shards. With eight ranks, for example, the full model is loaded eight times, so the same data is duplicated in memory. This wastes a large amount of memory and can even cause out-of-memory errors. Is there a plan to change this so that the model data is loaded only once for all ranks?
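
To illustrate the kind of behavior I am asking about, here is a minimal sketch in plain Python. It is not the project's loader: the flat float32 file layout and the names `write_weights`, `load_shard`, and `demo` are hypothetical, and the ranks are simulated in a single process. Each rank seeks to its own byte range and reads only that slice, so no rank ever holds the full weights in memory.

```python
import os
import struct
import tempfile


def write_weights(path, values):
    """Write a flat list of floats to disk as little-endian float32."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def load_shard(path, rank, world_size):
    """Read only the contiguous slice of the weights that belongs to `rank`."""
    item_size = struct.calcsize("<f")
    total = os.path.getsize(path) // item_size
    # Assumption: the element count divides evenly across ranks.
    per_rank = total // world_size
    start = rank * per_rank
    with open(path, "rb") as f:
        f.seek(start * item_size)            # skip the other ranks' data
        raw = f.read(per_rank * item_size)   # read this rank's slice only
    return list(struct.unpack(f"<{per_rank}f", raw))


def demo(world_size=8, total=64):
    """Simulate `world_size` ranks, each loading its own shard from one file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_weights(path, [float(i) for i in range(total)])
        return [load_shard(path, r, world_size) for r in range(world_size)]
```

In this sketch the peak memory per rank is roughly 1/8 of the file for eight ranks, instead of the full file. A real checkpoint would need per-tensor offsets rather than one flat array, so this only shows the idea.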