نتائج البحث - "Zhou, Zhenglei"

1

تقرير

AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

المؤلفون: Xiao, Youshao, Ju, Lin, Zhou, Zhenglei, Li, Siyuan, Huan, Zhaoxin, Zhang, Dalong, Jiang, Rujie, Wang, Lin, Zhang, Xiaolu, Liang, Lei, Zhou, Jun

مصطلحات موضوعية: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning

الوصف: Many distributed training techniques like Parameter Server and AllReduce have been proposed to take advantage of the increasingly large data and rich features. However, stragglers frequently occur in distributed training due to resource contention and hardware heterogeneity, which significantly hampers the training efficiency. Previous works only address part of the stragglers and could not adaptively solve various stragglers in practice. Additionally, it is challenging to use a systematic framework to address all stragglers because different stragglers require diverse data allocation and fault-tolerance mechanisms. Therefore, this paper proposes a unified distributed training framework called AntDT (Ant Distributed Training Framework) to adaptively solve the straggler problems. Firstly, the framework consists of four components, including the Stateful Dynamic Data Sharding service, Monitor, Controller, and Agent. These components work collaboratively to efficiently distribute workloads and provide a range of pre-defined straggler mitigation methods with fault tolerance, thereby hiding messy details of data allocation and fault handling. Secondly, the framework provides a high degree of flexibility, allowing for the customization of straggler mitigation solutions based on the specific circumstances of the cluster. Leveraging this flexibility, we introduce two straggler mitigation solutions, namely AntDT-ND for non-dedicated clusters and AntDT-DD for dedicated clusters, as practical examples to resolve various types of stragglers at Ant Group. Justified by our comprehensive experiments and industrial deployment statistics, AntDT outperforms other SOTA methods more than 3x in terms of training efficiency. Additionally, in Alipay's homepage recommendation scenario, using AntDT reduces the training duration of the ranking model from 27.8 hours to just 5.4 hours.

الوصول الحر: http://arxiv.org/abs/2404.09679Test

View record in Arxiv

2

تقرير

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

المؤلفون: Xiao, Youshao, Zhao, Shangchun, Zhou, Zhenglei, Huan, Zhaoxin, Ju, Lin, Zhang, Xiaolu, Wang, Lin, Zhou, Jun

مصطلحات موضوعية: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Information Retrieval

الوصف: Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, the existing systems are not tailored for meta learning based DLRM models and have critical problems regarding efficiency in distributed training in the GPU cluster. It is because the conventional deep learning pipeline is not optimized for two task-specific datasets and two update loops in meta learning. This paper provides a high-performance framework for large-scale training for Optimization-based Meta DLRM models over the \textbf{G}PU cluster, namely \textbf{G}-Meta. Firstly, G-Meta utilizes both data parallelism and model parallelism with careful orchestration regarding computation and communication efficiency, to enable high-speed distributed training. Secondly, it proposes a Meta-IO pipeline for efficient data ingestion to alleviate the I/O bottleneck. Various experimental results show that G-Meta achieves notable training speed without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shrinking the continuous delivery of models by four times. It also obtains 6.48\% improvement in Conversion Rate (CVR) and 1.06\% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, with the benefit of larger training samples and tasks.

الوصول الحر: http://arxiv.org/abs/2401.04338Test

View record in Arxiv

3

تقرير

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

المؤلفون: Xiao, Youshao, Wu, Weichang, Zhou, Zhenglei, Mao, Fagui, Zhao, Shangchun, Ju, Lin, Liang, Lei, Zhang, Xiaolu, Zhou, Jun

مصطلحات موضوعية: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language

الوصف: Recently, ChatGPT or InstructGPT like large language models (LLM) has made a significant impact in the AI world. Many works have attempted to reproduce the complex InstructGPT's training pipeline, namely Reinforcement Learning with Human Feedback (RLHF). However, the mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Flattening strategy. This strategy treats all four interdependent models involved in RLHF as a single entity, distributing them across all devices and applying parallelism techniques designed for a single model, regardless of the different workloads inherent to each model. As a result, this strategy exacerbates the generation bottlenecks in the RLHF training and degrades the overall training efficiency. To address these issues, we propose an adaptive model placement framework that offers two flexible model placement strategies. The Interleaving strategy helps reduce memory redundancy and communication costs of RLHF training by placing models without dependencies on exclusive devices with careful orchestration. On the other hand, the Separation strategy improves the throughput of model training by separating the training and inference runtime of the RLHF pipeline with additional shadow models. Furthermore, our framework provides a simple user interface and allows for the agile allocation of models across devices in a fine-grained manner for various training scenarios, involving models of varying sizes and devices of different scales. Extensive experiments have demonstrated that our Interleaving and Separation strategies can achieve notable improvements up to 11X, compared to the current SOTA approaches. The results highlight the effectiveness and adaptability of our approaches in accelerating the training of distributed RLHF.

الوصول الحر: http://arxiv.org/abs/2312.11819Test

View record in Arxiv