2025-05-31 15:00

RayTurbo Data Enhancements Boost Processing Speed by Fivefold

Rongchai Wang May 20, 2025 05:17

Anyscale's RayTurbo Data introduces significant improvements, offering up to 5x faster data processing. Key features include job-level checkpointing, vectorized aggregations, and optimized pipeline rules.

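Anyscale has unveiled major enhancements to RayTurbo Data, a proprietary data processing platform that promises up to five times faster performance than its open-source counterpart, Ray Data. The improvements aim to reshape large-scale data processing by cutting both processing time and operational risk.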

Job-Level Checkpointing for Enhanced Reliability

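One of the standout features is job-level checkpointing, designed to improve reliability in production environments. It allows inference workloads to resume from the exact point of interruption, whether the interruption was caused by a manual or an automatic cluster shutdown. By preserving the execution state, RayTurbo Data ensures that costly compute resources are not wasted, which helps teams keep tight delivery schedules.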

Unlike the existing Ray Data, which retries individual tasks upon worker node failures, RayTurbo's checkpointing can handle significant disruptions like head node crashes or out-of-memory errors without needing a full restart. This is particularly beneficial for long-running batch inference jobs that process millions of records, where a failure previously meant hours or even days of lost time.
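To make the resume-from-interruption idea concrete, here is a minimal, generic sketch of a checkpointed batch job in plain Python. It is not RayTurbo's implementation, which the article does not describe. The function `run_batch_job`, the `sink` list standing in for durable output storage, the `fail_at` test hook, and the JSON checkpoint layout are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def run_batch_job(records, process, sink, checkpoint_path, batch_size=2, fail_at=None):
    """Process records in batches, committing progress after every batch.

    `sink` stands in for durable output storage (assumption for this sketch).
    `fail_at` is a test hook that simulates a shutdown at the given offset.
    """
    # Resume from the last committed offset if a checkpoint exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    for offset in range(start, len(records), batch_size):
        if fail_at is not None and offset >= fail_at:
            raise RuntimeError("simulated cluster shutdown")
        batch = records[offset:offset + batch_size]
        sink.extend(process(r) for r in batch)
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_offset": offset + len(batch)}, f)
        os.replace(tmp_path, checkpoint_path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    sink = []
    records = list(range(6))
    try:
        run_batch_job(records, lambda x: x * 10, sink, path, fail_at=4)
    except RuntimeError:
        pass  # the first run stops after committing records 0..3
    run_batch_job(records, lambda x: x * 10, sink, path)  # resumes at offset 4
    print(sink)  # [0, 10, 20, 30, 40, 50]
```

In this sketch the output write and the checkpoint write are two separate steps, so a crash between them would cause one batch to be reprocessed on restart. A production system would need either idempotent writes or a way to commit both together.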

Vectorized Aggregations for Improved Data Analysis

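RayTurbo Data now supports fully vectorized aggregations, moving computation from Python to optimized native code. This shift eliminates the performance bottlenecks associated with Python's interpreter and improves throughput on modern CPU architectures. The new aggregation capabilities are crucial for feature engineering and data summarization tasks, particularly when dealing with large datasets.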
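The difference between interpreter-bound and vectorized aggregation can be illustrated with a small grouped sum. This sketch uses NumPy as a stand-in for "optimized native code". The article does not say which native engine RayTurbo Data uses, so the choice of NumPy and both function names below are assumptions made purely for illustration.

```python
import numpy as np


def grouped_sum_python(keys, values, num_groups):
    """Row-at-a-time aggregation: every element passes through the interpreter."""
    totals = [0.0] * num_groups
    for k, v in zip(keys, values):
        totals[k] += v
    return totals


def grouped_sum_vectorized(keys, values, num_groups):
    """Column-at-a-time aggregation: one call, with the loop running in compiled code."""
    # np.bincount sums the weights that fall into each non-negative integer key.
    return np.bincount(keys, weights=values, minlength=num_groups)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.integers(0, 4, size=100_000)
    values = rng.random(100_000)
    slow = grouped_sum_python(keys.tolist(), values.tolist(), 4)
    fast = grouped_sum_vectorized(keys, values, 4)
    # Both paths compute the same per-group totals (up to float rounding).
    assert np.allclose(slow, fast)
```

Both functions return the same totals. The vectorized version simply avoids executing Python bytecode once per row, which is the bottleneck the paragraph above refers to.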

Optimized Pipeline Rules for Efficient Processing

In addition to speed enhancements, RayTurbo Data's optimizer rules have been upgraded to automatically reorder operations within data pipelines, focusing on filter and projection tasks. This optimization reduces unnecessary data processing, allowing pipelines to finish faster without any change to user-written code.
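A toy example shows why reordering a filter ahead of other work pays off. The sketch below is a generic filter-pushdown rule written in plain Python, not RayTurbo's optimizer. The `Map` and `Filter` classes, `push_filters_down`, and `execute` are hypothetical names, and the rule assumes each `Map` is a pure, one-row-in, one-row-out function that declares which columns it writes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Map:
    fn: Callable[[dict], dict]  # returns the row with new or updated columns
    writes: frozenset           # columns the function creates or modifies


@dataclass
class Filter:
    column: str
    predicate: Callable[[object], bool]


def push_filters_down(plan):
    """Swap a Filter with the Map before it when the Map does not write the filtered column."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, op = plan[i - 1], plan[i]
            if isinstance(op, Filter) and isinstance(prev, Map) and op.column not in prev.writes:
                plan[i - 1], plan[i] = op, prev
                changed = True
    return plan


def execute(plan, rows):
    """Run the plan and count how many rows the Map steps had to touch."""
    map_calls = 0
    for op in plan:
        if isinstance(op, Filter):
            rows = [r for r in rows if op.predicate(r[op.column])]
        else:
            map_calls += len(rows)
            rows = [op.fn(r) for r in rows]
    return rows, map_calls


if __name__ == "__main__":
    rows = [{"id": i, "price": i * 10} for i in range(10)]
    plan = [
        Map(lambda r: {**r, "tax": r["price"] * 0.1}, frozenset({"tax"})),
        Filter("price", lambda p: p >= 80),
    ]
    original, calls_before = execute(plan, rows)
    optimized, calls_after = execute(push_filters_down(plan), rows)
    assert original == optimized
    print(calls_before, calls_after)  # 10 2
```

The user-written plan is unchanged. Only the execution order differs, so the map step runs on the two rows that survive the filter instead of all ten.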

Performance Benchmarks and Impact

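Comprehensive benchmarks highlight RayTurbo Data's performance advantages over open-source Ray Data. In tests using the TPC-H Orders dataset, RayTurbo delivered a 1.6x to 2.6x improvement on aggregation workloads and a 3.3x to 4.9x speedup on tasks involving filters and column selection.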

The test environment consisted of one m7i.4xlarge head node and five m7i.16xlarge worker nodes, with a per-node object store allocation. These benchmarks underscore RayTurbo Data's ability to handle large-scale AI workloads more efficiently, which offers a significant competitive advantage.