Data-Juicer: Alibaba's Data-Cleaning Framework for Large Models

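Preface: Cleaning data at scale, and doing it elegantly, is something of an art. For large models in particular, data quality is one of the most decisive factors in whether a model succeeds. Alibaba recently open-sourced a data-cleaning framework aimed specifically at large language models and large video-generation models, and it is worth a look!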

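Key Features

Systematic & reusable: provides a systematic, reusable library of 80+ core operators, 20+ configuration recipes, and 20+ dedicated toolkits, designed so that multimodal data processing is independent of any specific LLM dataset or processing pipeline.

Data feedback loop & Sandbox Lab: supports one-stop data-model co-development. The Sandbox Lab enables rapid iteration, and features such as data and model feedback loops, visualization, and multi-dimensional automatic evaluation help you understand and improve both your data and your models.

Efficiency: provides efficient, parallelized data processing pipelines (Aliyun-PAI/Ray/Slurm/CUDA/operator fusion) that reduce memory footprint and CPU overhead and raise productivity.

Comprehensive data processing recipes: offers dozens of prebuilt recipes for pre-training, fine-tuning, English, Chinese, and other scenarios, validated on models such as LLaMA and LLaVA.

User-friendly: designed to be simple to use, with comprehensive documentation, quick-start guides, and demo configs. Operators can easily be added to or removed from an existing config.

Flexible & extensible: supports most data formats (such as jsonl, parquet, and csv) and lets you combine operators freely. Custom operators are supported for tailored data processing.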

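Data Processing

To process a dataset, run process_data.py or the dj-process command-line tool with the path to a config file as the argument.

# For installation from source
python tools/process_data.py --config configs/demo/process.yaml

# Using the command-line tool
dj-process --config configs/demo/process.yaml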
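If you drive these runs from a Python script or notebook rather than a shell, the two invocations above can be wrapped in a small helper. This is a minimal sketch: build_process_command is a hypothetical name introduced here for illustration, not part of Data-Juicer, and it only assembles the argument list shown above.

```python
import shlex


def build_process_command(config_path, from_source=True):
    """Return the argv list for one Data-Juicer processing run.

    Hypothetical helper: it only mirrors the two commands shown above.
    """
    if from_source:
        # Source checkout: call the script under tools/ directly.
        return ["python", "tools/process_data.py", "--config", config_path]
    # Installed package: use the dj-process command-line tool.
    return ["dj-process", "--config", config_path]


cmd = build_process_command("configs/demo/process.yaml", from_source=False)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when the config path contains spaces.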

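Note: For operators that use third-party models, you need to declare the corresponding mem_required in the config file (see the settings in config_all.yaml for reference). At run time, Data-Juicer controls the number of processes for such an operator based on the available memory and the memory its model requires, to achieve better processing throughput. When running in a CUDA environment, an incorrectly declared mem_required may lead to CUDA Out of Memory errors.

Note: Operators that use third-party models or resources not yet stored locally may be slow on their first run, because they must download those resources into the cache directory. The default download cache directory is ~/.cache/data_juicer. You can change its location by setting the shell environment variable DATA_JUICER_CACHE_HOME, and you can likewise set DATA_JUICER_MODELS_CACHE or DATA_JUICER_ASSETS_CACHE to change the model cache or the asset cache directory separately: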

# Cache home directory
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# Model cache directory
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# Asset cache directory
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
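To see where downloads would end up for a given environment, the lookup order can be sketched in Python. This is an illustration, not Data-Juicer's own code: resolve_cache_dirs is a hypothetical helper, and the fallback of models/ and assets/ living under the cache home is an assumption suggested by the example paths above.

```python
import os


def resolve_cache_dirs(env=None):
    """Resolve the three cache directories from environment variables."""
    env = os.environ if env is None else env
    # Default cache home as stated in the note above.
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "data_juicer")
    home = env.get("DATA_JUICER_CACHE_HOME", default_home)
    # Assumption: without an explicit override, the model and asset caches
    # are subdirectories of the cache home.
    models = env.get("DATA_JUICER_MODELS_CACHE", os.path.join(home, "models"))
    assets = env.get("DATA_JUICER_ASSETS_CACHE", os.path.join(home, "assets"))
    return {"home": home, "models": models, "assets": assets}


for name, path in resolve_cache_dirs().items():
    print(f"{name}: {path}")
```

Passing a plain dict as env makes it easy to check a planned configuration without touching the real environment.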

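Distributed Data Processing

Data-Juicer now supports multi-machine distributed data processing based on RAY. The corresponding demos can be run with the following commands:

# Run text data processing
python tools/process_data.py --config ./demos/process_on_ray/configs/demo.yaml

# Run video data processing
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml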

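To run data processing with RAY across multiple machines, make sure every node can access the data path, that is, mount the data path on a shared file system (such as NAS).

The deduplication operators in RAY mode differ from the single-machine versions. All of them carry the prefix ray, for example ray_video_deduplicator and ray_document_deduplicator. These operators depend on a Redis instance, so besides starting the RAY cluster you also need to start a Redis instance and fill in its host and port in the corresponding config file.

You can also skip RAY entirely: split the dataset and run on a cluster with Slurm or Alibaba Cloud PAI-DLC. In that case the original Data-Juicer without RAY is sufficient.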
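For the non-RAY route, the dataset has to be split before the cluster jobs start. One simple way to do that is a round-robin split, sketched below. The function name and the shard layout are assumptions made for illustration. Data-Juicer does not prescribe a particular splitting scheme here.

```python
def split_round_robin(records, num_shards):
    """Distribute records over num_shards lists, one record at a time."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        # Record i goes to shard i mod num_shards, so shard sizes differ by at most 1.
        shards[index % num_shards].append(record)
    return shards


# Example: five jsonl lines spread over two shards, one shard per cluster job.
lines = [f'{{"text": "sample {i}"}}' for i in range(5)]
for shard_id, shard in enumerate(split_round_robin(lines, 2)):
    print(shard_id, len(shard))
```

Each shard can then be written to its own jsonl file and passed to a separate job through that job's dataset_path.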

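Data Analysis

To analyze a dataset, run analyze_data.py or the dj-analyze command-line tool with the path to a config file as the argument.

# For installation from source
python tools/analyze_data.py --config configs/demo/analyser.yaml

# Using the command-line tool
dj-analyze --config configs/demo/analyser.yaml

Note: The Analyser only computes the stats of Filter operators. Other operators (such as Mappers and Deduplicators) are ignored during analysis.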
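To preview which entries of a process list would contribute stats, you can filter by operator name. The sketch below relies on the naming convention visible in this article, where Filter operators end in _filter. That suffix check is an assumption used for illustration and is not how the Analyser itself identifies operators.

```python
def ops_contributing_stats(process):
    """Return the names of ops in a process list that look like Filters.

    process mirrors the 'process:' section of a config: a list of
    single-key dicts mapping an op name to its arguments.
    """
    selected = []
    for op in process:
        name = next(iter(op))  # the single key is the op name
        if name.endswith("_filter"):
            selected.append(name)
    return selected


process = [
    {"whitespace_normalization_mapper": {}},
    {"language_id_score_filter": {"lang": "en", "min_score": 0.26311219}},
    {"perplexity_filter": {"lang": "en", "max_ppl": 7376.81378}},
    {"document_deduplicator": {}},
]
print(ops_contributing_stats(process))
```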

Data Visualization

Run app.py to visualize your dataset in the browser.
Note: This is only available when Data-Juicer is installed from source.

streamlit run app.py

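Sandbox Lab

The data Sandbox Lab (DJ-Sandbox) offers best practices for continuously producing data recipes. It is low-overhead, portable, and instructive.

In the sandbox, you can quickly experiment with, iterate on, and optimize data recipes using small datasets and models, then migrate to a larger scale to mass-produce high-quality data for large models.

Beyond Data-Juicer's basic data optimization and recipe fine-tuning features, the sandbox also gives convenient access to configurable components such as data insight and analysis, sandbox model training and evaluation, and recipe optimization driven by data and model feedback. Together they form a complete one-stop data-model development pipeline.

By default the sandbox is started with the following command. For more details, see the sandbox documentation.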

python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml

Video Enhancement Recipe Operators

# Process config example including:
#   - all global arguments
#   - all ops and their arguments

# global parameters
project_name: 'all'                                         # project name for distinguish your configs
dataset_path: '/path/to/a/video-text/dataset.jsonl'         # accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: '/path/to/store/refined/dataset.jsonl'
np: 48                                                      # number of subprocess to process your dataset
# Note: currently, we support specify only ONE key for each op, for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of `text_keys` when you set multiple keys.
open_tracer: true                                           # whether to open the tracer to trace the changes during process. It might take more time when opening tracer

# for multimodal data processing
video_key: 'videos'                                         # key name of field to store the list of sample video paths.
video_special_token: '<__dj__video>'                        # the special token that represents a video in the text. In default, it's "<__dj__video>". You can specify your own special token according to your input dataset.
eoc_special_token: '<|__dj__eoc|>'                          # the special token that represents the end of a chunk in the text. In default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset.

# process schedule: a list of several process operators with their arguments
# hyperparameters are set according to the 3-sigma stats on MSR-VTT dataset
process:
  - language_id_score_filter:                               # filter text in specific language with language scores larger than a specific max value
      lang: en                                              # keep text in what language
      min_score: 0.26311219                                 # the min language scores to filter text
  - perplexity_filter:                                      # filter text with perplexity score out of specific range
      lang: en                                              # compute perplexity in what language
      max_ppl: 7376.81378                                   # the max perplexity score to filter text
  - video_aesthetics_filter:                                # filter samples according to the aesthetics score of frame images extracted from videos.
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE  # Huggingface model name for the aesthetics predictor
      min_score: 0.31767486                                 # the min aesthetics score of filter range
      max_score: 1.0                                        # the max aesthetics score of filter range
      frame_sampling_method: 'uniform'                      # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former one extracts all key frames and the latter one extracts specified number of frames uniformly from the video. Default: "uniform" with frame_num=3, considering that the number of keyframes can be large while their difference is usually small in terms of their aesthetics.
      frame_num: 3                                          # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                                      # reduce mode to the all frames extracted from videos, must be one of ['avg','max', 'min'].
      any_or_all: any                                       # keep this sample when any/all images meet the filter condition
  - video_frames_text_similarity_filter:                    # keep samples those similarities between sampled video frame images and text within a specific range.
      hf_clip: openai/clip-vit-base-patch32                 # clip model name on huggingface to compute the similarity between frame image and text. It's kind of language-related. For example, for Chinese datasets, ChineseCLIP might be a better choice.
      min_score: 0.16571071                                 # the min similarity to keep samples.
      max_score: 1.0                                        # the max similarity to keep samples.
      frame_sampling_method: all_keyframes                  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former one extracts all key frames and the latter one extracts specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                                          # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      horizontal_flip: false                                # flip frame image horizontally (left to right).
      vertical_flip: false                                  # flip frame image vertically (top to bottom).
      reduce_mode: avg                                      # reduce mode when one text corresponds to multiple videos in a chunk, must be one of ['avg','max', 'min'].
      any_or_all: any                                       # keep this sample when any/all videos meet the filter condition
  - video_motion_score_filter:                              # Keep samples with video motion scores within a specific range.
      min_score: 0.25                                       # the minimum motion score to keep samples
      max_score: 10000.0                                    # the maximum motion score to keep samples
      sampling_fps: 2                                       # the sampling rate of frames_per_second to compute optical flow
      any_or_all: any                                       # keep this sample when any/all videos meet the filter condition
  - video_nsfw_filter:                                      # filter samples according to the nsfw scores of videos in them
      hf_nsfw_model: Falconsai/nsfw_image_detection         # Huggingface model name for nsfw classification
      score_threshold: 0.34847191                           # the nsfw score threshold for samples, range from 0 to 1
      frame_sampling_method: all_keyframes                  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former one extracts all key frames and the latter one extracts specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                                          # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                                      # reduce mode for multiple sampled video frames to compute nsfw scores of videos, must be one of ['avg','max', 'min'].
      any_or_all: any                                       # keep this sample when any/all images meet the filter condition
  - video_watermark_filter:                                 # filter samples according to the predicted watermark probabilities of videos in them
      hf_watermark_model: amrul-hzz/watermark_detector      # Huggingface model name for watermark classification
      prob_threshold: 0.96510297                            # the predicted watermark probability threshold for samples, range from 0 to 1
      frame_sampling_method: all_keyframes                  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former one extracts all key frames and the latter one extracts specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                                          # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                                      # reduce mode for multiple sampled video frames to compute final predicted watermark probabilities of videos, must be one of ['avg','max', 'min'].
      any_or_all: any                                       # keep this sample when any/all images meet the filter condition
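The comments on frame_num, reduce_mode, and any_or_all in the recipe above describe three small rules that are easier to follow in code. The sketch below restates them over frame indices. It is an illustration of the documented behaviour, not Data-Juicer's implementation: all function names are hypothetical, and the real operators sample by video duration rather than by index.

```python
def uniform_frame_indices(total_frames, frame_num):
    """Pick frame indices following the frame_num rule in the config comments."""
    if total_frames < 1 or frame_num < 1:
        raise ValueError("total_frames and frame_num must be positive")
    frame_num = min(frame_num, total_frames)  # cannot pick more frames than exist
    if frame_num == 1:
        return [total_frames // 2]            # only the middle frame
    if frame_num == 2:
        return [0, total_frames - 1]          # only the first and last frames
    # First and last frames plus evenly spaced frames in between.
    step = (total_frames - 1) / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]


def reduce_scores(frame_scores, reduce_mode="avg"):
    """Collapse per-frame scores into one score per video."""
    if reduce_mode == "avg":
        return sum(frame_scores) / len(frame_scores)
    if reduce_mode == "max":
        return max(frame_scores)
    if reduce_mode == "min":
        return min(frame_scores)
    raise ValueError("reduce_mode must be one of 'avg', 'max', 'min'")


def keep_sample(video_scores, min_score, max_score, any_or_all="any"):
    """Decide whether a sample with one score per video passes the filter."""
    flags = [min_score <= score <= max_score for score in video_scores]
    return any(flags) if any_or_all == "any" else all(flags)


print(uniform_frame_indices(101, 3))
print(keep_sample([0.1, 0.5], min_score=0.31767486, max_score=1.0, any_or_all="any"))
```

With any_or_all set to any, a sample survives as soon as one of its videos falls inside the score range. Switching to all makes the filter stricter for samples that contain several videos.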

Example: Processing Video Data with Data-Juicer

2.1 Clone the data-juicer source code

# Skip this step if the repository has already been downloaded
!cd dj_sora_challenge/toolkit && git clone https://github.com/modelscope/data-juicer
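When the notebook is re-run, the "skip if already downloaded" step can be automated with a directory check. This is a minimal sketch: clone_command is a hypothetical helper, and it assumes the checkout lives in a data-juicer folder under the toolkit directory, as in the command above.

```python
import os

REPO_URL = "https://github.com/modelscope/data-juicer"


def clone_command(toolkit_dir):
    """Return the git clone argv, or None if the checkout already exists."""
    target = os.path.join(toolkit_dir, "data-juicer")
    if os.path.isdir(target):
        return None  # already cloned, nothing to do
    return ["git", "clone", REPO_URL, target]


cmd = clone_command("dj_sora_challenge/toolkit")
print("already cloned" if cmd is None else " ".join(cmd))
# To actually clone: subprocess.run(cmd, check=True)
```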

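2.2 Run data-juicer

Data-Juicer uses a config file to specify which operators process the data. You can combine different operators and parameters as needed to tune the preprocessing pipeline.

To make it easier to try DJ-SORA, run the code below to download a reference config file.

The sample shows text-region filtering (video_ocr_area_ratio_filter) followed by aesthetics filtering (video_aesthetics_filter) with Data-Juicer.

Detailed descriptions of the operators are in the official DJ-SORA documentation: docs/DJ_SORA_ZH.md on the main branch of the modelscope/data-juicer GitHub repository.

You can find the operator definitions and the meaning of their parameters in the data-juicer/data_juicer/ops folder, and modify them to adjust the preprocessing pipeline.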

# Download the reference config file (no need to run this more than once)
!wget https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/config/demo_config.yaml

# Download the required model files (no need to run this more than once)
import os
dj_path = os.path.join(os.getcwd(),'dj_sora_challenge/toolkit/data-juicer')
download_dj_preprocess_model('video_ocr_area_ratio_filter', dj_path)
download_dj_preprocess_model('video_aesthetics_filter', dj_path)

# Data preprocessing (edit the config file to run a different preprocessing pipeline)
dj_path = os.path.join(os.getcwd(),'dj_sora_challenge/toolkit/data-juicer')
config_path = os.path.join(os.getcwd(), 'demo_config.yaml')
!cd {dj_path} && PYTHONPATH=./ python tools/process_data.py --config {config_path}
# The preprocessed results are stored in dj_sora_challenge/output/processed_data/processed_data.jsonl
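After the run finishes, a quick sanity check is to count how many samples survived filtering. The sketch below reads a jsonl file line by line. It assumes each line is a JSON object and that video paths sit under a videos key, as in the recipe shown earlier. The demo config may use a different key, so adjust video_key accordingly.

```python
import json


def summarize_jsonl(path, video_key="videos"):
    """Count the samples and referenced videos in a processed jsonl file."""
    num_samples = 0
    num_videos = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            sample = json.loads(line)
            num_samples += 1
            # Assumption: the value under video_key is a list of video paths.
            num_videos += len(sample.get(video_key, []))
    return num_samples, num_videos


# Usage, with the output path from the step above:
# print(summarize_jsonl("dj_sora_challenge/output/processed_data/processed_data.jsonl"))
```

Comparing these counts with the input dataset shows how aggressive the current filter thresholds are.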

Preset Models

To make DJ easier to use, we provide preset models for commonly used operators. You can fetch them by calling download_dj_preprocess_model(op_name, dj_path).

List of preset models:

Operator	Function	URL
video_ocr_area_ratio_filter	Removes samples whose text area is too large	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/craft_mlt_25k.pth
video_aesthetics_filter	Extracts frames, then filters by aesthetics score	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_aesthetic.tar.gz
video_frames_text_similarity_filter	Filters along the spatio-temporal consistency dimension by computing the matching score between key/specified frames and the text	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_clip.tar.gz
video_nsfw_filter	Removes non-compliant samples	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_nsfw.tar.gz
video_watermark_filter	Removes samples with watermarks	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_watermark.tar.gz
video_tagging_from_frames_filter	Lightweight image-to-text model; generates summary spatial information from densely sampled frames	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/ram_plus_swin_large_14m.pth https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_bert.tar.gz
video_captioning_from_frames_mapper	Heavier image-to-text model; generates more detailed spatial information from a small number of frames	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_blip.tar.gz
video_captioning_from_video_mapper	Video-to-text model; generates temporal information from consecutive frames	https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/aigc-data/easyanimate/models/preprocess/models_video_blip.tar.gz

# Download preset models
# download_dj_preprocess_model('video_nsfw_filter', dj_path)
# download_dj_preprocess_model('video_frames_text_similarity_filter', dj_path)
# download_dj_preprocess_model('video_watermark_filter', dj_path)
# download_dj_preprocess_model('video_tagging_from_frames_mapper', dj_path)
# download_dj_preprocess_model('video_captioning_from_frames_mapper', dj_path)
# download_dj_preprocess_model('video_captioning_from_video_mapper', dj_path)
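The article does not show how download_dj_preprocess_model works internally. One plausible first step is a lookup from operator name to the URLs in the table above, sketched here for three of the operators. The names PRESET_MODEL_FILES and preset_model_urls are hypothetical, and the sketch only resolves URLs without downloading anything.

```python
BASE_URL = (
    "https://pai-vision-data-wlcb.oss-cn-wulanchabu.aliyuncs.com/"
    "aigc-data/easyanimate/models/preprocess/"
)

# Subset of the preset model table: operator name -> list of file names.
PRESET_MODEL_FILES = {
    "video_ocr_area_ratio_filter": ["craft_mlt_25k.pth"],
    "video_aesthetics_filter": ["models_aesthetic.tar.gz"],
    "video_nsfw_filter": ["models_nsfw.tar.gz"],
}


def preset_model_urls(op_name):
    """Return the download URLs listed for op_name, or raise KeyError."""
    try:
        files = PRESET_MODEL_FILES[op_name]
    except KeyError:
        raise KeyError(f"no preset model listed for operator {op_name!r}") from None
    return [BASE_URL + name for name in files]


for url in preset_model_urls("video_nsfw_filter"):
    print(url)
```

Keeping a list per operator leaves room for entries that need more than one file, as one row of the table does.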

Project page:

GitHub - modelscope/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷