How to brute-force download specific data files with huggingface_hub

Reference:
https://huggingface.co/docs/huggingface_hub/guides/download

As the guide shows, you can download either a single file or every file in a repository.
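As a rough illustration of the two options, the sketch below wraps hf_hub_download (one file) and snapshot_download (whole repository). The wrapper names download_one and download_all are made up for this example, and the imports sit inside the functions only so that the snippet loads even where huggingface_hub is not installed.

```python
from typing import List, Optional


def download_one(repo_id: str, filename: str, local_dir: str, repo_type: str = "dataset") -> str:
    """Download a single file from a repo and return its local path."""
    from huggingface_hub import hf_hub_download  # imported lazily, see the note above

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type=repo_type,
        local_dir=local_dir,
    )


def download_all(
    repo_id: str,
    local_dir: str,
    repo_type: str = "dataset",
    allow_patterns: Optional[List[str]] = None,
) -> str:
    """Download every file of a repo (optionally filtered) and return the local folder."""
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
```

A call would look like download_one("some-org/some-dataset", "README.md", "./data"), where the repo id, file name and folder are hypothetical placeholders; both functions need network access when called.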

Here is an example that uses snapshot_download:

https://qq742971636.blog.csdn.net/article/details/135150482

snapshot_download downloads all the files by calling hf_hub_download, which is the function responsible for downloading a single file.
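The relationship between the two functions can be pictured with a small sketch: one per-file function, fanned out over a thread pool. This is not the library's code. fetch_all and fake_fetch are hypothetical names, and fake_fetch merely stands in for hf_hub_download so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(repo_files: List[str], fetch_one: Callable[[str], str], max_workers: int = 8) -> List[str]:
    """Call fetch_one for every file using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, repo_files))


def fake_fetch(name: str) -> str:
    # Hypothetical stand-in for hf_hub_download: returns the path a file would get.
    return "/tmp/demo/" + name


paths = fetch_all(["a.txt", "data/b.parquet"], fake_fetch)
```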

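One annoying thing about snapshot_download: even when a file already exists in your local directory, running snapshot_download downloads it again. The docstring explains why: if local_dir_use_symlinks=False and the blob file is not in the cache directory, the file is downloaded and placed directly under local_dir. This means that if you need to download it again later, it is re-downloaded in full.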

    If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
    how you want to move those files:
      - If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
        files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
        is to be able to manually edit and save small files without corrupting the cache while saving disk space for
        binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
        environment variable.
      - If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
        This is optimal in term of disk usage but files must not be manually edited.
      - If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
        local dir. This means disk usage is not optimized.
      - Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
        files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
        they will be re-downloaded entirely.
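The four cases in the docstring can be summarised as a small decision function. This is only a restatement of the quoted text, not the library's implementation. The name placement, the returned labels and the constant AUTO_SYMLINK_THRESHOLD are hypothetical, and the threshold value assumes that "5MB" means 5 * 1024 * 1024 bytes.

```python
from typing import Union

# Assumed byte value of the 5MB threshold mentioned in the docstring.
AUTO_SYMLINK_THRESHOLD = 5 * 1024 * 1024


def placement(use_symlinks: Union[bool, str], size_bytes: int, blob_in_cache: bool) -> str:
    """Describe where a file ends up for a given local_dir_use_symlinks setting."""
    if use_symlinks == "auto":
        # Always cached; small files are copied into local_dir, big ones symlinked.
        if size_bytes < AUTO_SYMLINK_THRESHOLD:
            return "cached, copied to local_dir"
        return "cached, symlinked in local_dir"
    if use_symlinks is True:
        return "cached, symlinked in local_dir"
    # use_symlinks is False
    if blob_in_cache:
        return "copied from cache to local_dir"
    # The case that causes the repeated downloads described above.
    return "downloaded straight into local_dir, not cached"


small_auto = placement("auto", 1024, blob_in_cache=False)
uncached_plain = placement(False, 10**9, blob_in_cache=False)
```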

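What a mess. Files that already exist must be removed from filtered_repo_files; otherwise everything is downloaded again on every run and the function is unusable. I chose to edit the library source directly.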
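The filtering step on its own can be sketched as below, assuming that a file counts as "already downloaded" simply because its path exists under the local directory. drop_existing is a hypothetical helper name, and the temporary folder and file names exist only for the demonstration.

```python
import os
import tempfile
from typing import List


def drop_existing(repo_files: List[str], local_dir: str) -> List[str]:
    """Keep only the repo files that are not yet present under local_dir."""
    return [name for name in repo_files if not os.path.exists(os.path.join(local_dir, name))]


with tempfile.TemporaryDirectory() as demo_dir:
    # Pretend one nested file was downloaded on an earlier run.
    os.makedirs(os.path.join(demo_dir, "data"))
    open(os.path.join(demo_dir, "data", "part-0.parquet"), "w").close()

    remaining = drop_existing(
        ["README.md", "data/part-0.parquet", "data/part-1.parquet"],
        demo_dir,
    )
```

An existence check cannot tell a complete file from a truncated one, so a partially downloaded file would be skipped as well.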


Edit the file /opt/xiedong/miniconda3/envs/datacomp/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py:


    import os
    from pathlib import Path
    from typing import Dict, List, Literal, Optional, Union

    from tqdm.auto import tqdm as base_tqdm
    from tqdm.contrib.concurrent import thread_map

    from .constants import (
        DEFAULT_ETAG_TIMEOUT,
        DEFAULT_REVISION,
        HF_HUB_CACHE,
        HF_HUB_ENABLE_HF_TRANSFER,
        REPO_TYPES,
    )
    from .file_download import REGEX_COMMIT_HASH, hf_hub_download, repo_folder_name
    from .hf_api import HfApi
    from .utils import filter_repo_objects, logging, validate_hf_hub_args
    from .utils import tqdm as hf_tqdm


    logger = logging.get_logger(__name__)


    @validate_hf_hub_args
    def snapshot_download(
        repo_id: str,
        *,
        repo_type: Optional[str] = None,
        revision: Optional[str] = None,
        cache_dir: Union[str, Path, None] = None,
        local_dir: Union[str, Path, None] = None,
        local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
        library_name: Optional[str] = None,
        library_version: Optional[str] = None,
        user_agent: Optional[Union[Dict, str]] = None,
        proxies: Optional[Dict] = None,
        etag_timeout: float = DEFAULT_ETAG_TIMEOUT,
        resume_download: bool = False,
        force_download: bool = False,
        token: Optional[Union[bool, str]] = None,
        local_files_only: bool = False,
        allow_patterns: Optional[Union[List[str], str]] = None,
        ignore_patterns: Optional[Union[List[str], str]] = None,
        max_workers: int = 8,
        tqdm_class: Optional[base_tqdm] = None,
        endpoint: Optional[str] = None,
    ) -> str:
        """Download repo files.

        Download a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from
        a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order
        to keep their actual filename relative to that folder. You can also filter which files to download using
        `allow_patterns` and `ignore_patterns`.

        If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
        how you want to move those files:
          - If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
            files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
            is to be able to manually edit and save small files without corrupting the cache while saving disk space for
            binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
            environment variable.
          - If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
            This is optimal in term of disk usage but files must not be manually edited.
          - If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
            local dir. This means disk usage is not optimized.
          - Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
            files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
            they will be re-downloaded entirely.

        An alternative would be to clone the repo but this requires git and git-lfs to be installed and properly
        configured. It is also not possible to filter which files to download when cloning a repository using git.

        Args:
            repo_id (`str`):
                A user or an organization name and a repo name separated by a `/`.
            repo_type (`str`, *optional*):
                Set to `"dataset"` or `"space"` if downloading from a dataset or space,
                `None` or `"model"` if downloading from a model. Default is `None`.
            revision (`str`, *optional*):
                An optional Git revision id which can be a branch name, a tag, or a
                commit hash.
            cache_dir (`str`, `Path`, *optional*):
                Path to the folder where cached files are stored.
            local_dir (`str` or `Path`, *optional*):
                If provided, the downloaded files will be placed under this directory, either as symlinks (default) or
                regular files (see description for more details).
            local_dir_use_symlinks (`"auto"` or `bool`, defaults to `"auto"`):
                To be used with `local_dir`. If set to "auto", the cache directory will be used and the file will be either
                duplicated or symlinked to the local directory depending on its size. If set to `True`, a symlink will be
                created, no matter the file size. If set to `False`, the file will either be duplicated from cache (if
                already exists) or downloaded from the Hub and not cached. See description for more details.
            library_name (`str`, *optional*):
                The name of the library to which the object corresponds.
            library_version (`str`, *optional*):
                The version of the library.
            user_agent (`str`, `dict`, *optional*):
                The user-agent info in the form of a dictionary or a string.
            proxies (`dict`, *optional*):
                Dictionary mapping protocol to the URL of the proxy passed to
                `requests.request`.
            etag_timeout (`float`, *optional*, defaults to `10`):
                When fetching ETag, how many seconds to wait for the server to send
                data before giving up which is passed to `requests.request`.
            resume_download (`bool`, *optional*, defaults to `False`):
                If `True`, resume a previously interrupted download.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether the file should be downloaded even if it already exists in the local cache.
            token (`str`, `bool`, *optional*):
                A token to be used for the download.
                    - If `True`, the token is read from the HuggingFace config
                      folder.
                    - If a string, it's used as the authentication token.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, avoid downloading the file and return the path to the
                local cached file if it exists.
            allow_patterns (`List[str]` or `str`, *optional*):
                If provided, only files matching at least one pattern are downloaded.
            ignore_patterns (`List[str]` or `str`, *optional*):
                If provided, files matching any of the patterns are not downloaded.
            max_workers (`int`, *optional*):
                Number of concurrent threads to download files (1 thread = 1 file download).
                Defaults to 8.
            tqdm_class (`tqdm`, *optional*):
                If provided, overwrites the default behavior for the progress bar. Passed
                argument must inherit from `tqdm.auto.tqdm` or at least mimic its behavior.
                Note that the `tqdm_class` is not passed to each individual download.
                Defaults to the custom HF progress bar that can be disabled by setting
                `HF_HUB_DISABLE_PROGRESS_BARS` environment variable.

        Returns:
            Local folder path (string) of repo snapshot

        <Tip>

        Raises the following errors:

        - [`EnvironmentError`](https://docs.python.org/3/library/exceptions.html#EnvironmentError)
          if `token=True` and the token cannot be found.
        - [`OSError`](https://docs.python.org/3/library/exceptions.html#OSError) if
          ETag cannot be determined.
        - [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError)
          if some parameter value is invalid

        </Tip>
        """
        if cache_dir is None:
            cache_dir = HF_HUB_CACHE
        if revision is None:
            revision = DEFAULT_REVISION
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)

        if repo_type is None:
            repo_type = "model"
        if repo_type not in REPO_TYPES:
            raise ValueError(f"Invalid repo type: {repo_type}. Accepted repo types are: {str(REPO_TYPES)}")

        storage_folder = os.path.join(cache_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type))
        print(f"storage_folder: {storage_folder}")

        # if we have no internet connection we will look for an
        # appropriate folder in the cache
        # If the specified revision is a commit hash, look inside "snapshots".
        # If the specified revision is a branch or tag, look inside "refs".
        if local_files_only:
            if REGEX_COMMIT_HASH.match(revision):
                commit_hash = revision
            else:
                # retrieve commit_hash from file
                ref_path = os.path.join(storage_folder, "refs", revision)
                with open(ref_path) as f:
                    commit_hash = f.read()

            snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

            if os.path.exists(snapshot_folder):
                return snapshot_folder

            raise ValueError(
                "Cannot find an appropriate cached snapshot folder for the specified"
                " revision on the local disk and outgoing traffic has been disabled. To"
                " enable repo look-ups and downloads online, set 'local_files_only' to"
                " False."
            )
        print(f"revision: {revision}")

        # if we have internet connection we retrieve the correct folder name from the huggingface api
        api = HfApi(library_name=library_name, library_version=library_version, user_agent=user_agent, endpoint=endpoint)
        repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision, token=token)
        assert repo_info.sha is not None, "Repo info returned from server must have a revision sha."
        assert repo_info.siblings is not None, "Repo info returned from server must have a siblings list."
        filtered_repo_files = list(
            filter_repo_objects(
                items=[f.rfilename for f in repo_info.siblings],
                allow_patterns=allow_patterns,
                ignore_patterns=ignore_patterns,
            )
        )
        # print(f"filtered_repo_files: {filtered_repo_files}")
        commit_hash = repo_info.sha
        print(f"commit_hash: {commit_hash}")
        snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
        print(f"snapshot_folder: {snapshot_folder}")

        # if passed revision is not identical to commit_hash
        # then revision has to be a branch name or tag name.
        # In that case store a ref.
        if revision != commit_hash:
            ref_path = os.path.join(storage_folder, "refs", revision)
            os.makedirs(os.path.dirname(ref_path), exist_ok=True)
            with open(ref_path, "w") as f:
                f.write(commit_hash)
            print(f"ref_path: {ref_path}")

        # we pass the commit_hash to hf_hub_download
        # so no network call happens if we already
        # have the file locally.
        def _inner_hf_hub_download(repo_file: str):
            return hf_hub_download(
                repo_id,
                filename=repo_file,
                repo_type=repo_type,
                revision=commit_hash,
                endpoint=endpoint,
                cache_dir=cache_dir,
                local_dir=local_dir,
                local_dir_use_symlinks=local_dir_use_symlinks,
                library_name=library_name,
                library_version=library_version,
                user_agent=user_agent,
                proxies=proxies,
                etag_timeout=etag_timeout,
                resume_download=resume_download,
                force_download=force_download,
                token=token,
            )

        # Modification: remove from filtered_repo_files every file that already exists in local_dir.
        # The full relative path is checked (rather than os.listdir(local_dir), which only sees the
        # top level), so files inside subfolders are skipped too. Nothing is filtered without local_dir.
        if local_dir is not None:
            filtered_repo_files = [
                file for file in filtered_repo_files if not os.path.exists(os.path.join(local_dir, file))
            ]
        print("sorted filtered_repo_files")
        filtered_repo_files.sort()
        print(f"len(filtered_repo_files): {len(filtered_repo_files)}")
        if len(filtered_repo_files) == 0:
            raise ValueError(
                "No files to download. Either every file already exists in local_dir, or the"
                " allow_patterns and ignore_patterns parameters matched nothing."
            )

        if HF_HUB_ENABLE_HF_TRANSFER:
            print("single-threaded download")
            # print(f"filtered_repo_files: {filtered_repo_files}")
            # when using hf_transfer we don't want extra parallelism
            # from the one hf_transfer provides
            for file in filtered_repo_files:
                _inner_hf_hub_download(file)
        else:
            print("multi-threaded download")
            print(f"max_workers: {max_workers}")
            # print(f"filtered_repo_files: {filtered_repo_files}")
            thread_map(
                _inner_hf_hub_download,
                filtered_repo_files,
                desc=f"Fetching {len(filtered_repo_files)} files",
                max_workers=max_workers,
                # User can use its own tqdm class or the default one from `huggingface_hub.utils`
                tqdm_class=tqdm_class or hf_tqdm,
            )

        if local_dir is not None:
            return str(os.path.realpath(local_dir))
        return snapshot_folder

Skipping the files that have already been downloaded makes things much more pleasant.

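Once the download finishes, verify the files with a checksum (for example MD5) to make sure everything was downloaded correctly. Any file that fails the check can be downloaded again individually with hf_hub_download.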
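A minimal sketch of such a check follows. It uses SHA-256 from the standard library rather than any particular tool, and it assumes you already have a mapping of file name to expected digest. Where those digests come from is not covered here, and the helper names and demo files are hypothetical.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_files(local_dir: str, expected: Dict[str, str]) -> List[str]:
    """Return the names whose file is missing or whose digest differs from the expected one."""
    bad = []
    for name, want in expected.items():
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path) or file_sha256(path) != want:
            bad.append(name)
    return sorted(bad)


with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "ok.bin"), "wb") as f:
        f.write(b"hello")
    with open(os.path.join(demo_dir, "broken.bin"), "wb") as f:
        f.write(b"hel")  # simulates a truncated download

    good_digest = hashlib.sha256(b"hello").hexdigest()
    expected = {"ok.bin": good_digest, "broken.bin": good_digest, "missing.bin": good_digest}
    bad_files = find_bad_files(demo_dir, expected)
```

Each name in bad_files is a candidate for a fresh single-file download with hf_hub_download, after deleting the damaged local copy so that it is not skipped again.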

