Training an LLM requires data. Many open-source datasets are published online and can be used as-is:
https://github.com/huggingface/datasets
https://huggingface.co/datasets
Hugging Face datasets can be loaded directly from Python, for example: squad_dataset = load_dataset("squad")
https://datasetsearch.research.google.com/
https://www.kaggle.com/datasets
https://www.paperswithcode.com/datasets
https://www.cluebenchmarks.com/dataSet_search.html
https://www.datasetlist.com/
https://tinyletter.com/data-is-plural
https://jupyter-tutorial.readthedocs.io/en/latest/data/index.html
https://www.openml.org/search?type=data
https://github.com/InsaneLife/ChineseNLPCorpus
https://github.com/awesomedata/awesome-public-datasets
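
As a minimal sketch of the Hugging Face route mentioned above, the snippet below loads a few SQuAD records with load_dataset and turns each into a plain-text sample. The helper names (load_squad_sample, format_example, print_squad_sample) and the "Context / Question / Answer" layout are illustrative assumptions, not part of the library. The field names (context, question, answers.text) follow the SQuAD dataset as published on the Hub. It assumes the datasets package is installed (pip install datasets) and that network access is available on first use.

```python
def load_squad_sample(n=3):
    """Return the first n training examples of SQuAD.

    The dataset is downloaded and cached on the first call.
    """
    # Imported lazily so format_example() can be used without the library.
    from datasets import load_dataset

    # "train[:n]" is the split-slicing syntax supported by load_dataset.
    return load_dataset("squad", split=f"train[:{n}]")


def format_example(example):
    """Turn one SQuAD-style record into a plain-text sample (hypothetical layout)."""
    answers = example["answers"]["text"]
    answer = answers[0] if answers else ""
    return (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Answer: {answer}"
    )


def print_squad_sample(n=3):
    """Download a small sample and print it in the text layout above."""
    for example in load_squad_sample(n):
        print(format_example(example))
        print("-" * 40)
```

Calling print_squad_sample() prints the first three training records. format_example works on any dict with the same fields, so it can be tried offline on a hand-written record before downloading anything.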