Image source: Pixabay
Source: BMJ
Translation: 阿金
Review: 戚译引
Many studies claim that artificial intelligence can interpret medical images as well as, or better than, human experts. But a review recently published in The BMJ finds that these studies are of poor quality and arguably exaggerated, and the researchers warn that this poses a risk to the safety of millions of patients.
Their findings suggest that the evidence underpinning many of these studies is of poor quality, highlighting the need for higher standards of study design and reporting.
Artificial intelligence is an innovative and fast-moving field with the potential to improve patient care and relieve overburdened health services. Deep learning, a branch of AI, has shown particular promise in medical imaging. The number of published deep learning studies keeps growing, and some media headlines proclaiming that AI outperforms human doctors have fuelled hype and a push for rapid adoption. In reality, however, the methods and risk of bias of these studies have not been examined in detail.
A team of researchers therefore reviewed studies published over the past decade that compared the performance of deep learning algorithms with that of expert clinicians in interpreting medical images. They found only two eligible randomised clinical trials and 81 non-randomised studies.
Of the non-randomised studies, only nine were prospective (following individuals and collecting data over time), and just six were tested in a "real world" clinical setting.
The median number of human experts in the comparator groups was only four, and access to raw data and code (which would allow independent scrutiny of the results) was severely limited.
More than two thirds of the studies (58 of 81) were judged to be at high risk of bias (problems in study design that can affect the results), and adherence to recognised reporting standards was often poor.
Three quarters of the studies (61 of 81) stated that the performance of AI was at least comparable to (or better than) that of clinicians, yet only 31 (38%) stated that further prospective studies or trials were required.
The researchers acknowledge some limitations of their review, such as the possibility of missed studies and its focus on deep learning studies of medical imaging, which means the results may not apply to other types of AI.
Nevertheless, they say that at present "many arguably exaggerated claims exist about equivalence with (or superiority over) clinicians, which presents a potential risk for patient safety and population health at the societal level."
They also warn: "Overpromising language leaves studies susceptible to being misinterpreted by the media and the public, and as a result may lead to the provision of inappropriate care that does not necessarily align with patients' best interests."
"Maximising patient safety will be best served by ensuring that we develop a high quality and transparently reported evidence base moving forward," they conclude.
Original article:
https://eurekalert.org/pub_releases/2020-03/b-co032320.php
Paper information
[Title] Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies
[Authors] Myura Nagendran, Yang Chen, Christopher A Lovejoy, Anthony C Gordon, Matthieu Komorowski, Hugh Harvey, Eric J Topol, John P A Ioannidis, Gary S Collins, Mahiben Maruthappu
[Journal] BMJ
[Date] 2020.03.25
[Link] https://www.bmj.com/content/368/bmj.m689
[Abstract]
Objective To systematically examine the design, reporting standards, risk of bias, and claims of studies comparing the performance of diagnostic deep learning algorithms for medical imaging with that of expert clinicians.
Design Systematic review.
Data sources Medline, Embase, Cochrane Central Register of Controlled Trials, and the World Health Organization trial registry from 2010 to June 2019.
Eligibility criteria for selecting studies Randomised trial registrations and non-randomised studies comparing the performance of a deep learning algorithm in medical imaging with a contemporary group of one or more expert clinicians. Medical imaging has seen a growing interest in deep learning research. The main distinguishing feature of convolutional neural networks (CNNs) in deep learning is that when CNNs are fed with raw data, they develop their own representations needed for pattern recognition. The algorithm learns for itself the features of an image that are important for classification rather than being told by humans which features to use. The selected studies aimed to use medical imaging for predicting absolute risk of existing disease or classification into diagnostic groups (eg, disease or non-disease). For example, raw chest radiographs tagged with a label such as pneumothorax or no pneumothorax and the CNN learning which pixel patterns suggest pneumothorax.
Review methods Adherence to reporting standards was assessed by using CONSORT (consolidated standards of reporting trials) for randomised studies and TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) for non-randomised studies. Risk of bias was assessed by using the Cochrane risk of bias tool for randomised studies and PROBAST (prediction model risk of bias assessment tool) for non-randomised studies.
Results Only 10 records were found for deep learning randomised clinical trials, two of which have been published (with low risk of bias, except for lack of blinding, and high adherence to reporting standards) and eight are ongoing. Of 81 non-randomised clinical trials identified, only nine were prospective and just six were tested in a real world clinical setting. The median number of experts in the comparator group was only four (interquartile range 2-9). Full access to all datasets and code was severely limited (unavailable in 95% and 93% of studies, respectively). The overall risk of bias was high in 58 of 81 studies and adherence to reporting standards was suboptimal (<50% adherence for 12 of 29 TRIPOD items). 61 of 81 studies stated in their abstract that performance of artificial intelligence was at least comparable to (or better than) that of clinicians. Only 31 of 81 studies (38%) stated that further prospective studies or trials were required.
Conclusions Few prospective deep learning studies and randomised trials exist in medical imaging. Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small. Future studies should diminish risk of bias, enhance real world clinical relevance, improve reporting and transparency, and appropriately temper conclusions.
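To make the abstract's description of CNNs concrete, below is a minimal, hypothetical sketch of the kind of classifier it refers to: raw images go in, and the network learns for itself which pixel patterns predict a binary label such as pneumothorax versus no pneumothorax. It assumes PyTorch; the model name (TinyCNN), the architecture, and all hyperparameters are illustrative choices, not the design of any study in the review.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers develop their own representations from raw
        # pixels; no human tells the model which image features to use.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Two output logits: disease vs non-disease.
        self.classifier = nn.Linear(32 * 56 * 56, 2)

    def forward(self, x):
        x = self.features(x)              # (batch, 32, 56, 56) for 224x224 input
        return self.classifier(x.flatten(1))

model = TinyCNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random tensors standing in for labelled
# radiographs (label 1 = pneumothorax, 0 = no pneumothorax).
images = torch.randn(8, 1, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()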