Artificial intelligence (AI) in neurology
AI offers advantages in terms of speed, accuracy, and data processing. But what implications for everyday clinical practice are we seeing already?
AI research in medicine has accelerated significantly in recent years
- Although AI offers advantages in terms of speed, precision and the ability to process large amounts of data, the expertise of experienced clinicians and the reliability of proven diagnostic procedures still play an important role in the clinical context.1
- Experts continue to express concerns that AI - if implemented indiscriminately - could worsen outcomes for many patients.2
'It is a serious mistake to treat health as if it were like any other consumer need.'2
Dr Josh Tamayo-Sarver, an American physician and advocate of using artificial intelligence to improve quality and efficiency in healthcare, tested how many diagnoses OpenAI's chatbot would get right during routine shifts in an emergency department. He found that relying on it quickly became dangerous for patients.
As the owner of five US patents in the field of healthcare technology, including two relating to machine learning models, Dr Tamayo-Sarver did not ‘give up’ on ChatGPT because of this, but continued to use and test it. In a later article, he nevertheless summarised: ‘While programmes like ChatGPT are very exciting for the future of medicine, they also come with some worrying drawbacks [...].’ 2
One major problem is that AI-generated answers depend heavily on the accuracy of the question and the quality of the input - and therefore on the user. An inherent bias is that only the issues the user thinks to ask about can be answered at all.
Furthermore, AI algorithms are only as good as the data they are trained with. Bias is a known problem in AI systems, especially in healthcare, where the training data is not necessarily representative of diverse patient populations.1
Doctors outperform ChatGPT when cases are not in multiple-choice format
Furthermore, no matter how much information is available, one thing cannot be replaced: clinical judgement. Although ChatGPT can answer multiple-choice questions, reproduce facts and respond to familiar questions, it reaches its limits with case histories - especially when the clinical presentation is not classic or 'textbook'.
This was impressively demonstrated by a Swedish study in which the performance of ChatGPT (GPT-4) in writing free-text assessments of complex primary care cases was compared with that of real doctors.4 In these case vignettes, taken from the specialist examination in general practice, even average doctors achieved significantly better results than ChatGPT, with the best doctors clearly ahead.
Data are also available for the British neurology specialty examination, analysing the performance of different ChatGPT versions - albeit without a direct comparison with human doctors.5 ChatGPT 3.5 Legacy and ChatGPT 3.5 Default scored 42% and 57% respectively, falling short of the pass mark of 58%. ChatGPT-4, on the other hand, achieved the highest score at 64%, just above the pass mark. In practice, 6.4 correct answers out of 10 would hardly be acceptable.
Could AI do our job?
‘In the months I spent experimenting with ChatGPT during my shifts in the emergency department, I learned that ChatGPT is extremely limited and risky as an independent diagnostic tool - but extremely valuable as a tool for explaining complex medical processes to patients,’ concludes Dr Tamayo-Sarver.2
Even beyond direct patient care and emergencies, it is clear that input from experts in the field remains essential for now. A paper in a BMJ-published journal on creating medical literature reviews with ChatGPT likewise concluded that it is not currently suitable for professional or specialty-specific information.6,7
Did you know that AI chatbots can be ‘demented’?
The Mini-Mental State Examination (MMSE) is one of the best-known dementia screening tests, but the Montreal Cognitive Assessment (MoCA) is considered considerably more sensitive for detecting mild cognitive impairment. The currently leading large language models (LLMs) were put through the MoCA, and most showed signs of mild cognitive impairment.8
The older the chatbot version, the more its scores resembled cognitive decline. 'These findings call into question the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment seen in leading chatbots could affect their reliability in medical diagnosis and undermine patient confidence.'8
The study appeared in the Christmas issue of the BMJ, which traditionally features genuine but light-hearted studies and reports. The comparison with the human brain is of course unfair, the authors admit. Nevertheless, they emphasise that the consistent failure of all major language models on tasks requiring visual abstraction and executive function is a significant weakness that could limit their use in clinical settings.9
The authors' conclusion: 'Not only is it unlikely that neurologists will be replaced by large language models in the foreseeable future, but our data also suggest that they may soon have to treat new, virtual patients - artificial intelligence models that exhibit cognitive dysfunction.'8
1. Abu Alrob, M. A. & Mesraoua, B. Harnessing artificial intelligence for the diagnosis and treatment of neurological emergencies: a comprehensive review of recent advances and future directions. Front. Neurol. 15 (2024).
2. Tamayo-Sarver, J. I’m an ER doctor. I think LLMs may shape the future of medicine—for better or worse. Fast Company. https://www.fastcompany.com/90922526/er-doctor-ai-medicine (2023).
3. Senthil, R., Anand, T., Somala, C. S. & Saravanan, K. M. Bibliometric analysis of artificial intelligence in healthcare research: Trends and future directions. Future Healthcare Journal 11, 100182 (2024).
4. Arvidsson, R., Gunnarsson, R., Entezarjou, A., Sundemo, D. & Wikberg, C. ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study. BMJ Open 14, e086148 (2024).
5. Giannos, P. Evaluating the limits of AI in medical specialisation: ChatGPT’s performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open 5, e000451 (2023).
6. Hadzic, A. LinkedIn post: #aiinmedicine #chatgpt #subspecialtymedicine #medicalreview #expertise. https://www.linkedin.com/posts/hadzic-admir_aiinmedicine-chatgpt-subspecialtymedicine-activity-7084148289008230401-2nvS.
7. Wu, C. L. et al. Addition of dexamethasone to prolong peripheral nerve blocks: a ChatGPT-created narrative review. Reg Anesth Pain Med 49, 777–781 (2024).
8. Dayan, R., Uliel, B. & Koplewitz, G. Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis. BMJ 387, e081948 (2024).
9. Barton, E. Almost all leading AI chatbots show signs of cognitive decline. BMJ Group. https://bmjgroup.com/almost-all-leading-ai-chatbots-show-signs-of-cognitive-decline/ (2024).