Google’s Gemini and Bard Pass the Ophthalmology Board Test

April 17, 2024

In a recent paper published in the journal Eye, Canadian researchers evaluated the performance of two Google artificial intelligence chatbots, Gemini and Bard, on ophthalmology board examination practice questions. They found that both tools achieved acceptable accuracy on ophthalmology content, although some variation was noted across different countries.

Background on AI in Medical Assessments

AI chatbots like ChatGPT, Bard, and Gemini are becoming increasingly prevalent in medical environments. Their effectiveness varies across exams and disciplines, with newer versions showing enhanced performance. For instance, while ChatGPT-3.5 had an accuracy rate of up to 64% in certain medical exams, later versions like ChatGPT-4 have demonstrated improvements.

Google's Bard and Gemini, which generate responses based on a wide-ranging cultural and linguistic dataset, are tailored to provide information specific to different countries. This variation in responses across geographies highlights the need for continued research to ensure uniformity, especially in medical contexts where precision is critical for patient safety.

Study Goals and Methods

The objective of this study was to understand the performance of Google Gemini and Bard on a set of practice questions from the ophthalmology board certification exam. Researchers used 150 multiple-choice questions from the "EyeQuiz" platform, which is designed for ophthalmology professionals.

These questions were sourced from various examinations, including the Ophthalmic Knowledge Assessment Program (OKAP) and the American Board of Ophthalmology (ABO) exam. Both chatbots were evaluated on accuracy, clarity of explanations, response time, and response length.
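The paper describes posing these multiple-choice questions to the chatbots and grading the answers against the EyeQuiz answer key. As a hypothetical illustration only, the sketch below shows how such a scoring pipeline could be automated; the MCQ structure, the ask_chatbot stub, and the grading rule are assumptions for illustration, not the authors' actual methodology.

```python
from dataclasses import dataclass
import time

@dataclass
class MCQ:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # correct option letter from the question bank's key
    subspecialty: str         # e.g. "general ophthalmology", "orbital and plastic surgery"

def ask_chatbot(prompt: str) -> str:
    """Placeholder: submit a prompt to a chatbot and return its reply.

    The study used the chatbots' public interfaces; this stub only marks where
    that call would go in an automated re-implementation (an assumption).
    """
    raise NotImplementedError

def evaluate(questions: list[MCQ]) -> dict:
    """Score a chatbot on MCQs, tracking the same criteria the study reports:
    accuracy, response time, and response length."""
    correct, latencies, lengths = 0, [], []
    by_subspecialty: dict[str, list[bool]] = {}
    for q in questions:
        prompt = q.question + "\n" + "\n".join(f"{k}. {v}" for k, v in q.options.items())
        start = time.time()
        reply = ask_chatbot(prompt)
        latencies.append(time.time() - start)
        lengths.append(len(reply.split()))
        # Naive grading rule for illustration: the reply is counted correct if
        # it begins with the correct option letter.
        is_correct = reply.strip().upper().startswith(q.answer.upper())
        correct += is_correct
        by_subspecialty.setdefault(q.subspecialty, []).append(is_correct)
    return {
        "accuracy": correct / len(questions),
        "mean_response_time_s": sum(latencies) / len(latencies),
        "mean_response_words": sum(lengths) / len(lengths),
        "per_subspecialty_accuracy": {k: sum(v) / len(v) for k, v in by_subspecialty.items()},
    }
```

Aggregating results per subspecialty, as in the dictionary above, is what allows the kind of head-to-head comparison reported in the next section.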

Comparative Performance and Results

Both Bard and Gemini responded efficiently to all 150 questions. In the U.S. version of the analysis, both chatbots achieved a 71% accuracy rate. Bard excelled in orbital and plastic surgery, while Gemini was more effective in general ophthalmology and several other sub-specialties.

A secondary analysis using VPNs showed that Bard's and Gemini's performance varied slightly in countries such as Vietnam, Brazil, and the Netherlands, with some questions answered differently than in the U.S. versions.

Conclusions and Future Directions

The study concludes that both the U.S. and Vietnam versions of Bard and Gemini show promising performance on ophthalmology practice questions, though there is notable variability based on the user's location. The potential for these AI chatbots to interpret ophthalmic images is an area ripe for future research, as is their comparative effectiveness alongside human residents in ophthalmology.

This comprehensive evaluation not only sheds light on the current capabilities of AI tools in medical examinations but also sets the stage for further advancements in this rapidly evolving field. Future studies are encouraged to delve deeper into these preliminary findings and expand the scope of AI applications in medicine.

Reference

Mihalache, A., et al. (2024). Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment. Eye. DOI: 10.1038/s41433-024-03067-4, https://www.nature.com/articles/s41433-024-03067-4