Objective
To compare the diagnostic performance of an artificial intelligence deep learning system with that of expert neuro‐ophthalmologists in classifying optic disc appearance.
Methods
The deep learning system was previously trained and validated on 14,341 ocular fundus photographs from 19 international centers. The performance of the system was evaluated on 800 new fundus photographs (400 normal optic discs, 201 papilledema [disc edema from elevated intracranial pressure], 199 other optic disc abnormalities) and compared with that of two expert neuro‐ophthalmologists who independently reviewed the same randomly presented images without clinical information. Area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, and specificity were calculated.
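For illustration only, the sketch below shows how such per-class metrics could be computed for a three-class optic disc classifier; it is not the authors' analysis code, and the array names, label encoding, placeholder data, and use of scikit-learn are assumptions.

```python
# Minimal sketch (not the authors' code) of per-class AUROC, sensitivity and
# specificity for a three-class classifier. All names and data are assumed.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

CLASSES = ["normal", "papilledema", "other"]  # assumed label encoding 0, 1, 2

# y_true: ground-truth labels for the 800 test photographs (placeholder data)
# probs:  (800, 3) predicted class probabilities from the network (placeholder data)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=800)
probs = rng.dirichlet(np.ones(3), size=800)
y_pred = probs.argmax(axis=1)

print(f"Overall accuracy: {accuracy_score(y_true, y_pred):.3f}")

for k, name in enumerate(CLASSES):
    # One-vs-rest AUROC for this class
    auc = roc_auc_score((y_true == k).astype(int), probs[:, k])

    # Sensitivity and specificity from the binarized confusion matrix
    tn, fp, fn, tp = confusion_matrix(
        (y_true == k).astype(int), (y_pred == k).astype(int)
    ).ravel()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    print(f"{name}: AUROC={auc:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```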
Results
The system correctly classified 678/800 (84.7%) photographs, compared with 675/800 (84.4%) for Expert 1 and 641/800 (80.1%) for Expert 2. The system yielded AUROCs of 0.97 (95% CI, 0.96-0.98), 0.96 (95% CI, 0.94-0.97), and 0.89 (95% CI, 0.87-0.92) for the detection of normal discs, papilledema, and other disc abnormalities, respectively. The accuracy, sensitivity, and specificity of the system's classification of optic discs were similar to, or better than, those of the two experts. Inter-grader agreement at the eye level was 0.71 (95% CI, 0.67-0.76) between Expert 1 and Expert 2, 0.72 (95% CI, 0.68-0.76) between the system and Expert 1, and 0.65 (95% CI, 0.61-0.70) between the system and Expert 2.
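The abstract does not name the agreement statistic; the sketch below assumes Cohen's kappa with a nonparametric bootstrap for the 95% CI, which is one common choice for this kind of inter-grader comparison. The grader label arrays and resampling scheme are illustrative assumptions.

```python
# Minimal sketch, assuming inter-grader agreement is Cohen's kappa with a
# bootstrap 95% CI; the original analysis may have used a different method.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
labels_a = rng.integers(0, 3, size=800)  # placeholder: grader A (e.g. the system)
labels_b = rng.integers(0, 3, size=800)  # placeholder: grader B (e.g. Expert 1)

kappa = cohen_kappa_score(labels_a, labels_b)

# Nonparametric bootstrap over eyes for the confidence interval
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(labels_a), size=len(labels_a))
    boot.append(cohen_kappa_score(labels_a[idx], labels_b[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {kappa:.2f} (95% CI, {lo:.2f}-{hi:.2f})")
```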
Interpretation
The performance of this deep learning system at classifying optic disc abnormalities was at least as good as that of two expert neuro‐ophthalmologists. Future prospective studies are needed to validate this system as a diagnostic aid in relevant clinical settings.