Item Analysis of Multiple Choice Questions at the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain

Objectives
The current study aimed to carry out a post-validation item analysis of multiple choice questions (MCQs) in medical examinations in order to evaluate correlations between item difficulty, item discrimination and distraction effectiveness so as to determine whether questions should be included, modified or discarded. In addition, the optimal number of options per MCQ was analysed.


Methods
This cross-sectional study was performed in the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain. A total of 800 MCQs and 4,000 distractors were analysed between November 2013 and June 2016.


Results
The mean difficulty index ranged from 36.70-73.14%. The mean discrimination index ranged from 0.20-0.34. The mean distractor efficiency ranged from 66.50-90.00%. Of the items, 48.4%, 35.3%, 11.4%, 3.9% and 1.1% had zero, one, two, three and four nonfunctional distractors (NFDs), respectively. Using three or four rather than five options in each MCQ resulted in 95% or 83.6% of items having zero NFDs, respectively. The distractor efficiency was 91.87%, 85.83% and 64.13% for difficult, acceptable and easy items, respectively (P <0.005). Distractor efficiency was 83.33%, 83.24% and 77.56% for items with excellent, acceptable and poor discrimination, respectively (P <0.005). The average Kuder-Richardson formula 20 reliability coefficient was 0.76.


Conclusion
A considerable number of the MCQ items were within acceptable ranges. However, some items needed to be discarded or revised. Using three or four rather than five options in MCQs is recommended to reduce the number of NFDs and improve the overall quality of the examination.

The current study aimed to carry out a post-validation item analysis of MCQs used in end-of-rotation examinations between 2013 and 2016 at the Department of Paediatrics, Arabian Gulf University (AGU). Based on the item analysis outcomes, recommendations were made as to whether the questions should be retained, modified or discarded from the AGU question bank. In addition, correlations between the difficulty, item discrimination and distraction effectiveness of each item were calculated and the optimal number of options in each MCQ was determined.

Methods
This cross-sectional study was performed in the Department of Paediatrics at AGU and included all MCQ items of paediatric clerkship summative examination papers from November 2013 to June 2016. There were 50 MCQs per paper and four examinations per year, resulting in a total of 800 MCQs and 4,000 distractors. In total, 608 students had taken the examinations during the study period, with an average of 38 students sitting each examination.
Items were only used for summative assessment and were not reviewed with the students at any time. The content and construct validity of the examinations were verified by the Paediatric Examination Committee, which consisted of five content experts and paediatric consultants. Each examination was designed according to a predetermined examination blueprint, ensuring that all essential knowledge and skills were covered based on learning objectives. The post-validation item analysis was performed using the Oracle Database, Version 10g (Oracle Corp., Redwood City, California, USA). The committee discarded existing MCQs based on the item analysis results, flaws in MCQ construction and how frequently an item was used in previous years. The question bank was secured in the assessment office to which only authorised individuals were allowed access via a digital security system. Examinees entered their answers in pencil on a Scantron® optical answer sheet (Scantron Corp., Tustin, California, USA).
The item analysis parameters used in the current study included the difficulty index (DIFI), discrimination index (DI) and distractor efficiency (DE). The DIFI ranged from 0% (i.e. none of the students answered the item correctly) to 100% (i.e. all of the students answered the item correctly). In general, items with a DIFI of <30% were considered difficult, those between 30-70% were considered acceptable and those >70% were considered easy. Kelley's method was used to calculate the DI based on the difference between the scores of high-achievers, classified as the top 27% of test-takers, and low-achievers, classified as the bottom 27% of test-takers. 8 The larger the difference between the high- and low-achieving groups, the higher the DI of an item. The DIs of items ranged from -1 (all and only low-achievers answered correctly) to +1 (all and only high-achievers answered correctly). Items with a DI of ≥0.35 were considered excellent, those between 0.2-0.34 were considered acceptable and those <0.2 were considered poor.
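The two indices defined above can be illustrated with a minimal Python sketch. It is not the procedure that was run in the Oracle database. The function names, the rounding of the 27% group size, the handling of tied total scores (original order is kept) and the treatment of DIs between 0.34 and 0.35 as acceptable are all assumptions made for the example.

```python
from typing import Sequence


def difficulty_index(item_correct: Sequence[int]) -> float:
    """DIFI: percentage of examinees who answered the item correctly (0-100)."""
    return 100.0 * sum(item_correct) / len(item_correct)


def discrimination_index(item_correct: Sequence[int],
                         total_scores: Sequence[float],
                         fraction: float = 0.27) -> float:
    """Kelley-style DI: proportion correct in the top group minus the
    proportion correct in the bottom group (each `fraction` of examinees)."""
    n = len(total_scores)
    k = max(1, round(n * fraction))  # group size; rounding rule is an assumption
    # Rank examinees by total examination score, highest first.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top, bottom = order[:k], order[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom


def classify_difficulty(difi: float) -> str:
    """Cut-offs from the text: <30% difficult, 30-70% acceptable, >70% easy."""
    if difi < 30:
        return "difficult"
    if difi <= 70:
        return "acceptable"
    return "easy"


def classify_discrimination(di: float) -> str:
    """Cut-offs from the text: >=0.35 excellent, 0.2-0.34 acceptable, <0.2 poor.
    Values between 0.34 and 0.35 are treated as acceptable (assumption)."""
    if di >= 0.35:
        return "excellent"
    if di >= 0.2:
        return "acceptable"
    return "poor"


# Hypothetical data: ten examinees, one item (1 = correct, 0 = incorrect).
scores = [50, 45, 44, 40, 38, 35, 30, 28, 25, 20]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
difi = difficulty_index(item)
di = discrimination_index(item, scores)
print(classify_difficulty(difi), classify_discrimination(di))
```

With an average of 38 examinees per paper, this rounding rule would place 10 students in each of the high- and low-achieving groups.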
The DE was calculated based on the number of nonfunctional distractors (NFDs) per item. An NFD was defined as an incorrect MCQ option selected by less than 5% of students. 7 The DE was deemed to be either 0%, 25%, 50%, 75% or 100% if an item had four, three, two, one or zero NFDs, respectively. 9 The reliability of the examination was measured using the Kuder-Richardson formula 20 coefficient (KR-20); this value usually ranges from 0-1, with higher KR-20 values (i.e. closer to 1) indicating greater reliability. A KR-20 value of <0.3 is considered poor and a value of ≥0.7 is considered acceptable. Items with DIFIs of >70% or <30% usually yield a low KR-20 value, as do items with a DI of <0.2. 2,10,11
Data analysis was performed using the Statistical Package for the Social Sciences (SPSS), Version 23.0 (IBM Corp., Armonk, New York, USA). Variables were presented as means ± standard deviations. The linear relationship between DIFI and DI was measured using Pearson's correlation test. A two-way analysis of variance was used to examine differences in DE (dependent variable) according to DIFI (independent variable one) and DI (independent variable two). A P value of <0.050 was considered statistically significant.
The Vice Dean for Academic Affairs at AGU approved this study and allowed access to the examination data. The identities of the students taking the examination were kept anonymous and confidential. No human participants were involved in this study.
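The distractor efficiency and reliability measures described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, KR-20 is computed with the population variance of total scores (some texts use the sample variance instead) and the case of zero score variance is not handled.

```python
from typing import Dict, Sequence, Tuple


def distractor_efficiency(option_counts: Dict[str, int],
                          key: str,
                          threshold: float = 0.05) -> Tuple[int, float]:
    """Return (number of NFDs, DE in %) for one item.

    An NFD is an incorrect option chosen by fewer than `threshold` (5%) of
    examinees. With four distractors, DE is 100, 75, 50, 25 or 0% for zero,
    one, two, three or four NFDs."""
    total = sum(option_counts.values())
    distractors = [opt for opt in option_counts if opt != key]
    nfd = sum(1 for opt in distractors if option_counts[opt] / total < threshold)
    de = 100.0 * (len(distractors) - nfd) / len(distractors)
    return nfd, de


def kr20(responses: Sequence[Sequence[int]]) -> float:
    """Kuder-Richardson formula 20 for a 0/1 response matrix
    (rows = examinees, columns = items):
    KR-20 = k / (k - 1) * (1 - sum(p_j * q_j) / variance of total scores)."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n  # population variance (assumption)
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion answering item j correctly
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / variance)


# Hypothetical item answered by 38 examinees; option "A" is the key.
counts = {"A": 20, "B": 10, "C": 6, "D": 1, "E": 1}
print(distractor_efficiency(counts, "A"))

# Hypothetical 4-examinee, 3-item response matrix.
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```

In the hypothetical item, options D and E are each chosen by one of 38 examinees (about 2.6%), so both fall below the 5% threshold and count as NFDs.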

Results
The mean DIFI of the examinations ranged from 36.70% in 2016 to 73.14% in 2013, with the overall mean DIFI considered acceptable (52.15%). The overall mean DI and DE ranged between 0.20-0.34 and 66.50-90.00%, respectively [Table 1]. Of the total number of items, 48.4%, 35.3%, 11.4%, 3.9% and 1.1% had zero, one, two, three and four NFDs, respectively [Table 2]. It was calculated that using four rather than five options in each MCQ by removing one NFD would result in 83.6% of items having zero NFDs. Using three options and removing two NFDs resulted in 95% of items having zero NFDs.
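The option-reduction figures above follow from the NFD distribution: if one NFD is removed from every item that has one, all items that originally had zero or one NFD end up with none. The short Python sketch below reproduces this arithmetic from the rounded percentages reported in the text. The function name is hypothetical, and the sketch assumes that the options removed are always NFDs where an item has any.

```python
# Percentage of items by number of NFDs, as reported in the text.
nfd_share = {0: 48.4, 1: 35.3, 2: 11.4, 3: 3.9, 4: 1.1}


def share_with_zero_nfds(share_by_nfd: dict, removed: int) -> float:
    """Percentage of items left with zero NFDs after removing up to
    `removed` non-functional options from every item."""
    return sum(pct for nfd, pct in share_by_nfd.items() if nfd <= removed)


for removed in (0, 1, 2):
    options_left = 5 - removed
    pct = share_with_zero_nfds(nfd_share, removed)
    print(f"{options_left} options: {pct:.1f}% of items with zero NFDs")
```

Summing the rounded percentages gives approximately 83.7% for four options and 95.1% for three options, close to the reported 83.6% and 95%; the small difference presumably reflects rounding of the underlying item counts.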
The overall DIFI increased as the number of NFDs increased, with DIFIs of 42.62%, 54.36%, 68.65%, 88.62% and 100% for items with zero, one, two, three and four NFDs, respectively. This finding was observed in the mean DIFI for each year as well. The overall mean DI was almost the same for items with zero, one and two NFDs (0.27, 0.28 and 0.28, respectively), while it was 0.15 and 0 for items with three and four NFDs, respectively. Similar results were observed for the mean DI of each year as well. More than 40% of the MCQs had an acceptable DIFI throughout the study period. The lowest percentage of difficult MCQs was observed in 2013 (7.5%). The highest percentage of difficult MCQs was 31.5%, noted in 2015. There were more easy MCQs in 2013 than in 2015 [Figure 1A]. Item DIs were relatively constant across the study period [Figure 1B].
Approximately half of the items had an acceptable DIFI (53.4%), while the other half were either difficult (20.8%) or easy (25.9%). The DE was directly related to item difficulty, with DEs of 91.87%, 85.83% and 64.13% for difficult, acceptable and easy items, respectively (P <0.005). Items were nearly equally distributed between poor, acceptable and excellent DIs. The DE was 83.24% and 83.33% for items with excellent and acceptable DIs, respectively, compared to 77.56% for items with poor discrimination (P <0.005) [Table 3]. There was a significant dome-shaped correlation between DIFI and DI (r = 0.162; P = 0.010), with the highest DIs occurring in the acceptable DIFI range and decreasing for DIFIs in the difficult range [Figure 2].
Other studies have reported mean DIFIs of 39.4 ± 21.4% and 63.06 ± 18.95%, respectively. 9,13 Karelia et al. reported mean DIFIs between 47.17-58.08% in MCQ items from 10 summative tests. 14 Sharif et al. reported a mean DIFI of 49 ± 31% in 2,445 MCQs. 15 In the basic medical sciences component of a nursing licensure examination, Lin et al. found the DIFI to range from 10-93%, with a mean of 48%. 16 Karelia et al. reported that 61 ± 8.43%, 24 ± 4.08% and 15 ± 7.07% of items in pharmacology summative tests were acceptable, very easy and very difficult, respectively. 14 In the current study, 53.4%, 25.9% and 20.8% of items fell within these same categories. The authors recommend selecting easier MCQs (i.e. those with higher DIFIs) for fundamental topics that students will probably know; moreover, starting the examination with such questions will raise the students' confidence. Similarly, more difficult MCQs (i.e. those with lower DIFIs) should be located nearer the end of the paper in order to discriminate between high- and low-achievers.
With regard to DI, a nearly equivalent percentage of items in the current study were in the poor, acceptable and excellent ranges (31.8%, 33.4% and 34.9%, respectively). Lin et al. reported that 28.8% of MCQ items in the basic medical sciences section had a DI of <0.2. 16 Other studies have reported mean DIs of 0.14 ± 0.19, 0.356 ± 0.17, 0.19 ± 0.30 and 0.33 ± 0.18. 6,9,13,15 Items with poor DIs usually result in low scores due to the use of incorrect answer keys, confusing stems or areas of controversy. 17,18 Such items should be removed from the question bank as they fail to discriminate between strong and weak academic performances.
Constructing plausible distractors and decreasing NFDs is essential to improve the quality of MCQs. 19 Therefore, items may need to be modified if students constantly avoid choosing certain distractors. In the current study, most questions had fewer than two NFDs, with a mean DE of 66.50-90.00%. Other studies have reported a mean DE of 88.6 ± 18.6% and 63.97 ± 33.56%. 9,13 Items with many NFDs reduce both the DE and DI, but increase the DIFI; thus, the item will be easy for the students and act as a poor discriminator of academic performance. In the current study, the DE was significantly higher among difficult items than among acceptable and easy items, as well as significantly higher among items with excellent and acceptable DIs than among those with poor DIs. Difficult items with excellent DE values need to be reviewed for possible language confusion, sufficient subject coverage or inappropriately chosen material according to the student's level of learning. In contrast, easy items with low DE values should be discarded, while items with acceptable DIFI and DE values can be stored and reviewed for improvement. It is often necessary to revise items in which a distractor is selected more often than the correct answer. 20 The number of NFDs also affects DI, in that items with fewer NFDs are associated with acceptable or excellent DIs. The current study found that items with excellent and acceptable DIs had a significantly higher DE than items with a poor DI. The authors recommend discarding items with poor DIs and low DEs, while retaining items with acceptable or excellent DIs and high DEs.
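The recommendations in this paragraph could be encoded as a simple decision rule, sketched below in Python. This is only one possible reading. The text does not state numerical cut-offs for 'low' and 'high' DE, so the 75% threshold, the order in which the rules are applied, the fallback action and the function name are all assumptions made for illustration rather than the committee's actual procedure.

```python
def propose_action(difficulty: str, discrimination: str, de: float,
                   high_de_cutoff: float = 75.0) -> str:
    """Suggest an action for one item.

    difficulty:     "difficult", "acceptable" or "easy" (from the DIFI)
    discrimination: "poor", "acceptable" or "excellent" (from the DI)
    de:             distractor efficiency in %
    high_de_cutoff: hypothetical boundary between low and high DE
    """
    low_de = de < high_de_cutoff
    if discrimination == "poor" and low_de:
        return "discard"   # poor DI together with low DE
    if difficulty == "easy" and low_de:
        return "discard"   # easy item with low DE
    if difficulty == "difficult" and not low_de:
        return "review"    # check wording, coverage and level of material
    if discrimination in ("acceptable", "excellent") and not low_de:
        return "retain"    # acceptable or excellent DI with high DE
    return "revise"        # fallback for remaining combinations (assumption)


print(propose_action("easy", "poor", 50.0))
print(propose_action("difficult", "excellent", 100.0))
print(propose_action("acceptable", "acceptable", 75.0))
```

Placing the discard rules first means that an item flagged on either criterion is never retained, which seems the more conservative choice for a summative question bank.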
In the current study, items with NFDs of zero, one and two had acceptable DIFIs and DIs, while items with NFDs of three and four had higher DIFIs and poorer DIs. Mukherjee et al. reported a similar association, with DIFIs of 32.5%, 51.36%, 71.11% and 87.08% for items with zero, one, two and three NFDs, respectively, in a community medicine assessment; only items with NFDs of one and two had acceptable DIs (0.396 and 0.404, respectively), while items with NFDs of zero and three had poor DIs (0.023 and 0.195, respectively). 21 Items which reflect fundamental knowledge should be retained each year to determine whether all students continue to answer them correctly.
While some may argue that the inclusion of more options in an MCQ reduces the 'guessing effect', others have demonstrated that additional options beyond three do not make much difference; indeed, reducing the list of available responses to three options can actually improve psychometric features. 22,23 Furthermore, it is easier to develop three rather than four or five MCQ options and more effective to have fewer options with a greater number of functional distractors in comparison to increased options and more NFDs. Tarrent et al. suggested including three instead of four options, as such questions require less time to be constructed and the performance for both is equal. 24 A meta-analysis of 80 years of research concluded that three options are optimal for MCQ items, resulting in a reduction in the amount of time required to prepare each MCQ and allowing more questions to be set per examination. 19 In addition, this will increase subject exposure and improve the reliability and validity of the test due to the inclusion of more high-quality items.
According to the dome-shaped correlation between DIFI and DI in the current study, items with DIFIs falling in the difficult or easy categories had significantly poorer DIs. Sim et al. similarly reported that maximum DI values were seen with DIFIs between 40-74%. 25 The reliability coefficient in the current study was 0.76, which is less than excellent but still within the desirable range. 2,10,11
Constructing high-quality MCQs is essential to accurately assess student performance. Overall, for students who know the material covered by the examination, NFDs add little to the performance of a test item; in contrast, increasing the number of distractors decreases the likelihood of students accidentally choosing the correct answer by guesswork. An item analysis of questions is recommended for all examinations in order to continuously update the question bank by keeping items with acceptable indices and revising or discarding others. In the authors' experience, it is usually better to construct an examination with the input of an examination committee in order to improve the quality of the questions. Special training programmes or workshops should be offered to the members of such committees in order to hone their skills in preparing effective MCQs. Further research at AGU is recommended to determine any future improvements in MCQ preparation. Conducting similar studies for examinations in other disciplines at AGU would also be useful.

Conclusion
Item analyses can be valuable to strengthen an MCQ bank in order to ensure the items have an acceptable DIFI, acceptable or excellent DI and high DE. The item analysis of paediatric end-of-rotation examinations at AGU indicated that a considerable percentage of test items had acceptable mean DIFIs and DIs. However, some items needed to be discarded or revised. Using three or four rather than five options in an MCQ is recommended.

Acknowledgements
The researchers would like to thank the Assessment Office at AGU for providing the MCQ database and helping with the item analysis.

Figure 1: Distribution according to (A) difficulty index and (B) discrimination index of multiple choice questions in end-of-rotation paediatric examinations at the Arabian Gulf University, Manama, Bahrain (N = 800).

Figure 2: Scatter plot showing the relationship between difficulty index and discrimination index among multiple choice question items in end-of-rotation paediatric examinations at the Arabian Gulf University, Manama, Bahrain (N = 800).

Table 2: Non-functioning distractors per individual multiple choice question items in the end-of-rotation paediatric examinations at the Arabian Gulf University, Manama, Bahrain (N = 800)
NFDs = non-functioning distractors; DI = discrimination index.

Table 3: Relationship of difficulty index and discrimination index with distractor efficiency, and proposed action, for multiple choice questions in end-of-rotation paediatric examinations at the Arabian Gulf University, Manama, Bahrain (N = 800)
*The DE was significantly different for difficult, acceptable and easy items. †Poor DIs had a significantly lower DE than both acceptable and excellent DIs.

e72 | SQU Medical Journal, February 2018, Volume 18, Issue 1

Discussion
In the current study, out of 16 summative examinations and 800 items, the mean DIFI of individual tests was acceptable. More difficult items (i.e. those with lower DIFIs) mostly occurred in examination papers from 2015 and 2016, while easier items (i.e. those with higher DIFIs) mostly occurred in 2013 examination papers. It is likely that this finding reflects recent improvements in MCQ construction by the AGU Examination Committee. The DIFI results of the current study were comparable to those of other institutions, although relative incentives and test conditions are unlikely to have been the same. Mitra et al. reported mean DIFIs ranging from 64-89% among 12 summative assessments in their foundation programme conducted between 2003 and 2006.
Over the study period, there was a gradual improvement in mean DE from 2013 (70.88%) to 2016 (86.88%); this was likely due to the continuous improvement activities of the AGU Examination Committee. This improvement is also reflected in the annual number of NFDs. Items with zero NFDs increased from 32% in 2013 to 44%, 58% and 59.5% in 2014, 2015 and 2016, respectively, while items with three NFDs decreased from 9% in 2013 to 3.5%, 2% and 1% in 2014, 2015 and 2016, respectively. Items with four NFDs decreased from 4% in 2013 to 0%, 0.5% and 0% in 2014, 2015 and 2016, respectively.