Agreement of Revised Premature Infant Pain Profile Scoring between Healthcare Providers and Laypersons: A Cross-sectional Study
Correspondence Address:
Mr. Ajay Gajanan Phatak,
Professor, Department of Central Research Services Academic Centre Bhaikaka University, Karamsad-388325, Anand, Gujarat, India.
Introduction: The experience of pain during the neonatal period has short and long-term consequences. The Revised Premature Infant Pain Profile (PIPP-R) is a globally accepted and validated tool for assessing pain in neonates. Adequate pain management measures can be implemented using the PIPP-R, even in the absence of consultants.
Aim: To assess the agreement among healthcare providers and laypersons in scoring the PIPP-R.
Materials and Methods: A cross-sectional study was conducted at Shree Krishna Hospital, a rural Tertiary Care Teaching Hospital in central Gujarat, India. The duration of the study was one year and six months, from January 2021 to June 2022. The study included 12 volunteers from various fields, such as consultant neonatologists, first year postgraduate students in Department of Paediatrics, neonatal nurses, social workers, Bachelor of Medicine, Bachelor of Surgery (MBBS) interns, and mothers of newborns. A neonatology consultant provided training on the PIPP-R scoring system using handouts and a presentation. The volunteers then evaluated 100 prerecorded videos of newborns undergoing painful procedures. Agreement between volunteers for the total PIPP-R score and its subcomponents was assessed using Bland-Altman analysis and Cohen’s Kappa statistics.
Results: A total of 100 videos of newborns (51 girls, 49 boys) undergoing painful procedures were evaluated for the PIPP-R score. The mean age, gestational age, and birth weight of the newborns were 2.21±1.55 days, 37±2.44 weeks, and 2.56±0.72 kg, respectively. The procedures included heel prick for Random Blood Sugar (RBS) (44%), intravenous sampling/insertion (34%), and intramuscular vitamin K injection (22%). The mean difference with 95% Confidence Limits (CL) of total PIPP-R scores between the two consultants (neonatologists) was -0.640 (-5.196, 3.916). The length of the CL was 9.112, which exceeded the predefined acceptable length of 4.2 (20% of the total score), indicating unacceptable agreement between the two consultants. Similarly, agreement between each consultant and any of the other participants, including residents, nurses, interns, mothers, and social workers, regarding the total PIPP-R score, as well as its subcomponents, was also deemed unacceptable.
Conclusion: The present study concluded that the inter-rater reliability of the PIPP-R score and its subcomponents was unacceptable between consultants and with any of the other participants.
Keywords: Agreement, Neonates, Painful procedures, Pre-recorded video, Reliability
Preterm babies are delivered prematurely before their anatomy and physiology are capable of sustaining them in the extrauterine environment. In the Neonatal Intensive Care Unit (NICU), infants are subjected to a hostile environment and various tissue damaging procedures as part of their clinical care (1). Pain is a continuous or periodic unpleasant feeling that can be dull, acute, or piercing in nature. Newborns experience pain when they are sick or when they undergo diagnostic and therapeutic treatments (2),(3). On average, babies undergo 14 painful procedures in the first two weeks of life (4). These procedures elicit varying degrees of pain, which can have short-term, as well as long-term consequences (5),(6),(7),(8). Evaluating pain in newborns and young children is more complex and challenging than in adults. There are several techniques and approaches for measuring pain in newborns like Neonatal Infant Pain Scale (NIPS), Neonatal Facial Coding System (NFCS), Neonatal Pain Agitation and Sedation Scale (N-PASS), and PIPP-R, among others (2),(9).
The PIPP-R is a feasible and validated tool for pain assessment in preterm and term neonates. Although the name refers to premature infants, it is validated for neonates with a gestational age of 26-40 weeks. PIPP-R consists of three behavioural, two physiological, and two contextual indicators. Each indicator is assessed on a 4-point scale (0-3). The instrument requires the assessor to evaluate the neonate’s behavioural state and monitor changes in Heart Rate (HR), oxygen saturation, and facial expression as potential markers of pain (10),(11),(12).
The primary objective of all caregivers should be to evaluate the newborn baby’s pain and take steps to minimise it in order to avoid these adverse effects (5). The mother (caretaker) and healthcare providers should be able to interpret the information expressed by the neonate to assess pain. Pain evaluation should be part of a holistic approach to the child’s care, and clinicians and other healthcare professionals must regularly measure pain in real-time. It was hypothesised that PIPP-R scoring has a low learning curve, allowing anyone (volunteer/participant) to master this skill with short training. However, the reliability of PIPP-R between investigators has not yet been examined. Therefore, the current study was carried out with the aim of studying the inter-rater reliability of the total PIPP-R score and its components among healthcare providers/laypersons.
A cross-sectional study was conducted at Shree Krishna Hospital, a rural Tertiary Care Teaching Hospital in central Gujarat, India. The duration of the study was one year and six months, from January 2021 to June 2022. The study was approved by the Institutional Ethics Committee (IEC), registered with the Central Drugs Standard Control Organisation (CDSCO), on 19 November 2020 (IEC/HMPCMCE/2020/Ex.34/279/20). Informed written consent was taken from each volunteer for the study.
Inclusion criteria: Videos of 100 physiologically stable newborns were included for assessment. Two volunteers (participants) were included from each of six groups, namely consultant neonatologists, social workers, MBBS interns, first-year postgraduate students from paediatrics, nurses, and mothers of newborns, to score the PIPP-R from the videos. Thus, a total of 12 volunteers were selected for the assessment of PIPP-R scoring.
Exclusion criteria: Neonates requiring any respiratory support, sedatives, or analgesics, having hypoxic ischaemic encephalopathy, or any congenital anomaly were excluded from the study.
Sample size calculation: Bland JM and Altman DG suggested a minimum of 100 records to provide reasonably stable estimates of the 95% Confidence Interval (CI) for agreement studies (13). Therefore, from the video collections of previous studies, a total of 100 videos of neonates undergoing a painful procedure were selected for the current study.
A training session was conducted for the participants by two consultants from the Neonatology Department, who were co-investigators of the study. The participants were educated about the importance of identifying neonatal pain, the components of the PIPP-R scoring system, and the calculation of the total PIPP-R score using handouts for one hour. The participants then independently assessed 10 videos, and these videos were individually discussed with each participant by the consultant neonatologist to ensure accurate scoring. Any questions or difficulties raised by the participants were addressed. By the end of assessing the 10 videos, all participants were found competent in scoring the videos for PIPP-R. This process took approximately two hours for each participant.
Within 15 days after the completion of the training sessions, five tablets were arranged, each containing 100 pre-recorded videos of newborns who had undergone painful procedures at the Institute. Each participant group (social workers, interns, residents, nurses, and mothers) received one tablet each, except for the consultants. The scoring process of the 100 videos took each participant about three weeks before the tablet was transferred to the next participant in the group. This scoring process was completed in approximately two months. After the tablets were returned by the other participant groups, the consultants were provided with one tablet each. Due to their busy schedules, the consultants took about three months to perform the scoring.
The PIPP-R scoring system includes seven indicators, viz., change in heart rate, decrease in oxygen saturation, brow bulge, eye squeeze, nasolabial furrow, gestational age, and baseline behavioural state. Each indicator was scored on a scale of 0-3. Therefore, the total PIPP-R score ranges from 0-21 (11). The entire procedure of measuring the PIPP-R score was divided into four steps.
• Step 1: Observing an infant at first for 15 seconds just before the procedure to record the highest HR, lowest oxygen saturation, and behavioural state.
• Step 2: Observing an infant for 30 seconds immediately after the procedure to record changes in the form of the highest HR, lowest oxygen saturation, and duration of each facial action.
• Step 3: Scoring for contextual items based on the changes.
• Step 4: Calculating the total score by adding up the scores of all the items.
The total score represents the intensity of pain, with a higher score indicating a higher degree of pain. The agreement among assessors was evaluated for the total score. For clinical decision-making, the total PIPP-R score is categorised as follows: scores of 6 or less generally indicate minimal or no pain, scores between 6-12 are considered mild pain, and scores greater than 12 reflect moderate to severe pain (10). It is important to note that these categories were not used in the present study; agreement was assessed on the total score itself.
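The summation in Step 4 and the clinical categories described above can be illustrated with a minimal Python sketch. The function and key names are hypothetical and are not taken from the study. The per-item cut-offs of the PIPP-R are not reproduced here, so the sketch expects item scores that have already been assigned on the 0-3 scale. Because the stated bands overlap at a score of 6, the sketch assumes that a score of exactly 6 falls in the lowest band.

```python
# Hypothetical item names for the seven PIPP-R indicators listed in the text.
ITEM_NAMES = (
    "heart_rate_change",
    "oxygen_saturation_decrease",
    "brow_bulge",
    "eye_squeeze",
    "nasolabial_furrow",
    "gestational_age",
    "baseline_behavioural_state",
)


def pipp_r_total(item_scores):
    """Add up seven already-assigned item scores (each 0-3) into a total of 0-21."""
    missing = [name for name in ITEM_NAMES if name not in item_scores]
    if missing:
        raise ValueError(f"missing item scores: {missing}")
    for name in ITEM_NAMES:
        if item_scores[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be scored 0, 1, 2, or 3")
    return sum(item_scores[name] for name in ITEM_NAMES)


def pain_category(total):
    """Map a total score to the bands quoted in the text.

    Assumption: a score of exactly 6 is placed in the lowest band,
    since the quoted bands ("6 or less" and "6-12") overlap there.
    """
    if not 0 <= total <= 21:
        raise ValueError("total PIPP-R score must lie between 0 and 21")
    if total <= 6:
        return "minimal or no pain"
    if total <= 12:
        return "mild pain"
    return "moderate to severe pain"


# Illustrative (made-up) scores for one video.
example = dict.fromkeys(ITEM_NAMES, 1)
example["brow_bulge"] = 3
print(pipp_r_total(example), pain_category(pipp_r_total(example)))
```

As noted above, the study itself compared raters on the total score and did not use these categories in the analysis.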
Bland-Altman analysis was used to assess agreement among different volunteers (13). It was decided that the mean difference in the total score should be between -1 and +1, and the length of the CL should be within 20% to 25% of the total PIPP-R score. The maximum total PIPP-R score for preterm newborns is 21, and for term newborns, it is 18. Thus, a CL length below 4.2 (i.e., 20% of the maximum PIPP-R score for preterm newborns and 23.33% of that for term newborns) was considered acceptable agreement. For facial expressions, the maximum score is 9, so 20% of that (i.e., 1.8) was considered an acceptable length for the CL. Cohen’s Kappa statistic measures the inter-rater reliability of categorical data (14). Cohen’s Kappa was used to assess agreement on the individual components of PIPP-R between the consultants, as well as between consultants and other volunteers. Kappa values of 0 to 0.20, 0.21 to 0.39, 0.40 to 0.59, 0.60 to 0.79, 0.80 to 0.90, and 0.91 to 1.00 were interpreted as none, minimal, weak, moderate, strong, and almost perfect agreement, respectively. Data were analysed using STATA version 14.2 (StataCorp LLC, Texas, United States of America (USA)).
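The acceptance rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the STATA procedure used in the study: the limits are assumed to be the conventional Bland-Altman limits of agreement (mean difference ± 1.96 SD of the paired differences), the mean difference is assumed to be acceptable when it lies in [-1, +1] inclusive, and the function and parameter names are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(rater_a, rater_b, max_length=4.2, max_bias=1.0):
    """Bland-Altman summary for paired scores from two raters.

    Assumptions (not confirmed by the source):
    - limits are mean difference +/- 1.96 * SD of the paired differences;
    - agreement is acceptable when |mean difference| <= max_bias and the
      length of the limits is below max_length (4.2 for the total score,
      1.8 for the facial expression score in the text).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    length = upper - lower
    return {
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "length": length,
        "acceptable": abs(bias) <= max_bias and length < max_length,
    }


# Made-up total scores for four videos, for illustration only.
result = bland_altman([5, 7, 9, 11], [5, 6, 9, 12])
print(result)
```

For the facial expression score, the same function could be called with `max_length=1.8`.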
A total of 100 recorded videos of newborns undergoing painful procedures were evaluated for PIPP-R scores. The videos included 51 girls and 49 boys. The mean age, gestational age, and birth weight of the newborns were 2.21±1.55 days, 37±2.44 weeks, and 2.56±0.72 kg, respectively. The procedures involved heel prick for RBS estimation (44%), intravenous sampling/line insertion (34%), and intramuscular vitamin K administration (22%). All the assessors were aged 25-40 years. Both consultants had more than a decade of experience in paediatrics, while both nurses had more than five years of experience. In all participant groups except for nurses, mothers, and interns, one male and one female assessor were included. The mean±SD oxygen saturation (SpO2) in percentage (%) and HR at baseline for the babies were 94.61±3.54 and 147.60±18.81 Beats Per Minute (BPM), respectively. The mean±SD values of all the components of PIPP-R, along with the total score, are provided in (Table/Fig 1). The mean difference (95% CI) in PIPP-R scores between the consultants was -0.640 (-5.196, 3.916). This means that the mean difference between consultants in PIPP-R scores of the 100 videos was -0.64 units, and about 95% of the differences were within -5.196 and +3.916 units (Table/Fig 2).
The length of the confidence interval was 9.112, which exceeded the predefined acceptable length of 4.2 (20% of the total score), suggesting unacceptable agreement. The mean difference for agreement between consultants and other assessors ranged from -2.75 to -0.14, indicating that other assessors probably underestimated the pain. Although still unacceptable, the agreement of nurses with consultants was better than that of the other assessors (Table/Fig 3).
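The reported figures for the two consultants can be checked with a short worked calculation. The first two lines use only the reported limits; the last line additionally assumes that the limits were computed as the mean difference ± 1.96 SD of the paired differences, which the source does not state explicitly.

```latex
\begin{align*}
\text{length} &= 3.916 - (-5.196) = 9.112 > 4.2,\\
\text{midpoint} &= \frac{-5.196 + 3.916}{2} = -0.640,\\
\text{SD}_{\text{diff}} &\approx \frac{9.112}{2 \times 1.96} \approx 2.32
  \quad \text{(assuming limits of } \bar{d} \pm 1.96\,\text{SD)}.
\end{align*}
```

The midpoint reproduces the reported mean difference of -0.640, which is consistent with symmetric limits around the mean.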
The agreement of facial expression parameters (combined score of brow bulge, eye squeeze, and nasolabial furrow) of PIPP-R scores between consultants and the rest of the assessors was also found to be unacceptable with a similar trend. Most assessors underestimated the pain compared to consultants, and nurses had better agreement with consultants, though still unacceptable (Table/Fig 4).
Even after categorising behaviour state and changes in SpO2 and HR according to the PIPP-R scoring instructions, a weighted Kappa (with quadratic weights) showed poor inter-rater reliability, although nurses exhibited acceptable Kappa values with consultants (Table/Fig 5).
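A quadratically weighted Kappa for two raters, together with the interpretation bands stated in the methods, can be sketched in Python as follows. This is a minimal illustration and not the STATA routine used in the study. The function names are hypothetical, the item scores are assumed to take the values 0-3, and two choices are assumptions of the sketch: values falling between the stated bands (for example, 0.205) are assigned to the higher band, and negative values are reported as "none".

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, categories=(0, 1, 2, 3)):
    """Cohen's Kappa with quadratic disagreement weights for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty lists")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter(zip(rater_a, rater_b))  # joint counts per score pair
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for ci in categories:
        for cj in categories:
            weight = ((index[ci] - index[cj]) / (k - 1)) ** 2
            num += weight * observed[(ci, cj)] / n
            den += weight * marg_a[ci] * marg_b[cj] / n**2
    if den == 0:
        raise ValueError("Kappa is undefined: both raters used a single category")
    return 1 - num / den


def interpret_kappa(kappa):
    """Label a Kappa value using the bands quoted in the methods.

    Assumptions: negative values are labelled "none", and values in the
    gaps between the quoted bands go to the higher band.
    """
    if kappa <= 0.20:
        return "none"
    if kappa < 0.40:
        return "minimal"
    if kappa < 0.60:
        return "weak"
    if kappa < 0.80:
        return "moderate"
    if kappa <= 0.90:
        return "strong"
    return "almost perfect"


# Made-up item scores from two raters for four videos, for illustration only.
kappa = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3])
print(kappa, interpret_kappa(kappa))
```

Quadratic weights penalise a two-point disagreement four times as heavily as a one-point disagreement, which suits ordered item scores such as 0-3.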
Subtle observations: During the process of assessing heart rate and SpO2, it was observed that in some newborns, after the procedure, the heart rate dropped (21%), and SpO2 increased (17%) according to the consultant’s assessment.
The present study was conducted to assess the agreement between consultants and other healthcare workers, as well as laypersons, for PIPP-R. In the present study, 100 prerecorded videos of newborns undergoing painful procedures were examined by study participants (volunteers). Overall, there was unacceptable agreement between the consultants and the rest of the participants. There are many one-dimensional and multidimensional pain evaluation measures for newborns (15). The PIPP-R, an upgraded version of the original PIPP, is a multidimensional pain assessment instrument. Because the item statements on the scale were reworded to make them more comprehensible, pain evaluation in disadvantaged groups is considered more objective, owing to the improved scoring system and the broad range of gestational ages for which the scale may be used (16). However, the results and subtle observations from the current study indicate that recording the PIPP-R is not straightforward for everyone, as the scores did not agree even between the experienced consultants.
There have been a few attempts to assess the concordance between two assessors in the PIPP-R scoring. The reliability of PIPP-R scores, in terms of Intraclass Correlation (ICC), was found to be good when the scoring was performed by two competent nurses (16). Similar findings were reported when three specialists assessed the PIPP-R scores (17). Another validation study also reported very high ICC among three nurses for PIPP-R scores (18). These studies indicated that within a subspecialty, PIPP-R is a reliable tool in terms of ICC. In contrast, the agreement between two consultants was found to be unacceptable in the current study. This discrepancy might be due to the fact that ICC is mathematically equivalent, under certain conditions, to the quadratically weighted Kappa, which is a measure of agreement for ordered categorical data, whereas the present study assessed agreement on the total score using Bland-Altman analysis. There have been a few attempts to check the concordance between different groups (nurses/parents/physicians, etc.) in assessing pain. A study conducted in the paediatric emergency department reported discordance between nurses and parents on the Face Legs Activity Cry Consolability (FLACC) scale in children below four years of age (19). Another study from the paediatric emergency department reported poor agreement between patients and caregivers in pain assessed through the Wong-Baker FACES (WBF) and Faces Pain Scale-Revised (FPS-R) scales (20). These findings corroborate the results of the current study.
Perception of pain and pain scores may vary from person to person based on their previous experiences and their relationship with the patients (19),(21). Zhou H et al. conducted a meta-analysis of 12 studies investigating the association between self-reported pain ratings for dyads consisting of a child and parent, a child and nurse, and a parent and nurse. They concluded that assessments of children’s pain by nurses and parents provide rough estimates rather than an accurate reflection of what children are actually experiencing (21). In the present study, the authors found that the assessment of changes in heart rate and SpO2 did not agree between two individuals, even among the consultants. Although changes in physiological markers are detected in newborns undergoing painful procedures, it is doubtful whether they accurately reflect pain, as they result from sympathetic nervous system activation and may represent general discomfort rather than specific pain. These markers are also reported in response to non-painful stimuli, making it challenging to interpret them solely as indicators of pain. Nevertheless, they are recognised as objective markers in composite pain measurements (22).
Participants observed that while assessing the PIPP-R score, the assessor has to simultaneously focus on multiple parameters, including behaviour, identifying maximum heart rate and minimum SpO2 before and immediately after the procedure, all within a strict time-bound manner. Placing the pulse oximetry probe is necessary to record pulse rate and SpO2. Due to the painful procedure, newborns often move their hands and legs, causing changes in the waveform and heart rate on all types of SpO2 monitors, including Masimo pulse oximeters. This reduces the accuracy of the PIPP-R score. The current study was based on video assessments, allowing the ability to replay the video to carefully evaluate the individual components, which might be very difficult in real-time assessments.
An alternative to PIPP-R could be the use of simpler scales that contain fewer components, have better inter-rater reliability, and are less time-bound (21). The Neonatal Infant Pain Scale (NIPS) is a multidimensional pain scale designed for use in newborns. It contains indicators for facial expression, crying, breathing pattern, arm and leg movements, and level of arousal, as well as one physiological signal (23). The NIPS can be considered a reliable, valid, and clinically relevant instrument with high practical importance (24). Oliveira NRG et al. assessed the correlation, internal consistency, and reliability between two experts in physical therapy with extensive technical experience in neonatology in assessing pain using NIPS and PIPP-R. They found high internal consistency for NIPS (r=0.824) and moderate internal consistency for PIPP-R (r=0.655) (25). Similarly, Bellieni CV et al. assessed the agreement of NIPS and PIPP between three nurses and found that NIPS had better interobserver reliability than PIPP (26).
Most neonatal pain assessment scales assess babies’ facial expressions, although some also include elements like crying, limb movement, and vital indicators. Real-time pain assessment requires dynamic nursing monitoring rather than an instantaneous operation. As a result, frequent pain assessment is time-consuming and labour-intensive. The results can be influenced by various factors, including subjective differences between observers, interruptions from other clinical procedures, a lack of time, gender differences, neonatal activity interference, etc. (27),(28). Therefore, another alternative could be to utilise Artificial Intelligence (AI) as a neonatal pain-expression-recognition technology. The automated detection of newborn pain expressions has progressed from static photos to dynamic videos and from theoretical research to system implementation, making AI-based Neonatal Pain Assessment (AI-NPA) possible. AI-NPA may compensate for the inadequacies of onsite NPA performed by medical staff, and it may offer the benefits of simplicity and efficiency. However, to create a model with strong anti-interference capabilities and great resilience for real-world data, AI-NPA requires a huge amount of precisely classified neonatal pain data. Cheng XC et al. developed an automated NPA system for NIPS and found highly consistent readings with onsite measurement (27).
The authors considered the total PIPP-R score, as well as its subcomponents to assess agreement. The sample size is reasonably good for the present study. Videos were assessed so that the evaluators had ample time to do the scoring. Different care professionals were involved in the present study.
The present study was a single-centre study. Purposive sampling of assessors was done to select the volunteers. Having only two participants from each profession may not be representative. The videos were evaluated for agreement rather than real-time bedside assessment.
The agreement between consultants and other healthcare workers, as well as lay persons, was deemed unacceptable in both the total PIPP-R score and facial expression score. Even after considering the subcomponents of the PIPP-R score, such as behaviour state, change in oxygen saturation, and change in HR, the Kappa value was not impressive, confirming poor agreement. Contrary to expectations, the learning curve for PIPP-R scoring appears to be steep, suggesting that persistent efforts and experience are required to master this skill, hence rejecting the hypothesis. This fact is supported by the better agreement observed between experienced nurses and consultants. Conducting multicentric agreement studies utilising different frontline healthcare workers will help strengthen the evidence.
Date of Submission: Feb 27, 2023
Date of Peer Review: Apr 22, 2023
Date of Acceptance: Sep 13, 2023
Date of Publishing: Nov 01, 2023
• Financial or Other Competing Interests: None
• Was Ethics Committee Approval obtained for this study? Yes
• Was informed consent obtained from the subjects involved in the study? Yes
• For any images presented appropriate consent has been obtained from the subjects. NA
PLAGIARISM CHECKING METHODS:
• Plagiarism X-checker: Mar 02, 2023
• Manual Googling: May 12, 2023
• iThenticate Software: Sep 11, 2023 (5%)