ClinMed NetPrints
Warning: This article has not yet been accepted for publication by a peer reviewed journal. It is presented here mainly for the benefit of fellow researchers. Casual readers should not act on its findings, and journalists should be wary of reporting them.


clinmed/2003070002v1 (February 25, 2004)
Contact author(s) for copyright information

Breaking the paradigm: Scores are of no clinical relevance for predicting outcome in abdominal septic shock patients


Jürgen Paetz, doctor of information science, Björn Arlt, master of information science, Katharina Holzer, associate professor of surgery, Rüdiger Brause, associate professor of information science, Albrecht Encke, professor of surgery, Ernst Hanisch, professor of surgery


J. Paetz, Department of Surgery, Klinikum der J.W. Goethe-Universität Frankfurt am Main; B. Arlt, Department of Surgery, University Hospital, University Frankfurt am Main; K. Holzer, Department of Surgery, University Hospital, University Frankfurt am Main; R. Brause, Institute of Information Science, University Frankfurt am Main; A. Encke, Department of Surgery, University Hospital, University Frankfurt am Main; E. Hanisch, Department of General, Visceral and Endocrine Surgery, Asklepios Hospital Langen


Corresponding author: Prof. Dr. Dr. E. Hanisch, Department of General, Visceral and Endocrine Surgery, Asklepios Klinik Langen, Röntgenstraße 20, 63225 Langen, Germany;                        


Grant: This work was supported by Deutsche Forschungsgemeinschaft (HA 1456/7-1,2;


Running head: Scores and outcome prediction in septic shock patients 


Purpose of the study

Scores are widely used for predicting outcome in critically ill patients. This study addresses two questions: (1) How do scores perform in abdominal septic shock patients? (2) Can the performance of scores be improved by using neural network models?

Basic procedures

We compare the outcome prediction performance of a neural network (NN) on different data sets with that of common scores (SOFA, APACHE II, SAPS II, MODS) by ROC analysis in 382 abdominal septic shock patients who were identified and documented prospectively.

Main findings

ICU mortality is 49%. In the first three days of the ICU stay, SOFA, APACHE II, SAPS II and MODS attain AUC values of 0.54 [95% CI 0.41, 0.60], 0.52 [0.41, 0.64], 0.52 [0.46, 0.58] and 0.52 [0.46, 0.59], respectively. NN performs similarly (0.52 [0.46, 0.58]). AUC values are high only in the last three days of the ICU stay (SOFA 0.89 [0.83, 0.96], APACHE II 0.79 [0.70, 0.89], SAPS II 0.85 [0.77, 0.92], MODS 0.88 [0.77, 0.99] and NN 0.88 [0.83, 0.92]). With NN it is possible to attain a performance equivalent to that of the SOFA score using only three variables: systolic blood pressure, diastolic blood pressure and the number of thrombocytes.

Principal conclusion

In patients suffering exclusively from abdominal septic shock, scores are of no clinical relevance for outcome prediction, since AUC values are very low in the early ICU period. A neural network analysis cannot improve this performance. Nevertheless, a neural network trained on data from the last three days of the ICU stay serves as the basis of a new web-based alarm system, which is currently under evaluation in a randomised, prospective multicenter study.

The complete database is fully accessible under  


Key words

Septic shock, scores, neural networks, alarm system

Introduction


Since the description of sepsis by Schottmüller in 1914 [1], knowledge concerning sepsis and its underlying pathophysiology has increased substantially. Epidemiological studies of abdominal septic shock patients show a high risk of sepsis resulting from extensive treatment in the intensive care unit (ICU) [2]. Unfortunately, it has not yet been possible to reduce the mortality of septic shock. It is still as high as 50-60% worldwide, although the results of the PROWESS trial [3] are encouraging.

The heterogeneity of patient groups and the variations in therapy strategies are seen as one of the main problems for sepsis trials. Therefore, commonly available scoring systems are used for comparing critically ill patient groups. Moreover, one of the main objectives of scores is to provide information relevant to predicting outcome. In this study, a group of 382 patients exclusively comprising abdominal septic shock cases was investigated for the first time with the aid of several established scores (SOFA, APACHE II, SAPS II, MODS). In addition, the data were further analysed using a multi-dimensional neural network model.

Patients and methods


The data of 382 patients who met the consensus criteria for septic shock [4,5] were analysed to predict outcome, using most of the commonly documented vital parameters and medication doses (metric variables). Data were collected prospectively in patients from German hospitals from 1998 to 2001. The data from the 382 handwritten patient records were transferred to an electronic database. We used range and plausibility checks to preclude faulty data in the electronic database. Of the 382 patients, 187 died (49%).

Scores



a) SOFA (Sepsis-Related Organ Failure Assessment) [6], [7]: The SOFA score assesses organ dysfunction (respiratory, cardiovascular, renal, liver, neurological) and clotting disorders, each on a whole-number scale of 0 to 4. The sum of these values for the individual organs is designated the SOFA score. Ten variables are needed to calculate the score.

b) APACHE II (Acute Physiology and Chronic Health Evaluation) [8]: APACHE II is a score for outcome prognosis of ICU patients that assesses acute disorders, age and overall health on a whole-number scale of 0 to 71.

c) SAPS II (Simplified Acute Physiology Score) [9]: The SAPS II score is another ICU score using only 13 variables. Originally, SAPS was introduced as a simplified APACHE score.

d) MODS (Multiple Organ Dysfunction Score) [10]: The MODS score assesses organ function (respiratory, liver, renal, heart, neurological) as well as clotting state on a whole-number scale.

The Glasgow Coma Scale (GCS) [11] is not included in the SOFA and MODS scores, since it was not always available. A score was calculated whenever the necessary variables were available.
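As a minimal sketch of how a sum score such as SOFA is assembled from its per-organ sub-scores, the following Python function adds up whole-number values from 0 to 4 and leaves out the neurological (GCS) component, as described above. The component names and the example values are hypothetical, and the mapping from raw measurements to sub-scores (defined in [6]) is not reproduced here.

```python
# Hypothetical component names; the neurological (GCS) component is left out,
# mirroring the study, where it was not always available.
SOFA_COMPONENTS = ("respiratory", "cardiovascular", "renal", "liver", "coagulation")


def sofa_total(sub_scores):
    """Sum the per-organ sub-scores, each a whole number from 0 to 4."""
    total = 0
    for name in SOFA_COMPONENTS:
        value = sub_scores[name]  # raises KeyError if a component is missing
        if not isinstance(value, int) or not 0 <= value <= 4:
            raise ValueError(f"{name} sub-score must be a whole number from 0 to 4")
        total += value
    return total


# Illustrative values only, not taken from the study database.
example = {"respiratory": 2, "cardiovascular": 4, "renal": 1, "liver": 0, "coagulation": 3}
print(sofa_total(example))  # prints 10
```

A missing component raises an error rather than being treated as zero, which corresponds to calculating a score only when the necessary variables are available.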

Neural network

The supervised neural network algorithm [12], used in its modified, improved variant [13], uses the class information of the data in its adaptation process. The outcome labels {survived, deceased} are used as class information in the training procedure of the neural network. This kind of system adapts a non-linear classification to the data by modifying rectangular basis functions. For implementation details, see [14]. The result is similar to a nonlinear regression, but a regression model is not required a priori.


Data sets

The database comprises 382 septic shock patients. The metric data contained in the database consist of daily measurements and medication. Different periods of time were considered for the experiments presented in Fig. 1: F3 (first three days of the ICU stay), S3 (first three days after the occurrence of septic shock), ALL (every day of the ICU stay), D6-8 (days 6, 7 and 8, counted backwards from the last day of the ICU stay, i.e. the last day of the ICU stay is day 0), D2-4 (days 2, 3 and 4, counted backwards from the last day of the ICU stay), and L5, L3, L2, L1 (last 5, 3, 2, 1 day(s) of the ICU stay). Since the results on the admission day and the day after admission were almost random (AUC ≈ 0.5), we used a minimum of three days (S3) for AUC calculation.
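The time periods above can be sketched as index arithmetic on a patient's chronologically ordered daily records. The function name `select_window`, the `shock_day` argument and the list representation are assumptions made for illustration; they are not taken from the study software.

```python
def select_window(days, window, shock_day=None):
    """Return the daily records that fall into a named time period.

    `days` is a chronologically ordered list with one entry per ICU day.
    `shock_day` is the index of the day on which septic shock occurred
    and is only needed for the S3 period.
    """
    n = len(days)
    if window == "ALL":
        return list(days)
    if window == "F3":                       # first three days of the ICU stay
        return list(days[:3])
    if window == "S3":                       # first three days after shock onset
        if shock_day is None:
            raise ValueError("S3 requires shock_day")
        return list(days[shock_day:shock_day + 3])
    if window.startswith("L"):               # last k days, e.g. L3
        k = int(window[1:])
        return list(days[-k:])
    if window.startswith("D"):               # e.g. D2-4, counted from the last day (= day 0)
        first, last = (int(x) for x in window[1:].split("-"))
        start = max(n - 1 - last, 0)         # day `last` from the end
        stop = max(n - first, 0)             # one past day `first` from the end
        return list(days[start:stop])
    raise ValueError(f"unknown window: {window}")


stay = list(range(10))                       # a 10-day stay, days numbered 0..9
print(select_window(stay, "L3"))             # prints [7, 8, 9]
print(select_window(stay, "D2-4"))           # prints [5, 6, 7]
```

For stays shorter than a requested period, the sketch simply returns the days that exist, which may be an empty list.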

Besides the score data sets, the following data sets are taken into account (units are only specified once): 

freq16: (the 16 most frequently measured variables) heart rate [1/min], systolic blood pressure [mmHg], diastolic blood pressure [mmHg], temperature [degrees C], CVP [mmHg], O2 saturation [%], leukocytes [1000/µl], haemoglobin [g/dl], haematocrit [%], thrombocytes [1000/µl], PTT [s], sodium [mmol/l], potassium [mmol/l], creatinine [mg/dl], glucose [mg/dl], urine volume [ml/h],

blood clotting: leukocytes, erythrocytes [1000/ml], haemoglobin, haematocrit, thrombocytes, TPT [%], PTT [s], thrombin time [s], AT3 [%], fibrinogen [mg/dl], total protein [g/dl], glucose [mg/dl], RBC [ml], FFP [ml],

heart: heart rate, systolic blood pressure, diastolic blood pressure, CVP, crystalloids [ml], colloids [ml], adrenaline [µg/kg/min], noradrenaline [µg/kg/min], dopamine [µg/kg/min], dobutamine [µg/kg/min],

lungs: arterial pO2 [mmHg], arterial pCO2 [mmHg], base excess [-], bicarbonate [mmol], O2 saturation, O2 medication [l/min], peak [cmH2O], I:E [-], breathing rate [1/min], FiO2 [%], PEEP [mmHg],

bac: (breathing and catecholamines) FiO2, PEAK, breathing rate, adrenaline, noradrenaline, dopamine, dobutamine,

bpt: systolic and diastolic blood pressure, thrombocytes

and the single variables systolic blood pressure, diastolic blood pressure, thrombocytes.

The main preprocessing steps are [11]: sampling of the data at 24-hour intervals (mean value over 24 h for each variable) and replacement of missing values (by random values drawn within the interquartile range of a suitable normal distribution that was determined separately for every variable). It is important to replace missing values with random values so as not to distort the performance.
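A minimal Python sketch of these two steps follows. It assumes that the "suitable normal distribution" is fitted by the mean and sample standard deviation of the observed values of one variable, and that a replacement is drawn between the 25% and 75% quantiles of that distribution; the function names and the fitting choice are assumptions, not the published implementation.

```python
import random
from statistics import NormalDist, mean, stdev


def daily_mean(measurements):
    """Mean of one variable's measurements within a 24 h interval (None if none)."""
    values = [v for v in measurements if v is not None]
    return mean(values) if values else None


def impute_missing(daily_values, rng):
    """Replace None entries by random draws from the interquartile range
    of a normal distribution fitted to the observed values (assumed fit)."""
    observed = [v for v in daily_values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to fit a distribution")
    mu, sigma = mean(observed), stdev(observed)
    if sigma == 0:
        return [mu if v is None else v for v in daily_values]
    dist = NormalDist(mu, sigma)

    def draw():
        # A uniform probability in [0.25, 0.75] mapped through the inverse CDF
        # yields a draw from the normal distribution restricted to its IQR.
        return dist.inv_cdf(rng.uniform(0.25, 0.75))

    return [draw() if v is None else v for v in daily_values]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(daily_mean([120.0, None, 130.0]))
print(impute_missing([10.0, None, 12.0, 14.0, None], rng))
```

Restricting the draw to the interquartile range keeps replaced values close to typical values of the variable while still avoiding a single constant fill value.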


Experimental conditions and statistics

All (one-dimensional) score samples and the (multi-dimensional) samples of the data sets are classified by the neural network. For the one-dimensional scores, the classification is merely linear using an optimal threshold. Training was done with 50% of the samples and testing with the remaining 50%. Data on training patients was not used for testing (disjoint patient sets). All experiments with one data set were repeated twenty times for robust estimation of mean and standard deviation. The area beneath the ROC curve (AUC) is used to compare classification performance.
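The evaluation scheme described here can be sketched in Python as follows: a random 50/50 split at the patient level, repeated twenty times, and an AUC computed as the probability that a sample of a deceased patient receives a higher output than a sample of a surviving patient. The function names are hypothetical, and the pairwise AUC formula is a standard equivalent of the area beneath the ROC curve rather than the authors' code.

```python
import random


def split_patients(patient_ids, rng):
    """Randomly split patients into disjoint training and test halves."""
    ids = list(patient_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


def auc(outputs_deceased, outputs_survived):
    """Area beneath the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for d in outputs_deceased:
        for s in outputs_survived:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(outputs_deceased) * len(outputs_survived))


rng = random.Random(0)
for repetition in range(20):                 # twenty repetitions per data set
    train_ids, test_ids = split_patients(range(382), rng)
    assert not train_ids & test_ids          # no patient is in both halves
    # A classifier would be trained on train_ids and scored on test_ids here.

print(auc([0.9, 0.8], [0.1, 0.2]))           # prints 1.0
print(auc([0.1, 0.2], [0.1, 0.2]))           # prints 0.5
```

Splitting by patient rather than by daily sample matters because several daily samples belong to the same patient; a sample-level split would leak information from training to testing.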

We calculated 95% confidence intervals (CI) for the AUC values of our neural network analysis, assuming that the AUC values obtained for one data set in repetitions of an experiment are normally distributed. Exploratory statistics (Q-Q plots) indicate that this is a reasonable assumption. If m is the mean of the AUC values and s their standard deviation, a normal distribution N(m, s) is converted into the standard normal distribution N(0, 1) by T(x) = (x - m)/s, cf. [15], p. 109. The inverse transformation is given by T^{-1}(y) = m + s y. The 0.975 quantile of N(0, 1) is y = 1.96 (and the 0.025 quantile is y = -1.96), see [15], pp. 890-891. Therefore, the upper bound UB of the 95% CI is m + 1.96 s and the lower bound LB is m - 1.96 s, resulting in the 95% CI [LB, UB].
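The interval construction above amounts to a few lines of Python. The sketch assumes the sample standard deviation (division by n - 1), which the text does not specify, and the three AUC values are made up for illustration.

```python
from statistics import mean, stdev


def auc_interval(auc_values, z=1.96):
    """Return (LB, UB) = (m - z*s, m + z*s) for repeated AUC estimates."""
    m = mean(auc_values)
    s = stdev(auc_values)   # sample standard deviation (assumption)
    return m - z * s, m + z * s


# Illustrative AUC values from three hypothetical repetitions.
lb, ub = auc_interval([0.88, 0.90, 0.92])
print(round(lb, 4), round(ub, 4))
```

With m = 0.90 and s = 0.02 in this example, the bounds are 0.90 - 1.96 × 0.02 and 0.90 + 1.96 × 0.02.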



Results

An epidemiological overview of the data is given in Table I. The performance of the neural network is now compared with score performance.


Neural network performance

With the data set freq16 (Fig. 1), AUC values of 0.56 (F3) and 0.59 (S3) are achieved. Over the samples of all days (ALL), the AUC is 0.65. The best classification results are achieved when considering only the last day, L1 (AUC = 0.93). Since an outcome prognosis on the last day is not useful for creating an alarm system, we consider L3, which combines a high AUC (0.90) with a three-day prognosis horizon.

In Fig. 2, the area beneath the ROC curve (AUC) is shown for different data sets (last three days of the ICU stay, L3).

Score performance

When considering time period L3 (Fig. 3), the three scores MODS, SAPS II and APACHE II perform differently (AUC = 0.88, 0.85 and 0.79, respectively), with APACHE II performing worst. The SOFA score (AUC = 0.89) yields the best classification. For the first three days (F3), by contrast, the AUC is 0.54 for SOFA, 0.52 for APACHE II, 0.52 for SAPS II, 0.52 for MODS and 0.52 for the neural network.


The confidence intervals for the AUC values of the scores are presented in Table II. The 95% confidence intervals (CI) of the AUC for the scores in Fig. 3 are wide, e.g. the CI ranges from 0.77 to 0.99 for MODS.

The CIs for the neural network results are mostly narrower than those of the scores, e.g. the CI for bpt is [0.83, 0.92].

Result A is significantly higher (95% CI) than result B if the lower bound LB of A's confidence interval is higher than the upper bound UB of B's confidence interval.
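This comparison rule can be written as a one-line check. In the Python sketch below, the function name is hypothetical; the intervals are the ones reported in this article for the scores in the last three days (L3) and for SOFA in the first three days (F3).

```python
def significantly_higher(ci_a, ci_b):
    """True if A's lower bound lies above B's upper bound."""
    lb_a, _ = ci_a
    _, ub_b = ci_b
    return lb_a > ub_b


mods_l3 = (0.77, 0.99)
sofa_l3 = (0.83, 0.96)
apache2_l3 = (0.70, 0.89)
sofa_f3 = (0.41, 0.60)

print(significantly_higher(mods_l3, sofa_f3))     # prints True
print(significantly_higher(sofa_l3, apache2_l3))  # prints False
```

The rule is conservative: overlapping intervals, as for SOFA and APACHE II in L3, do not count as a significant difference even when the AUC values themselves differ.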

For the results of the freq16 data sets (Fig. 1, 1st column of Table II), L1, L2, L3, D2-4 and L5 have significantly higher AUC values than F3, S3 and ALL. The data sets lungs, heart and bpt and the SOFA score have similar CIs and AUC values.


The alarm system created

From 138 patients, we generated an alarm system using neural network results. This system was then applied to analyse the results of the extended group of 382 patients. An alarm signal was triggered whenever the input to the neural network generated a high output for the class "deceased". We obtained the best classification results using the lungs, heart, bpt or SOFA data set (last three days). Since only three variables serve as input in the bpt system, we use the bpt data; this facilitates bedside input by physicians. Fig. 4 shows the alarm percentage for the first three days, for the first and second half of the ICU stay and for the last three days, indicated separately for patients who died and patients who survived. In time periods 1, 2, 3 and 4, there are 34%, 23%, 9% and 7% alarms for surviving patients, respectively, and 41%, 36%, 57% and 72% alarms for deceased patients, respectively (i.e. 1.2, 1.5, 6.5 and 9.9 times more alarms for deceased patients, respectively).

Only alarms deriving from the last three days can be interpreted as false alarms with respect to outcome prediction; 7% were false alarms. On the other days, one cannot establish retrospectively whether the alarms are due to the presence of critical or uncritical states, since patients may pass through numerous critical or uncritical states independent of their outcome. We therefore did not interpret alarms for surviving patients within the other time periods as false alarms: surviving patients may also trigger the alarm when they are in a critical state during their ICU stay. For this reason, alarms for surviving patients can be called "false alarms" only ex post facto.
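The alarm rule and the alarm percentages can be sketched as a threshold on the network output for the class "deceased". The threshold of 0.5, the function name and the example outputs are assumptions for illustration; the article does not state which output level counts as "high".

```python
def alarm_rate(outputs, threshold=0.5):
    """Percentage of daily samples whose 'deceased' output exceeds the threshold.

    The threshold value is a hypothetical choice, not taken from the study.
    """
    if not outputs:
        raise ValueError("no samples in this time period")
    alarms = sum(1 for value in outputs if value > threshold)
    return 100.0 * alarms / len(outputs)


# Made-up network outputs for the daily samples of one time period.
outputs_deceased = [0.9, 0.2, 0.7, 0.1]
outputs_survived = [0.6, 0.1, 0.2, 0.3]

print(alarm_rate(outputs_deceased))  # prints 50.0
print(alarm_rate(outputs_survived))  # prints 25.0
```

Dividing the rate for deceased patients by the rate for surviving patients gives the kind of ratio quoted above for each time period.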


Discussion

Most clinicians can identify patients in septic shock. However, if asked, they will give a hundred definitions [16], despite the fact that consensus conferences might be considered to have resolved this issue [4, 5]. In this paper we use the term "septic shock" stringently. The term "severe sepsis" is deliberately avoided, since we demonstrated in a previous study that "severe sepsis" comprises almost identical patients when considering exclusively abdominal septic shock [17].

Different scoring systems have been developed not only in order to document the severity of the illness, but also to estimate the prognosis of these critically ill patients.

The best outcome predictor would be one that warns the physician on the first day of ICU admission or when septic shock is first manifested (this is usually the second day of the patient's ICU stay according to our analysis). Our results demonstrate that none of the scoring systems achieves this goal. Scores attain acceptable AUC values only in the last three days of the ICU period. The SOFA score based  on ten variables achieves the best AUC of all scores. The data-driven neural network approach performs in a similar way to the SOFA score, using only three variables (bpt).

Although the scores and the neural network under investigation provide relevant outcome prediction information only in the last three days of the patients' ICU stay (i.e. they are without clinical relevance), it is worthwhile to examine the data more closely.

The CIs in Table II show that scores are difficult to apply to individual patients: a score value does not indicate death or survival with high confidence, which results in wide CIs. The neural network results for the non-score data sets are more reliable, since their CIs are usually narrower. The SOFA score has the narrowest CI (width 0.13) of all the scores and is therefore the best score for abdominal septic shock patients; by comparison, the CI width of the NN with the bpt data set is only 0.09. Considering all data sets (e.g. lungs, heart, bpt, freq16), the results show the superiority of neural networks over scores with regard to the reliability of classifying individual patients.
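The interval widths quoted here follow directly from the bounds reported for the last three days (L3). The short Python sketch below recomputes them; the dictionary layout and function name are illustrative only.

```python
# 95% CIs for the AUC in the last three days (L3), as reported in this article.
ci_l3 = {
    "MODS": (0.77, 0.99),
    "SAPS II": (0.77, 0.92),
    "APACHE II": (0.70, 0.89),
    "SOFA": (0.83, 0.96),
    "NN (bpt)": (0.83, 0.92),
}


def ci_width(ci):
    """Width of a confidence interval, rounded to two decimals."""
    lower, upper = ci
    return round(upper - lower, 2)


widths = {name: ci_width(ci) for name, ci in ci_l3.items()}
for name, width in widths.items():
    print(name, width)

scores_only = {k: v for k, v in widths.items() if k != "NN (bpt)"}
print(min(scores_only, key=scores_only.get))  # prints SOFA
```

The widths come out as 0.22 (MODS), 0.15 (SAPS II), 0.19 (APACHE II), 0.13 (SOFA) and 0.09 (NN with bpt), in line with the values discussed above.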

The resulting alarm system based on our analyses generates reliable alarms (during the last three days of the ICU stay, there were ten times more alarms for deceased patients than for survivors).  The alarm system that was trained with data of the last three days represents the patient conditions that lead with a high probability to death or survival. Although the alarm system was trained with data of the last three days, it can be used as an online bedside alarm system. Right from the start of the patients' ICU stay, physicians are warned when patients reach the same critical condition as that reached by deceased patients within the last three days.

In April 2002, a prospective randomised multicentric study was initiated to check the clinical usefulness of the web-based alarm system (see study protocol at

The complete data base is accessible via

References




[1] Schottmüller H. Wesen und Behandlung der Sepsis. Inn Med, 1914;31:257–280.

[2] Hanisch E, Encke A. Intensive care management in abdominal surgical patients with septic complications. In Faist E, ed. Immunological screening and immunotherapy in critically ill patients with abdominal infections. Berlin: Springer-Verlag, 2001;71–138.

[3] Bernard GR, Vincent J-L, Laterre PF, et al. Efficacy and safety of recombinant human activated Protein C for severe sepsis. N Engl J Med, 2001;344:699–709.

[4] Bone RC, Balk RA, Cerra FB, et al. American College of Chest Physicians/Society of Critical Care Medicine Consensus Conference: definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Crit Care Med, 1992;20:864–875.

[5] Levy MM, Fink MP, Marshall JC, et al. 2001 SCCM/ESICM/ACCP/ATS/SIS International Sepsis Definitions Conference. Crit Care Med, 2003;31:1250–1256.

[6] Vincent JL, Moreno R, Takala J, et al. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive Care Med, 1996;22:707–710.

[7] Vincent JL, de Mendonca A, Cantraine F, et al. Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: Results of a multicenter, prospective study. Crit Care Med, 1998;26(11):1793–1800.

[8] Knaus WA, Draper EA, Wagner DP, et al. APACHE II: A severity of disease classification system. Crit Care Med, 1985;13(10):818–829.

[9] Le Gall JR, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA, 1993;270:2957–2963.

[10] Marshall JC, Cook DJ, Christou NV, et al. Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome. Crit Care Med, 1995;23(10):1638–1652.

[11] Jennett B, Teasdale G. Assessment of coma and impaired consciousness: A practical scale. Lancet, 1974;1:81–84.

[12] Huber KP, Berthold MR. Building precise classifiers with automatic rule extraction. Proc. of the IEEE Int. Conf. on Neural Networks, 1995;3:1263–1268.

[13] Paetz J. Metric rule generation with septic shock patient data. Proc. of the 1st IEEE Int. Conf. on Data Mining, 2001;637–638.

[14] Brause R, Hamker F, Paetz J. Septic shock diagnosis by neural networks and rule based systems. In Schmitt M, et al., eds. Computational intelligence processing in medical diagnosis. Heidelberg: Physica, 2002;323–356.

[15] Hartung J. Statistik. 11th ed. Munich: Oldenbourg, 1998.

[16] Feature: Septic shock – Finding the way through the maze. Lancet, 1999;354:2058.

[17] Wade S, Büssow M, Hanisch E. Epidemiology of SIRS, sepsis and septic shock in surgical ICU patients. Chirurg, 1998;69:648–655.

Table I: Epidemiological data of 382 abdominal septic shock patients. Artificial respiration duration was averaged only for patients who were ventilated.

                                all pat.
all patients                    382 (100%)
male patients                   222 (58%)
female patients                 160 (42%)

number of patients
age [years]
ICU stay [days]
artificial respiration [days]
weight [kg]
height [m]



Table II: 95% confidence intervals (CI) for AUC of all data sets.

Data set (Fig. 1)   95% CI          Data set (L3, Fig. 2)   95% CI          Data set (L3, Fig. 3)   95% CI
F3                  [0.50, 0.62]                            [0.70, 0.90]    MODS                    [0.77, 0.99]
S3                  [0.53, 0.65]                            [0.74, 0.89]    SAPS II                 [0.77, 0.92]
ALL                 [0.59, 0.70]                            [0.84, 0.93]    APACHE II               [0.70, 0.89]
D6-8                [0.70, 0.81]                            [0.78, 0.88]    SOFA                    [0.83, 0.96]
D2-4                [0.81, 0.92]                            [0.88, 0.96]
L5                  [0.82, 0.92]                            [0.86, 0.94]
L3                  [0.86, 0.94]    bpt                     [0.83, 0.92]
L2                  [0.86, 0.97]                            [0.76, 0.94]
L1                  [0.87, 0.99]                            [0.86, 0.94]









Figure 1:    Freq16 data: Area beneath ROC curves (AUC) for different periods of time of the ICU stay.


Figure 2:    AUC values for different data sets (last three days of the ICU stay).


Figure 3:    Area beneath ROC curves (AUC) for MODS, SAPS II, APACHE II and SOFA (last three days of the ICU stay).


Figure 4:    Alarm rate in percent for 1) the first three days; 2) the first half of the ICU stay; 3) the second half of the ICU stay; 4) the last three days.

