Assessing the maintenance processes of a software organization: an empirical analysis of a large industrial project

Andrea De Lucia a,*, Eugenio Pompella b, Silvio Stefanucci a,*

a Department of Engineering, RCOST, Research Centre on Software Technology, University of Sannio, Palazzo Bosco Lucarelli, Piazza Roma, Benevento 82100, Italy
b EDS Italia Software S.p.A., Viale Edison, Loc. Lo Uttaro, Caserta 81100, Italy

The Journal of Systems and Software 65 (2003) 87–103

Received 15 October 2001; received in revised form 26 February 2002; accepted 4 April 2002

Abstract

The use of statistical process control methods can determine whether a process is capable of sustaining stable levels of variability, so that it yields predictable results. This enables organizations to prepare achievable plans, meet cost estimates and scheduling commitments, and deliver the required product functionality and quality with acceptable and reasonable reliability. We present initial results of applying statistical analysis methods to the maintenance processes of a software organization rated at CMM level 3 that is currently planning the assessment to move to CMM level 4. In particular, we present results from an empirical study conducted on the massive adaptive maintenance process of the organization. We analyzed the correlation between the maintenance size and productivity metrics. The resulting models allow the costs of a project conducted according to the adopted maintenance processes to be estimated. Model performance on future observations was assessed by means of a cross validation, which yields a nearly unbiased estimate of the prediction error. Data about the single phases of the process were also available, allowing us to analyze the distribution of the effort among the phases and the causes of variations.

© 2002 Elsevier Science Inc. All rights reserved.

Keywords: Massive software maintenance; Statistical process control; Cost estimation models

1. Introduction

Software organizations have only recently started to appreciate the value of applying statistical process control (SPC) techniques (Florac and Carleton, 1999). Late and over-budget software procurements are well-known, large-scale software problems. The use of SPC methods can determine whether a process is capable of sustaining stable levels of variability, so that it yields predictable results (Florac and Carleton, 1999). This enables organizations to prepare achievable plans, meet cost estimates and scheduling commitments, and deliver the required product functionality and quality with acceptable and reasonable reliability.

The Software Engineering Institute has published a capability maturity model (CMM) that can be used to rate an organization's software process maturity on a five-level scale (Paulk et al., 1991). Moving an organization from one level to the next depends on the capability of the organization's software process to address the key practices that accomplish the goals of the key process areas. The application of rigorous quantitative management and SPC techniques is the main practice to be addressed to meet the goals of the CMM level 4 key process areas. At level 3, metrics are collected, analyzed, and used to control the process and to make corrections to the predicted costs and schedule, as necessary. At level 4, measurements are quantitatively analyzed to control the process performance of the project and to develop a quantitative understanding of the quality of products, in order to achieve specific quality goals. In other words, the emphasis at level 3 of the CMM is on well-established processes and metrics, while the emphasis at level 4 is on statistical control. This provides means for determining process stability and predictability, as well as for quantitatively establishing the process capability to meet criteria for process effectiveness and efficiency: in short, the basis for predicting the process behavior and for making decisions about process improvement.


Organizations achieve control over their products and processes by reducing the variation in their process performance to an acceptable quantitative range. The motivation behind the CMM is that a mature software process will deliver the product on time, within budget, within requirements, and of high quality. Compliance with the CMM is becoming increasingly important in contracting for development and maintenance of US government software. The underlying intent of the CMM stresses adding business value by correctly applying quantitative management concepts and techniques to software processes and products. Therefore, as organizations progress through the "Repeatable" (2) and "Defined" (3) levels of the CMM, the adoption of statistical techniques makes business sense: if an organization is ready and willing to progress from level 3, the higher levels of process maturity (levels 4 and 5) offer the greatest return on process improvement investments.

In this paper we present initial results of applying statistical analysis methods to the maintenance processes of the Solution Center set up in Italy (in the town of Caserta) by EDS Italia Software, a major international software enterprise. Recently, this solution center has achieved CMM level 3 and is currently planning the assessment to move to CMM level 4. Most of the business of the Solution Center concerns maintaining third-party legacy systems. Legacy systems evolve to meet ever changing user needs (Lehman and Belady, 1985). Changes may be driven by market pressure, adaptation to new environments or situations, or improvement needs. During its life-cycle a legacy system is continuously subject to ordinary maintenance, which includes the interventions that must be reactively implemented to keep the system in operation, i.e. corrective maintenance and smaller forms of adaptive and perfective maintenance (IEEE Std. 1219, 1998). Periodically, legacy systems are subject to extraordinary maintenance (De Lucia et al., 2001a; Pigoski, 1997); examples include migration (Brodie and Stonebraker, 1995) and massive adaptive maintenance (Jones, 1999). The size and impact of such interventions require the instantiation of specific projects that might require a considerable effort and staff to be completed in a useful time-to-market and whose risks can be compared to the risks of software development projects. A further complexity element is the fact that extraordinary maintenance projects are generally conducted in parallel with the ordinary maintenance of the legacy system, thus causing synchronization and configuration management problems. In particular, in the last few years a large number of massive adaptive maintenance projects have been conducted all around the world, involving an impressive number of legacy systems (Jones, 1999).

Generally, the reason for massive adaptive maintenance interventions derives from the rise of new requirements, as in the case of the introduction of the EURO currency, or from the invalidation of assumptions made at development time, as in the case of the Y2K problem (Klösch and Eixelsberger, 1999). Typically, the scope of a massive adaptive maintenance intervention is the entire system, although standard solutions identified in advance can usually be applied in most of the impacted points (Lynd, 1997). It is worth noting that maintenance organizations will face new massive maintenance projects in the future (Jones, 1999). Recently, many organizations have been involved in the second phase of the EURO problem. Examples to be faced in the future are the modification of telephone numbers in the USA by 2010, the modification of social security numbers by 2050, and again the Y2K problem, as in most projects the adopted solution still maintains two digits for the year (Lynd, 1997).

In this paper we present results from an empirical study on a large Y2K project conducted by the EDS Solution Center (EDS SC). The aim was to analyze and assess the stability of the adopted massive maintenance process and to provide indications for future projects. We analyzed the correlation between the maintenance size and productivity metrics. The resulting models allow the costs of a project conducted according to the adopted maintenance processes to be estimated. Model performance on future observations was assessed by means of a cross validation, which yields a nearly unbiased estimate of the prediction error. Data about the single phases of the process were also available, allowing us to analyze the distribution of the effort among the phases and the causes of variations. The empirical study has been conducted within the project "Virtual Software Factory", a research project conducted by EDS Italia Software with academic partners aiming at designing and experimenting with a technical and organizational system supporting software development and maintenance in a cooperative networking environment. Within this project several related studies concerning the use of SPC techniques are being carried out to deal with the problems of cost estimation for corrective maintenance processes (De Lucia et al., 2001b), staffing, process management and service level evaluation (Di Penta et al., 2001), and to analyze the estimation of staff, duration, and communication between distributed and collocated projects (Bianchi et al., 2001).

The paper is organized as follows. Section 2 discusses related work. Sections 3 and 4 present the massive maintenance process adopted by EDS SC and an overview of the analyzed project.


Section 5 presents the results of analyzing the correlation between the maintenance size metrics and effort, while Section 6 analyzes the stability of the process through the distribution of the effort among the phases. Concluding remarks and lessons learned are outlined in Section 7.

2. Related work

SPC includes a set of techniques and tools which help characterize patterns of variation. By understanding these patterns, a business can determine the sources of variation and minimize them, resulting in a more consistent and robust product and service. Regression analysis is an intermediate technique used mainly for the estimation of effort and schedule from size measures (Wheeler and Chambers, 1991).

A number of research papers have been published describing the development and evaluation of formal models to assess and predict software maintenance task effort. However, in our experience few of them are concerned with massive adaptive maintenance projects. Effort estimation is a crucial part of, and a key factor for, successful software project planning and management. This is demonstrated by the variety of modeling approaches proposed in the context of software development (Boehm, 1981; Wellman, 1992), ranging from decomposition techniques (top down, bottom up), analogy based estimation, and algorithmic approaches (statistical, theoretical) to hybrid or mixed approaches combining algorithmic methods and expert judgment. Stensrud and Myrtveit (1999) demonstrate that human performance improves when a tool based either on analogy or on regression models is available. However, their work is not generalized to maintenance activities. Software development and software maintenance are two different activities with different characteristics, because the focus of software maintenance is the change of existing software and not the creation of new software.

Analogy models (Shepperd et al., 1996) are based on historical data and previous project experience: the current software project is compared with one or more previous similar projects carried out in the same organization to relate their costs. They can be viewed as a procedure for encapsulating previous experience to produce a historical maintenance database from which the appropriate information is extracted. The tool Angel (Shepperd et al., 1996) aims to automate the search for analogies: similar cases are found by evaluating their distance in terms of mathematical norms, e.g. the Euclidean distance.

Algorithmic approaches involve the construction of mathematical models from empirical data following a well defined procedure. Boehm (1981) presents one of the first approaches to estimate maintenance effort: he extends his COCOMO model for development costs to the maintenance phase through a scaling factor.


This factor, named annual change traffic, is an estimate of the size of changes, expressed as the fraction of the software's total LOC which undergoes change during a typical year. Other work extends the COCOMO model introducing new metrics and factors (Boehm et al., 1987). Smith et al. (2001) analyze the impact of four task assignment factors on software development effort and derive an estimation model using a smaller number of factors; Hastings and Sajeev (2001) propose a vector size measure which is used to estimate the development effort, after its substitution in Boehm's cost model equation. Belady and Lehman (1972) propose a model for estimating the effort required to evolve a system from one release to the next. Their model has two terms accounting, respectively, for progressive activities, enhancing system functionality, and anti-regressive activities, compensating negative effects of system evolution (Lehman and Belady, 1985). Ramil (2000) uses historical data about different versions of the same application to calibrate effort models and assess their predictive power. The best linear regression models extracted are based on coarse-grained metrics (number of added, updated, and deleted sub-systems). Lindvall (1998) also shows that coarse-grained metrics, such as the number of classes, predict changes more accurately than finer-grained metrics. On the other hand, Niessink and van Vliet (1997, 1998) use regression analysis and a size measure based on Function Points to predict maintenance effort. Their results show that the size of the component to be changed has a larger impact on the effort than the size of the change. Eick et al. (2001) consider software system evolution to define code decay and a related number of measurements. These indexes are used to produce a prediction model quantifying the effort required to implement changes. Jorgensen (1995) compares the prediction accuracy of different models using regression, neural networks (Hertz et al., 1991), and pattern recognition approaches (Briand et al., 1992). In particular, the last approach involves the decomposition of a previous experience data set to select those data that are most similar to the current case and are consequently a good starting point for the estimate. All the models assume the existence of a size estimate, measured as the sum of added, updated, and deleted LOC during a maintenance task, that can be accurately predicted and is sufficiently meaningful. He uses an industrial data set consisting of various maintenance tasks from different applications within the same organization. Caivano et al. (2001) present a method and a tool for the estimation of the effort required in software process execution. They validate it on a legacy system renewal project, show that fine granularity and model recalibration are effective for improving model performance, and verify that the estimation model is process dependent.


Relative effort distribution between different phases is a useful method for characterizing and examining the evolutionary trends of a software process. It permits fast estimates of the effort of a single process phase as a relative portion of the whole effort. However, to our knowledge there are not many works analyzing the effort distribution among the phases of a software maintenance process. An interesting work by Mattsson (1999) presents quantitative effort data from a six-year object-oriented application framework project. The study analyzes the effort distribution per phase and version of the framework, identifying evolutionary trends and giving quantitative support for the claim that framework technology delivers reduced application development effort. Joshi and Kobayashi (1995) developed two process models for textual and graphical HDL-based design in order to measure HDL-based design productivity. They measured the effort required for each design activity and analyzed the effort distribution over the various design activities. Basili et al. (1996) present the results of a case study to understand and estimate the cost of maintenance releases of software systems. An incremental approach is used to better understand the effort distribution of releases and to build a predictive effort model for software maintenance releases. Moreover, the study provides a set of lessons learned about the maintenance program.

3. The massive adaptive maintenance process

In this section we describe the massive adaptive maintenance process adopted by EDS SC for the analyzed project. Since 1996 EDS SC has conducted several Y2K and EURO projects. Most of the projects started quite early and were completed well in advance of the final deadline. The experience of these projects was used to assess and improve the process in view of a large Y2K project committed later, which only started at the beginning of 1999. Such a project required a tremendous effort under a very close deadline. To this aim EDS SC adopted a process based on a preliminary assessment phase aiming to decompose an application portfolio into loosely coupled sets of cohesive components, called work-packets, to be independently and incrementally modified and delivered. The decomposition of applications into work-packets opens the way to incremental massive adaptive maintenance processes and presents a number of advantages:

• the project risks are reduced, as failures are restricted to a single work-packet, and are more reliably manageable, as they are spread among the work-packets;
• the effort can be distributed among different parallel teams working on different sites, as the tremendous effort required to complete the project and the close deadline make it impossible to staff the project on a single site;
• a specific sub-project with a relatively small team and a short time to delivery can be instantiated for each work-packet. This approach reduces the freezing zone (1) of a modified software component, as all the components of a work-packet can be put in operation once the massive adaptive maintenance process for the work-packet is completed, and makes it easier to estimate the total staff required for the entire project (to be decomposed into several teams), depending on the schedule constraints;
• for large projects, the identification of work-packets can proceed incrementally, while already identified work-packets are being processed; this allows work-packets to be converted and delivered as soon as they are identified, thus immediately testing the results of the conversion process in the customer environment.

Overall, the massive adaptive maintenance process used by EDS SC is based on a waterfall model composed of six phases: portfolio inventory and assessment, analysis, design, implementation, testing, and delivery/installation. Table 1 shows the inputs, the activities, and the outputs of each phase of the maintenance process. It is worth noting that testing activities such as test planning and test design are performed at analysis and design time, respectively, although their costs are imputed to the testing phase.

The first phase of the process is the portfolio inventory and assessment: the application portfolio is analyzed and decomposed into applications and work-packets. Typically, this phase is conducted by software analysts together with the different application owners, i.e., technical personnel of the customer organization who have business and technical knowledge of the applications to be maintained. Questionnaires and interviews are used, together with analysis of the available documentation, to identify the components belonging to the different applications. Cohesion and coupling metrics are also used to decompose applications into work-packets. A consistency check based on dependency analysis is required to verify whether all the components of a work-packet have been correctly included (completeness) and whether any component is not relevant to the work-packet or is also relevant to other work-packets (a check of this kind is sketched below). Components that are relevant to more than one work-packet (e.g., a copy component) give rise to specific schedule and configuration management problems and may require the coupled work-packets to be adapted by parallel teams and then integrated together.

(1) We call freezing zone of a modified software component the time period in which the component is not subject to ordinary maintenance. During this period the version of the component in operation can only undergo emergency corrective maintenance.
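To make the completeness and consistency checks concrete, the following is a minimal sketch, not the tool actually used by EDS SC; the function name and the two input maps are illustrative assumptions. It flags dependencies of a work-packet's components that are missing from the inventory, dependencies that live only in other work-packets, and components shared by several work-packets.

```python
# Illustrative sketch of a work-packet completeness/consistency check based on
# dependency analysis; data structures and names are assumptions, not EDS SC tooling.
from typing import Dict, Set


def check_work_packets(work_packets: Dict[str, Set[str]],
                       depends_on: Dict[str, Set[str]]) -> None:
    # Map each component to the work-packets that contain it.
    owner: Dict[str, Set[str]] = {}
    for wp, components in work_packets.items():
        for component in components:
            owner.setdefault(component, set()).add(wp)

    for wp, components in work_packets.items():
        deps = {d for c in components for d in depends_on.get(c, set())}
        # Completeness: dependencies never inventoried in any work-packet.
        missing = {d for d in deps if d not in owner}
        # Coupling: dependencies that belong only to other work-packets.
        external = {d for d in deps if d in owner and wp not in owner[d]}
        print(f"{wp}: missing={sorted(missing)} external={sorted(external)}")

    # Components relevant to more than one work-packet (e.g., shared COPY members)
    # need coordinated scheduling and configuration management.
    shared = {c: sorted(wps) for c, wps in owner.items() if len(wps) > 1}
    print("shared components:", shared)
```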


Table 1
Massive maintenance process description

Portfolio inventory and assessment phase
  Input: Application portfolio; Base solution strategy (e.g., windowing approach for Y2K conversion)
  Activity: Analysis of the software application portfolio
  Output: Application portfolio inventory report

  Input: Application portfolio; Base solution strategy; Application portfolio inventory report
  Activity: Hierarchical decomposition of the system (the software system is decomposed into applications, each application into clusters, each cluster into work-packets)
  Output: Functional decomposition diagram (including work-packet decomposition)

  Input: Base solution strategy; Naming conventions (defined by the SQA)
  Activity: Base search criteria definition (needed for identifying the estimated impact set)
  Output: Base search criteria

  Input: Work-packet; Base solution strategy
  Activity: Identification and location of the work-packet software components
  Output: Work-packet software component decomposition; Baseline input table

  Input: Work-packet software components; Baseline input table
  Activity: Consistency check; Completeness check; Formal delivery of the programs to the maintenance team
  Output: Configuration audit report; Maintenance team delivery report

Analysis phase
  Input: Base search criteria; Naming conventions
  Activity: Definition of the search criteria specific for the current work-packet
  Output: Work-packet search criteria

  Input: Work-packet software components; Work-packet search criteria
  Activity: Identification of the estimated impact set
  Output: Work-packet estimated impact set

  Input: Solution strategy; Work-packet estimated impact set
  Activity: Definition of guidelines for the modification strategy
  Output: Modification strategy guidelines

  Input: Work-packet search criteria; Base solution strategy
  Activity: Test plan
  Output: Test plan document

Design phase
  Input: Work-packet estimated impact set; Work-packet software components with candidate impacts
  Activity: Identification of the actual impact set
  Output: Work-packet actual impact set

  Input: Work-packet actual impact set; Modification strategy guidelines; Base solution strategy
  Activity: Production of the technical specifications for the conversion of the actual impacts; when needed, production of ad hoc solution strategies
  Output: Work-packet technical specifications

  Input: Test plan document; Work-packet technical specifications
  Activity: Test design
  Output: Test design specifications (including test case and test procedure specifications)

Implementation phase
  Input: Work-packet technical specifications; Work-packet software components with actual impacts
  Activity: Modification of the impacted software components
  Output: Modified programs

  Input: Modified programs
  Activity: Compilation of the programs
  Output: Compiled programs

Testing phase
  Input: Compiled programs; Test design specifications
  Activity: Test execution
  Output: Tested programs; Test summary report

  Input: Modified software components; Baseline input table
  Activity: Production of reports containing information about modified work-packet software components
  Output: Baseline output table

Delivery/installation phase
  Input: Baseline input table; Baseline output table; Modified software components
  Activity: Completeness check on the baseline output table; Consistency check between baseline input and output tables; Formal delivery of the modified software components of the work-packet; Installation of the modified software components
  Output: Configuration audit report; Delivery and installation report

In this phase the base search criteria for identifying the estimated impact set (Arnold and Bohner, 1993) and the standard solution strategies are also defined. Indeed, most of the impacts of an adaptive maintenance intervention can typically be solved using a pre-defined standard solution pattern. The remaining phases of the adaptive maintenance process are conducted for the different work-packets in a relatively independent way.

The analysis phase consists of identifying the estimated impact set based on the criteria defined in the inventory phase. Typically, the starting impact set is identified first, consisting of the program points directly impacted by the maintenance request; then the estimated impact set is constructed by augmenting the starting impact set with the points impacted by ripple effects (Arnold and Bohner, 1993). In this phase guidelines for a possible modification strategy are also defined. The test plan document for the work-packet software components is also produced at analysis time.

The main goal of the design phase is to identify the actual impact set (Arnold and Bohner, 1993) of the work-packet and to define the solution strategies to be adopted. All impacts in the estimated impact set are analyzed and classified; candidate impacts that do not represent actual impacts are discarded. Each actual impact requires a change to the source code: typically, for each actual impact a standard solution strategy is selected; if none of the available standard solution templates can be applied to an actual impact, an ad hoc solution strategy has to be defined. Nonstandard impacts increase the work-packet modification costs; however, they usually constitute a low percentage of the actual impacts. The test design specification document is also produced during the design phase.

In the implementation phase the work-packet is modified by applying to each impact the corresponding solution strategy. The testing phase is performed at different levels. At the unit level tests are executed on single units; these can be programs, but for changes without ripple effects it can be more efficient to extract code fragments located around the impact by program slicing (Tip, 1995; Weiser, 1984) and test them using suitable drivers.

This strategy, which we call isolation testing, is particularly efficient in the case of standard impacts, as the drivers can be constructed from a pre-defined standard template. The final phase of the adaptive maintenance process is delivery and installation. Whenever the version in operation underwent some emergency corrective maintenance during the work-packet freezing zone, the new version of the work-packet is changed accordingly before delivery.

4. The analyzed project

The process defined in the previous section was applied by EDS SC in a Y2K remediation project for a large application portfolio composed of about 40,000 software components, including COBOL and PL/1 programs, COPY components, ASSEMBLER programs, CICS maps, data base description components, JCL procedures, and so on. About 15,000 components were modified, including 7082 programs and 6850 JCL procedures. The project started on 2 January 1999 and finished on 14 January 2000. The total number of work-packets was 123, the total effort spent was 457 man-months, and the average staff was 146 people. The maximum peak, reached in March 1999, was 179 people; altogether, 253 different people were employed. The project was conducted by maintenance teams distributed on three different sites and has been constantly held under control through the systematic application of appropriate practices, including the Internal Quality Audit conducted by the Software Quality Assurance department.

Table 2
Work-packet maintenance size metrics

  Metric   Description
  SC       Number of software code components
  CI       Number of candidate impacts
  AI       Number of actual impacts
  SAI      Number of standard actual impacts
  NSAI     Number of nonstandard actual impacts
  TC       Number of test cases


Due to the short time available to complete the project, all maintenance teams included skilled people with medium and high experience on Y2K remediation projects.

Several maintenance size metrics were collected during the different phases of each work-packet conversion sub-project, as shown in Table 2: the number of software components SC; the number of candidate impacts CI (i.e. the size of the estimated impact set); the number of actual impacts AI (i.e. the size of the actual impact set), divided into the number of standard actual impacts SAI (i.e. impacts resolved with a standard technical solution pattern) and the number of nonstandard actual impacts NSAI; and the number of test cases TC. Also, productivity metrics, namely effort, duration, and staff, were collected for each phase of a work-packet conversion sub-project, as shown in Table 3 (it is worth noting that the costs of the portfolio inventory and assessment phase are imputed to the different work-packets identified).

Table 3
Work-packet productivity metrics

  Variable   Description
  Effort     Actual effort measured in man-days
  Staff      Total number of employed maintainers
  Duration   Actual duration measured in number of calendar days

Table 4
Descriptive statistics of the work-packet maintenance size metrics

  Metric   Min   Max    Mean      StdDev
  SC       2     1243   364.43    334.26
  CI       8     6646   1109.84   1490.49
  AI       1     504    40.31     91.28
  SAI      0     461    25.21     76.34
  NSAI     0     122    15.10     22.12
  TC       1     9382   702.14    1876.95

Table 5
Descriptive statistics of the productivity metrics

  Metric     Min   Max   Mean     StdDev
  Effort     3     278   56.654   54.244
  Staff      2     27    6.95     4.375
  Duration   3     127   27.925   20.465


For assessment purposes, during the portfolio inventory phase different metrics were collected for the programs of a sample subset of 16 work-packets (about 10% of the system), including number of LOC and complexity metrics. To give an idea of the size of the system in terms of LOC, the size of this subset of work-packets was about 1,350,000 LOC.

It is worth noting that not all the values of the selected metrics were available for each work-packet, probably due to the fact that the measures were manually recorded. We therefore adopted a row-wise deletion method to align the data set: we had to exclude from our analyses the data of about 40 work-packets presenting missing values. The number of work-packets our analysis can rely on is still considerable and constitutes about 70% of the project data. Tables 4 and 5 show the descriptive statistics for the collected metrics. On average, a work-packet involves the instantiation of a sub-project requiring a total staff of 7 maintenance programmers, an effort of 56 man-days, and a duration of 28 days. The smallest sub-project was accomplished in 3 days by 2 maintenance programmers (a team leader and a team member), while the largest one required a team of 27 people, an effort of 278 man-days, and was completed in 127 days. Table 6 shows the values of the metrics for sample work-packets converted by teams of different sites.

Table 6
Examples of work-packet metric values

         SC     CI    AI    SAI   NSAI   TC    Effort   Staff   Duration
  WP-1   575    246   21    12    9      292   108      12      30
  WP-2   1239   710   115   56    59     993   154      17      60
  WP-3   319    330   22    8     14     206   83       3       40

The number of software components SC is a good metric for the size of a work-packet. Its quantification is not particularly difficult, because it is a coarse-grained metric consisting of an enumeration and classification of all the programs belonging to a work-packet. This is likely the reason why this metric presents a smaller number of missing values than the other maintenance size metrics. Its large range of values is a consequence of the technique used for the decomposition of the application portfolio into work-packets. Indeed, the goal of the decomposition was to obtain highly decoupled work-packets that can be independently and simultaneously converted. Therefore, the decomposition caused a high variance in the size of the work-packets, producing both small work-packets (with a very low number of software components SC) and large work-packets (see Table 4). The high variance in the distribution of the work-packet size is propagated to the subsequent phases of the maintenance process, despite the lack of a high correlation of SC with the other metrics, as shown in Table 7. This lack of correlation is due to the characteristics of the different work-packets: in fact, a work-packet often includes business applications that implement a well defined functional application area of the software system, which can have a high or low dependency on the use of date fields, independently of the work-packet size.
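As an illustration of the data preparation and of how statistics like those in Tables 4, 5 and 7 can be obtained, the following sketch assumes the work-packet measures are available in a CSV file with the column names of Tables 2 and 3; the file name and layout are assumptions, not the project's actual data store.

```python
# Sketch of row-wise deletion and descriptive/correlation statistics with pandas;
# "work_packets.csv" and its column names are assumed for the example.
import pandas as pd

size_metrics = ["SC", "CI", "AI", "SAI", "NSAI", "TC"]
productivity = ["Effort", "Staff", "Duration"]

df = pd.read_csv("work_packets.csv")
df = df[size_metrics + productivity].dropna()        # row-wise deletion

print(df[size_metrics].agg(["min", "max", "mean", "std"]).T)    # Table 4 analogue
print(df[productivity].agg(["min", "max", "mean", "std"]).T)    # Table 5 analogue
print(df[size_metrics + ["Effort"]].corr().round(2))            # Table 7 analogue
```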


Table 7
Metrics correlation matrix

           SC     CI     AI     SAI    NSAI   TC     Effort
  SC       1.00   0.51   0.44   0.36   0.68   0.40   0.76
  CI       0.51   1.00   0.66   0.58   0.81   0.67   0.68
  AI       0.44   0.66   1.00   0.99   0.77   0.98   0.70
  SAI      0.36   0.58   0.99   1.00   0.66   0.98   0.63
  NSAI     0.68   0.81   0.77   0.66   1.00   0.75   0.81
  TC       0.40   0.67   0.98   0.98   0.75   1.00   0.67
  Effort   0.76   0.68   0.70   0.63   0.81   0.67   1.00

It is worth noting that the number of candidate and actual impacts was also largely independent of the program size (LOC), and most of the candidate impacts were not actual impacts. For this reason program size and complexity metrics were not considered important factors affecting productivity and were not collected for all the work-packets. Indeed, previous studies (Lindvall, 1998; Ramil, 2000) demonstrated that coarse-grained metrics are more accurate than finer-grained metrics for building effort models and assessing their predictive power. According to these studies, we also expect that the number of software components influences the effort of a work-packet, as software components need to be classified during the inventory phase and analyzed during the analysis phase to identify candidate impacts through the use of automatic tools (see the correlation between the number of software components SC and the effort in Table 7).

The number of candidate impacts CI is the most important metric affecting the effort of the design phase, where candidate impacts are manually analyzed to identify actual impacts. For this metric we found the highest number of missing values; probably this was because the metric does not affect the latest phases of the massive maintenance process (implementation and testing), and hence maintenance programmers often did not pay much attention to it. We also found a significant correlation of CI with the total effort spent on a work-packet (see Table 7), probably because the design phase is the most human-intensive phase and the one least supported by automatic tools.

The actual impacts AI and its components SAI and NSAI are collected and documented better than CI. We can immediately note the high drop from the values of CI to the values of AI. This is due to the fact that the identification of candidate impacts in the analysis phase is conducted in a redundant way, aiming at capturing all the points actually impacted by the new date format: this strategy produces many potential impacts that are not actual impacts. Also, this drop results in a higher relative variance in the distribution of the AI values (see Table 4). We expect that actual impacts are also good predictors for the work-packet effort (see the correlations with the effort in Table 7).

The same considerations can be made for the number of test cases TC; however, we did not use this metric as an effort predictor, due to its high correlation with the number of actual impacts. Finally, it is worth remembering that most of the actual impacts AI are standard, and this explains the higher correlation of the number of standard impacts SAI with AI compared to the correlation between the number of nonstandard impacts NSAI and AI (see Table 7). This is also evidenced by the fact that the distribution of SAI has the same characteristics as the distributions of AI and TC, while the range of the NSAI values is smaller and their distribution presents a lower relative variance.

To avoid ripple effects and reduce the number of impacts, the standard technical solution adopted in the project was a windowing solution (Lynd, 1997). This made it possible to prepare templates in advance and to use automatic tools supporting the implementation and testing phases for standard actual impacts, which constitute the majority of the actual impacts. The size of the change for each standard actual impact is about 10 added/modified/deleted LOC, while for nonstandard actual impacts it ranges between 30 and 80 added/modified/deleted LOC.

5. Effort estimation models

The project described in the previous section was used to analyze and assess the performance of the process and, in particular, to identify the relationships between the costs, as identified by the effort required to modify and make a work-packet Y2K compliant, and the maintenance size, measured through the metrics shown in Table 2. The derived cost models can be used by project managers to estimate the total effort required for all the phases of a work-packet conversion process. The effort for the delivery/installation phase is not computed by the models, as it strongly depends on external factors of the customer environment.

5.1. Empirical approach

We focus on ordinary multivariate least squares regression (Meyers, 1986; Stuart and Old, 1991) as the modeling technique for effort estimation, since this is one of the most common modeling techniques used in practice (Briand et al., 2000; Gray and MacDonnell, 1997).


Furthermore, the literature has recently shown evidence that ordinary least squares regression is as good as or better than many competing modeling techniques in terms of prediction accuracy (Briand et al., 2000; Jeffery et al., 2000). In particular, we tried different multivariate linear models:

  Y = b1 X1 + ... + bn Xn    (1)

where each Xi is a size metric in Table 2 or some function of Xi (2). An interesting point is whether an intercept term b0 should be included in the model. Such a term would suggest the existence of an effort type not directly related to the variables included in the model. However, from the p-value of the significance test for the regression coefficient b0 we have statistical evidence that an intercept is not needed. Thus, the intercept term b0 was not considered in our models.

To evaluate the performance of the different models we used two distinct classes of evaluation criteria. The first class is based on statistics developed from the analysis of variance, such as the coefficient of determination R2, which represents the percentage of variation in the dependent variable explained by the independent variables of the model (Stuart and Old, 1991). R2 does not assess the quality of future predictions, only the capability of fitting the sample data.

A class of evaluation criteria that assesses the quality of future predictions is based on residual analysis. A residual is the difference between an observed value of the dependent variable y and the value predicted by the model. The predictive performance of the models was assessed using a leave-one-out cross validation approach (Bradley and Gong, 1983), based on the examination of the residuals yi − ŷi, where yi is the i-th observed value of the dependent variable in the data set and ŷi is the value predicted by a regression equation trained with all observations except the i-th. Thus, in a sample data set of size n, there are n separate regression equations, each obtained from n − 1 observations and tested on the withheld datum. The accuracy measures are averaged over n, and the choice of the best model is determined by selecting the one with the lowest value (Meyers, 1986). Using the leave-one-out cross validation approach we computed the mean relative error MRE and the following variants of the measure PRED (Bradley and Gong, 1983; Jorgensen, 1995):

• PRED25 = the percentage of work-packets with relative error RE ≤ 0.25;
• PRED50 = the percentage of work-packets with RE ≤ 0.50.

(2) A transformation function might be required to verify the normality assumption on the data set, which is necessary to apply regression analysis (Stuart and Old, 1991).
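A minimal sketch of this evaluation procedure follows, assuming numpy arrays X (one column per size metric, possibly transformed) and y (observed effort); it is an illustration of the method, not the authors' original scripts.

```python
# No-intercept least squares, leave-one-out cross validation, and the
# MRE / PRED25 / PRED50 accuracy measures.
import numpy as np


def loocv_scores(X: np.ndarray, y: np.ndarray):
    n = len(y)
    rel_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        # Fit Y = b1*X1 + ... + bn*Xn (no intercept) on all observations but the i-th.
        coeffs, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        y_hat = X[i] @ coeffs
        rel_errors[i] = abs(y[i] - y_hat) / y[i]
    return (rel_errors.mean(),                 # MRE
            np.mean(rel_errors <= 0.25),       # PRED25
            np.mean(rel_errors <= 0.50))       # PRED50

# Example: the bivariate square-root model of Section 5.2
# X = np.column_stack([np.sqrt(sc), np.sqrt(ai)])
# mre, pred25, pred50 = loocv_scores(X, effort)
```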

Finally, robust regression techniques were also explored to handle nonobvious outliers (Stuart and Old, 1991). Outliers not only influence the estimation of the regression coefficients, they can have an even larger effect on standard errors, t-tests, R2, and other regression statistics. Ordinary least squares analysis does not perform well when outliers occur; in this case we say that it is not resistant to changes in one or two observations. A robust estimate is one that is resistant to even drastic changes in one or two observations. Several families of robust estimators have been developed; the methods used here fall into the family of M-estimators. This type of maximum-likelihood estimator minimizes the sum of a function of the residuals. M-estimation is usually approximated through the use of iteratively reweighted least squares. The applied robust method uses the least absolute deviation as the influence function on the residuals. Data points that showed a significant departure in their residual values were rejected. After the outlier removal the models were recomputed.

5.2. Building the effort estimation models

The considerations outlined in Section 4 and the aim of producing a cost model that is simple to use and applicable early in the life cycle of a work-packet conversion project suggested building a model including only the number of software components SC. This metric has the advantage of being easily measured and of being available almost immediately. The characteristics of the resulting regression model are shown in the first row of Table 8. The model does not seem to fit the data well, as demonstrated by the R2. Also, its predictive performance is not adequate for concrete use as an effort estimation model, as shown by the cross validation results (see the last three columns of Table 8). It is worth noting that the results did not improve using functional transformations of the independent variable. The fact that we were not able to build any suitable model based only on the number of source code components can be imputed to the fact that the number of source code components mainly affects the first two phases of the process, which are less effort demanding than the latter three phases.

Therefore, we included other metrics in the analysis, such as the number of candidate impacts CI and the number of actual impacts AI, and constructed other models. The drawback is that these metrics are not available at the beginning of the work-packet conversion process, and therefore their values have to be estimated in advance to appropriately use the cost estimation model. In addition, our analysis has demonstrated that the number of candidate and actual impacts cannot be simply estimated from the number of software components, as can also be deduced from the lack of correlation between these metrics (see Table 7).
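For completeness, the robust M-estimation step described in Section 5.1 can be sketched as iteratively reweighted least squares with a least-absolute-deviation influence function; this is an illustration of the technique under the same no-intercept setting, not the exact procedure applied in the study, and the rejection rule in the trailing comment is an assumption.

```python
# Iteratively reweighted least squares approximating a least-absolute-deviation
# (L1) fit; observations with unusually large robust residuals can then be
# flagged as outliers and the model refit without them.
import numpy as np


def irls_lad(X: np.ndarray, y: np.ndarray, iters: int = 100, eps: float = 1e-6):
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)          # ordinary LS start
    for _ in range(iters):
        residuals = y - X @ coeffs
        w = 1.0 / np.maximum(np.abs(residuals), eps)        # LAD weights
        sw = np.sqrt(w)
        new_coeffs, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.allclose(new_coeffs, coeffs, rtol=1e-8, atol=1e-10):
            break
        coeffs = new_coeffs
    return coeffs

# resid = y - X @ irls_lad(X, y)
# mad = 1.4826 * np.median(np.abs(resid - np.median(resid)))
# outliers = np.abs(resid) > 2.5 * mad     # example rejection rule (assumption)
```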


Table 8
Total effort estimation model parameters and performances

  Model        bi (Coeff.)   p-value    R2     MRE (%)   PRED25 (%)   PRED50 (%)
  SC           0.12256       2.55E-11   0.59   85.42     20.21        45.74
  SC           0.14109       1.64E-11   0.84   54.32     34.88        60.46
    CI         8.925E-03     2.94E-02
  sqrtSC       2.45253       2.52E-10   0.91   64.98     30.71        72.60
    sqrtAI     7.02285       5.76E-10
  sqrtSC       2.26257       5.31E-08   0.92   60.99     48.38        75.80
    sqrtNSAI   4.66005       6.10E-04
    sqrtSAI    7.11513       3.65E-05

Based on these observations, the second effort estimation model we constructed was based on the number of software components SC and the number of candidate impacts CI as independent variables. The candidate impacts are available quite early, at the end of the analysis phase, and are automatically computed using tools. In this way, the rough estimates produced at the beginning of the project using the model based only on the number of software components can be refined. The second row of Table 8 shows the model parameters and its predictive performance. The model fits the data reasonably well, as demonstrated by the R2; also, the model's predictive performance is acceptable although not excellent (Vicinanza et al., 1991): the mean relative error is about 54%, and the model predicts about 35% of the data with a relative error below 25% and about 60% of the data with a relative error below 50%.

We also investigated the construction of models including the number of actual impacts. We did not try to construct a model including both the number of candidate impacts and the number of actual impacts, as the latter are obtained from the former during the design phase of the process. In addition, there is a correlation between CI and AI, although it is not very strong (see Table 7). We identified the two models shown in the last two rows of Table 8: a model including the variables SC and AI, and a model including the variables SC, SAI, and NSAI (the prefix sqrt indicates a square root transformation of the data). As shown in Table 8, both models perform better than the model including the number of candidate impacts CI, although, as expected, the multivariate model with three variables seems to better explain the data (R2 = 0.92). This model also exhibits the best predictive capability, as shown in Table 8: it predicts nearly half of the cases with an acceptable RE (<25%) and reaches 75% of the cases with an RE within 50%. However, the bivariate model including the number of software components SC and the number of total actual impacts AI exhibits almost the same performance as the trivariate model (except for the PRED25 measure), but it is easier to use at the beginning of the process, as it only requires an estimate of the total number of actual impacts rather than of its standard and nonstandard components. However, both models can be used at the end of the design phase to refine the estimates produced with the first two models.

We also applied robust regression analysis to investigate the presence of outliers. For the first two models, one based on the number of software components SC and the other based on SC and the number of candidate impacts CI, there is no evidence of outliers. The two models based on the number of actual impacts presented outliers (one outlier for the model based on AI and three outliers for the model based on SAI and NSAI). After discarding the outliers, the estimates obtained for these models improved in all the parameters, although the improvement was marginal due to the low number of outliers.

5.3. Using the effort estimation models

As said in the previous section, the model based on the number of software components is the easiest to use, although it is the least reliable. The main problem with using the other, more reliable models is that the numbers of candidate and actual impacts are only available at the end of the analysis and design phases, respectively. Therefore, it would be more appropriate to use the model based on the number of software components SC at the beginning of the process, to achieve a rough estimate of the total effort; then it would be reasonable to exploit the metrics available at the end of the analysis and design phases of the process to refine and improve the early estimate with the more accurate models. Indeed, effort estimation is an activity that must be continuously performed in all the phases of the software process, to verify and, if necessary, improve the previous estimate as soon as more reliable information is available (Pressman, 1997). The main weakness of this incremental approach is the too rough and unreliable initial estimate of the model based on the number of software components.
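As a worked illustration of this incremental use, the fitted coefficients of Table 8 can be applied to work-packet WP-3 of Table 6 (SC = 319, AI = 22, SAI = 8, NSAI = 14, actual effort 83 man-days). The comparison is only indicative, since WP-3 comes from the same project data on which the models were fitted.

```python
# Early estimate from SC only, then refinement once the actual impacts are known.
from math import sqrt

SC, AI, SAI, NSAI, actual = 319, 22, 8, 14, 83

estimates = {
    "SC only (inventory time)": 0.12256 * SC,
    "sqrt SC + sqrt AI":        2.45253 * sqrt(SC) + 7.02285 * sqrt(AI),
    "sqrt SC + NSAI + SAI":     2.26257 * sqrt(SC) + 4.66005 * sqrt(NSAI)
                                + 7.11513 * sqrt(SAI),
}
for name, est in estimates.items():
    print(f"{name:26s} {est:6.1f} man-days  RE = {abs(est - actual) / actual:.0%}")
# Roughly 39 man-days (RE ~53%) from SC alone, refined to about 77-78 man-days
# (RE below 10%) once the actual impacts are available.
```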


Table 9
Effort estimation model for inventory and analysis phases

  Model   bi (Coeff.)   p-value    R2     MRE      PRED25   PRED50
  SC      0.03682       1.52E-07   0.83   44.99%   25.58%   54.65%

To reduce the risks related to unreliable estimates of the total effort, we tried to construct a more accurate model based on the number of software components to estimate only the relatively lower effort of the inventory and analysis phases. This reduces the prediction error and the associated risks. The model parameters and the predictive performance are shown in Table 9. This model fits the data better than the model predicting the total effort of the work-packet (R2 = 0.83) and has a much lower and more reasonable mean relative error (about 45%). An improvement is also achieved in the PRED measures. Another advantage of this model is that its estimates can soon be verified against the actual values; in this way the related decisions and adjustments to the process, for example about resource and time scheduling, can be taken with greater reliability, derived from the knowledge of the first two phases of the process and from the new estimates based on the newly available data.

6. Effort distribution analysis

In this section we analyze the data about the distribution of the effort over the phases of the massive maintenance project. We examined the existence of a relationship between the effort of the different phases. The correlation matrix in Table 10 shows a high correlation between the effort of subsequent phases. This suggests that a meaningful analysis of the distribution of the effort among the phases is possible.

Table 11 shows the relative distribution of the effort among the different phases. This distribution seems reasonable and consistent with our expectations. From interviews with some of the software engineers involved in the process, we concluded that:

• The low percentages for the analysis and implementation phases are explained by the fact that these phases are well structured and supported by automatic tools. It is worth remembering that the candidate impacts are obtained in a conservative way by means of software tools.
• The most expensive and crucial phase is design. Indeed, in this phase a large amount of effort is needed for identifying and selecting the actual impacts from the candidate impacts. This also explains the fact that the number of candidate impacts CI is a good metric to predict the overall effort for a work-packet (see Table 7 and the second row of Table 8).
• The percentage for the testing phase is lower than suggested by other authors (Lynd, 1997; Interesse and Dabicco, 1999). The reason for this is likely the low integration effort due to the decomposition into work-packets and to the adoption of the isolation testing method at the unit level. Another possible reason is the fact that testing often fills the remaining scheduled time available and therefore is not conducted in a systematic way. This fact is also evidenced by the lack of correlation between the number of test cases TC and the effort of the testing phase, shown in Table 12.
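The relative distribution itself is straightforward to compute; the following is a sketch, assuming a data frame df with one row per work-packet and one effort column per phase (the frame and column names are assumptions).

```python
# Per-phase relative effort shares and the summary statistics of Tables 11/13/14.
import pandas as pd

phases = ["Inventory", "Analysis", "Design", "Implementation", "Testing"]


def effort_distribution(df: pd.DataFrame, phases: list) -> pd.DataFrame:
    shares = df[phases].div(df[phases].sum(axis=1), axis=0)   # fractions per work-packet
    summary = shares.agg(["mean", "std", "median"]).T
    summary["COV"] = summary["std"] / summary["mean"]
    return summary.round(3)

# print(effort_distribution(df, phases))
```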

Table 10
Phase correlation matrix

                   Inventory   Analysis   Design   Implementation   Testing
  Inventory        1.00        0.84       0.87     0.58             0.49
  Analysis         0.84        1.00       0.85     0.66             0.49
  Design           0.87        0.85       1.00     0.73             0.54
  Implementation   0.58        0.66       0.73     1.00             0.82
  Testing          0.49        0.49       0.54     0.82             1.00

Table 11
Effort distribution

           Inventory   Analysis   Design   Implementation   Testing
  Mean     13.2%       11.1%      38.4%    11.3%            25.5%
  StdDev   5.9%        3.9%       14.0%    5.7%             13.4%
  Median   11.6%       10.9%      36.8%    12.4%            30.5%
  COV      0.45        0.36       0.36     0.51             0.53


Table 12
Correlation between the effort of the single phases and the maintenance size metrics

  Effort           SC     CI     AI     NSAI   SAI    TC
  Inventory        0.75   0.58   0.46   0.58   0.39   0.40
  Analysis         0.77   0.57   0.51   0.56   0.45   0.46
  Design           0.80   0.74   0.67   0.76   0.58   0.62
  Implementation   0.65   0.68   0.70   0.82   0.61   0.64
  Testing          0.45   0.39   0.48   0.55   0.42   0.33

Fig. 1. Effort distribution box plot.

The data in Table 11 show that the mean and median of the testing phase are not very close and that the standard deviation is relatively high (see also the coefficient of variation, COV). This is also better evidenced in the box plot of Fig. 1. The boxes are plotted as follows (Stuart and Old, 1991): the width of the box is arbitrary; the bottom and top of the box are the 25th and 75th percentiles; therefore, the length of the box is the interquartile range (IQR), i.e. the box represents the central 50% of the data distribution; the line drawn through the middle of the box is the median (the 50th percentile). The adjacent values are displayed as T-shaped lines that extend from each end of the box. The upper adjacent value is the largest observation that is less than or equal to the 75th percentile plus 1.5 times the IQR. The lower adjacent value is the smallest observation that is greater than or equal to the 25th percentile minus 1.5 times the IQR. The observations that fall between the adjacent values and the boxes might include outliers (these values are called mild outliers), while the observations that fall outside the adjacent value range more likely represent outliers (these values are called severe outliers).

The coefficient of variation is also high for the inventory and implementation phases, although in these cases the median and mean are reasonably close. However, it is worth noting that the inventory phase is stable and its high COV in Table 11 depends on the instability of the terminal phases of the process. Table 13 shows the distribution of the effort without considering the testing phase. The data clearly indicate a lower COV for the inventory phase, while the COV for the implementation phase is still high. However, the testing phase must be considered less stable than the implementation phase, as demonstrated by the distance between median and mean in Table 11. The data suggest that most likely there are outliers or the distribution is skewed. To identify outliers we used a T2 test with probability value α = 0.05, based on the Mahalanobis distance of each point from the mean (Stuart and Old, 1991). Fig. 2 shows the box plot for the distribution of the effort when the outliers are discarded from the data set (six outliers were identified, three of them related to the testing phase). Comparing with Fig. 1, the improvement of the distribution can be clearly noted, in particular with reference to the testing phase: the median is closer to the mean and there are acceptable values of COV in all the phases (see Table 14).
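A sketch of this multivariate outlier check follows, using a chi-square approximation for the T2 cutoff at α = 0.05 (the exact critical value used in the study may differ); df and phases are the assumed work-packet data frame and phase columns from the earlier sketch.

```python
# Mahalanobis-distance outlier identification on the per-phase effort shares.
import numpy as np
from scipy.stats import chi2

shares = df[phases].div(df[phases].sum(axis=1), axis=0)
X = shares[phases[:-1]].to_numpy()        # drop one phase: shares sum to 1 per row
center = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", X - center, cov_inv, X - center)   # squared distances

cutoff = chi2.ppf(0.95, X.shape[1])       # alpha = 0.05
print("outlier work-packets:", np.flatnonzero(d2 > cutoff))
```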

Fig. 2. Effort distribution box plot without outliers.

Table 13
Effort distribution without testing

           Inventory   Analysis   Design   Implementation
  Mean     17.6%       14.9%      50.9%    16.4%
  StdDev   6.3%        4.0%       11.6%    9.1%
  Median   17.5%       14.7%      54.0%    18.0%
  COV      0.36        0.27       0.23     0.55


Table 14
Effort distribution without outliers

           Inventory   Analysis   Design   Implementation   Testing
  Mean     11.5%       10.9%      38.1%    11.7%            27.7%
  StdDev   3.5%        2.8%       10.9%    3.6%             10.7%
  Median   10.9%       10.7%      36.7%    12.5%            30.9%
  COV      0.31        0.26       0.29     0.31             0.39

bution, in particular with reference to the testing phase: the median is closer to the mean and there are acceptable values for COV in all the phases (see Table 14). A closer examination of the outliers revealed some useful and interesting characteristics to understand the cause of the variance of the distribution: all the outliers concerned work-packets including a large number of candidate impacts CI and a relatively small number of actual impacts AI. This characteristic directly explains the high variance of the effort distribution in the implementation and testing phases. The high drop from CI to AI results in a small number of AI that very often requires a little implementation and testing effort, because there are few modifications to implement and test. Therefore, for these observations the effort for implementation and testing is relatively lower than the effort of the other phases, producing lower percentage values. We also investigated the effects of these outliers on the effort prediction models. First, we analyzed the predictions of the models on the outliers: we verified that all the predictions fell within an acceptable low range of relative error (only one outlier overcomes the 25%). Then, we recalibrated the models discarding the outliers: the results were not interesting as there was not relevant improvement of the models. The reason is due to the fact that the outliers for the distributions are not outliers for the effort prediction models, as the models work well for these observations too. Indeed, the little effort required for the implementation and testing phases is compensated by a larger effort required in the design phase to analyze a large number of CI. However, even discarding the outliers for the effort distribution, the testing phase still remain the less stable phase, as shown in Fig. 2 and Table 14. This means that there must be some other causes for the variation in the distribution. Indeed, the instability of the testing phase is also demonstrated by the correlation between the effort of the different phases and the maintenance size metrics (see Table 12). We expected that every collected metric had a clear and intuitive influence on a particular phase of the process. For example the number of candidate impacts CI mainly influences the design phase. Also, we expected that the number of test cases TC or the number of actual impacts AI had a relatively low influence on the inventory and analysis phases and a strong influence on the implementation and testing phases. Indeed, all the correlation values seem reason-

Indeed, all the correlation values seem reasonable and there are no peculiarities or anomalies, with the exception of the testing phase, where there is no correlation between the effort and any maintenance size metric, including TC. The lack of correlation between TC and the testing effort admits different possible explanations. A first explanation is that the testing phase is the most human intensive and the least automated; problems encountered in the different work-packets likely required different effort to be solved, not directly correlated with the number of test cases. However, this explanation requires further investigation, as at present it is not supported by the data. A more likely explanation is the adoption of the practice of compressing the remaining work of projects that are late on their schedule. Indeed, most of the variation is due to work-packets with a small percentage of effort for the testing phase (see Fig. 1), even after outlier removal (see Fig. 2). Therefore, it is likely that the maintenance teams reduced the effort of the testing phase by executing non-systematic and simplified variants of the standard procedures. Indeed, the testing process for massive maintenance projects is well defined by the organization's standard practices. Probably, these practices were rigorously applied in the analyzed project to select the test cases, as clearly evidenced by the strong correlation between the number of actual impacts AI and the number of test cases TC (see Table 7), but they were not followed to execute the test cases.
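For reference, descriptive statistics such as those in Tables 13 and 14 and the phase-by-metric correlations of Table 12 can be reproduced with ordinary summary statistics. The sketch below assumes a pandas data frame with one row per work-packet and hypothetical column names, and it uses Pearson correlation as one plausible choice (the paper does not state which coefficient was computed); it is an illustration, not the scripts used in the study.

```python
import pandas as pd

# Hypothetical data frame: one row per work-packet, columns holding the
# effort share of each phase plus the size metrics CI, AI and TC.
# df = pd.read_csv("work_packets.csv")

def effort_distribution(df, phases):
    """Mean, standard deviation, median and COV of each phase's effort share."""
    out = pd.DataFrame({
        "mean": df[phases].mean(),
        "stddev": df[phases].std(),
        "median": df[phases].median(),
    })
    out["cov"] = out["stddev"] / out["mean"]   # coefficient of variation
    return out

def phase_metric_correlations(df, phases, metrics):
    """Pearson correlation between each phase's effort share and each size metric."""
    return pd.DataFrame(
        {m: [df[p].corr(df[m]) for p in phases] for m in metrics},
        index=phases,
    )

# Hypothetical usage:
# print(effort_distribution(df, ["inventory", "analysis", "design", "implementation", "testing"]))
# print(phase_metric_correlations(df, ["design", "testing"], ["CI", "AI", "TC"]))
```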

7. Discussion and conclusion

Quality professionals spend a lot of time talking about measurement and about the ability of information technology projects to produce effective estimates. SPC techniques can identify sources of variation, distinguishing variations caused by normal process operation from variations caused by anomalies in the process. For software processes at CMM levels 2 and 3, and for those entering CMM level 4, SPC is dominated by the process control chart (Paulk and Chrissis, 2000). Large scale variations attributable to inconsistencies and noise in the software process can be observed, tracked, and eliminated through the careful use of basic statistics and the control chart.
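As an illustration of the kind of basic control-chart computation referred to above, the sketch below derives the natural process limits of an individuals and moving range (XmR) chart using the standard constants 2.66 and 3.267 described by Wheeler and Chambers (1991); the input series name is hypothetical and the code is ours, not part of the study.

```python
import numpy as np

def xmr_limits(x):
    """Natural process limits for an individuals (XmR) control chart."""
    x = np.asarray(x, dtype=float)
    moving_range = np.abs(np.diff(x))          # point-to-point moving ranges
    x_bar, mr_bar = x.mean(), moving_range.mean()
    return {
        "X center": x_bar,
        "X UCL": x_bar + 2.66 * mr_bar,        # upper natural process limit
        "X LCL": x_bar - 2.66 * mr_bar,        # lower natural process limit
        "mR UCL": 3.267 * mr_bar,              # upper limit for the moving range
    }

# Hypothetical usage: effort per work-packet in chronological order.
# print(xmr_limits(effort_per_work_packet))
```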


By the time an organization has matured its software processes further, the process variances attributable to inconsistency and noise have been largely eliminated. Remaining variation is attributable to finer-scale measures among interdependent variables. As a result, regression analysis has become the dominant SPC statistical technique over control charts (Paulk and Chrissis, 2000). Regression analysis is used to build estimation models whose accuracy is a measure of the predictability and, hence, stability of the process analyzed. In our study we have applied statistical methods to produce effort estimation models and to verify their prediction performance. This has enabled us to assess the maturity and robustness of the studied maintenance process through the predictability, and consequently repeatability, of the effort model. Indeed, the results of the regression models demonstrate a good repeatability and predictability of the effort required for a maintenance project. Although the results are preliminary and more studies are required, this can be considered an indicator of the maturity and stability of the analyzed process. The derived models also improved on the previous cost estimation models adopted by the organization. The estimation model currently most trusted by the organization for the massive maintenance process is mainly based on the system size expressed in function points and has a low degree of fit with the observations: it produces many large errors and is unsuitable for estimation. Moreover, that model is not based on the decomposition of the system into work-packets. Our regression analysis produced different effort prediction models. The best model uses three independent variables, namely the number of source code components, the number of standard actual impacts, and the number of nonstandard actual impacts. Although the mean relative error MRE is about 60%, the model has interesting values for the PRED measures: it predicts nearly half of the cases with an acceptable RE (<25%) and reaches 75% of the cases with RE within 50%, which can be considered good (Vicinanza et al., 1991). Indeed, experience shows that accurate prediction is difficult: an average error of 100% can be considered "good" and an average error of 32% "outstanding" (Vicinanza et al., 1991). However, using the trivariate model for cost estimation at the beginning of the process is not easy, as it may be difficult to estimate the number of standard and nonstandard actual impacts. For this reason, more practical bivariate models, based on the number of source code components and either on the total number of actual impacts or on the number of candidate impacts, can be preferred, although they may be less accurate. Indeed, to use these bivariate models at the beginning of the process, only one variable has to be estimated based on the project manager's experience on similar projects: either the number of actual impacts or the number of candidate impacts, which will only become available later in the process.
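The accuracy measures quoted above can be computed from a set of actual and predicted efforts as follows. This is a generic sketch of the usual definitions of MRE and PRED, not the authors' scripts; in particular, the use of a non-strict threshold comparison is an assumption.

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of relative error for each observation."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted) / actual

def summarize_accuracy(actual, predicted, thresholds=(0.25, 0.50)):
    """Mean MRE plus PRED(q): the fraction of cases with MRE <= q."""
    errors = mre(actual, predicted)
    summary = {"mean MRE": errors.mean()}
    for q in thresholds:
        summary[f"PRED({int(q * 100)})"] = (errors <= q).mean()
    return summary

# Hypothetical usage with effort values in person-days:
# print(summarize_accuracy(actual_effort, predicted_effort))
```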

Our analysis has demonstrated that although the model based only on the number of source code components is the easiest to use, its performance is not very good and the estimates would not be very accurate, as the number of source code components is a better predictor for the initial phases of the process than for the terminal phases. The bivariate model based on the number of software components and on the total number of actual impacts exhibits almost the same performance as the trivariate model and could therefore be used as an alternative to it. On the other hand, the model including the number of software components and the number of candidate impacts exhibits the worst performance: it predicts 35% of the cases with RE within 25% and reaches 60% of the cases with RE within 50%, although it exhibits the lowest mean relative error (54%). However, this model can be used early in the process, as the number of candidate impacts is known at the end of the analysis phase. It is worth noting that the effort spent in the first two phases of the process (inventory and analysis) is relatively low (about 24% of the total effort including outliers and about 22% without considering outliers in the distribution) and that most of the effort is spent in the design phase (about 38% of the total effort), where candidate impacts are analyzed to identify actual impacts. Therefore, a practical use of the derived estimation models would be to use the number of source code components to achieve an initial rough estimate of both the total effort and the effort of the first two phases of the process; the other bivariate models can then be used at the end of the analysis and design phases, respectively, to refine the initial estimates, once the number of candidate impacts and the number of actual impacts, respectively, are available. Indeed, effort estimation is an activity that must be performed continuously in all the phases of the software process, to verify and, if necessary, improve the previous estimate as soon as more reliable information is available (Pressman, 1997). Regression analysis was not used to identify the causes of variation that can negatively impact process execution, but only to verify that the range of variation of the process performance (e.g. the relative error of prediction) was acceptably low. On the other hand, an attempt has been made to use statistical methods to better understand the process and to discern the causes of variation of the massive maintenance process. We have used simpler SPC techniques, such as box plots and other basic statistics, and realized that the high variance of the percentage of the effort in the terminal phases of the process is not due to anomalies of the implementation and testing phases, but to the limitations of the overall process of massive maintenance.
The high variance is in part due to the reduction of a very large number of candidate impacts to a very small number of actual impacts in the design phase. This implies high percentages of effort in the initial phases, in particular in the design phase, and relatively small percentages in the terminal phases. Unfortunately, this drop in the number of impacts is not predictable. It is worth noting that although the distribution of the effort was not balanced for these work-packets, the relative error of the cost estimation model was still low when applied to them, which means that the total effort is still predictable in these cases. However, further investigation revealed that other possible causes of variation are testing problems of a different nature occurring in the conversion of different work-packets and the practice of reducing the effort of the terminal phases for work-packets that were late on their schedule. All these explanations confirm that testing is the most complex and least stable phase of a massive maintenance process (Lynd, 1997; Interesse and Dabicco, 1999). Finally, we can outline some general lessons learned from our study:

• Understanding variation is mandatory, not using a specific SPC tool: A high maturity organization should choose the appropriate statistical or modeling technique to answer its questions and meet its goals. Statistical techniques can be time consuming and counterproductive if applied too soon, to unreliable low maturity data, or in inadequate ways. Control charts are a powerful statistical tool, but many other rigorous techniques are possible, such as analysis of variance and regression analysis. In particular, in our study we used regression analysis for the construction of the estimation models and simpler techniques such as box plots for analyzing the effort distribution among the phases of the massive maintenance process.

• Measurement should be homogeneous and driven by specific business goals: In our study we did not experience the problem of inhomogeneous data. The data set provided by the organization was collected without changes to the technology used for process execution, and the way of measuring each metric was clearly and unambiguously defined. But this is not always the case. An operational definition and a uniform measurement procedure must be defined for each collected metric, to avoid misleading and inconsistent data. Moreover, it is very important to identify the relevant process measures. These measures should be linked to the business drivers, and the process characteristics should drive the selection of the possible set of measures and determine their operational definitions (Basili and Rombach, 1988). The metrics used in our study were very general and not driven by specific business goals.
Conversely, business goals should be taken into account when devising a metric plan in future projects: this is a prerequisite to understanding and controlling the critical aspects of the processes and to aiming at process improvement.

• Disaggregating process control data should be a good practice: This problem is a direct consequence of the limited consideration of business drivers when devising a metric plan. Aggregated data is too variable and can be misleading. In the massive maintenance process the nonstandard actual impacts clearly require a larger effort than the standard actual impacts. However, the available data did not provide a distinction between the effort spent for the different types of impacts.

• Data collection should be frequent and automated: The frequency of data collection should be high enough to provide real time control of the process and, therefore, allow people to take immediate action on the data. Data collected at CMM level 3 is of little value with respect to the use and need for data at CMM level 4. As the organization matures, the data must be collected more frequently and in more detail, with increasing accuracy and granularity. This obviously increases the costs of data collection, but also the quality and effectiveness of data analysis. Automation of data collection is an activity that should be in place before getting to CMM level 4, so that the quantitative focus of level 4 does not add overhead to the implementation of the software process. In our study we found several observations with missing values for some metrics. Therefore, we had to perform row-wise elimination on the data set before applying regression analysis (a minimal sketch of this step is given at the end of this section), as the observations containing missing values were unsuitable for the statistical analysis.

The last issue is a very important problem, especially for small data sets. The main cause of this problem is likely the manual recording of the measures, which can be tedious and perceived as unimportant in organizations at lower maturity levels. We advocate the adoption of automated support to collect process metrics. Within the "Virtual Software Factory" project we are investigating a number of technologies that enable the management of distributed software processes and the cooperation among software engineers within a networking environment. An example of such technologies is workflow management (Georgakopoulos et al., 1995; Maurer et al., 2000). We are currently developing and experimenting with prototypes to support the corrective and the massive maintenance processes (Aversano et al., 2001a, 2001b). Future work will be devoted to experimenting with these prototypes within real projects. The data collected automatically will be used to evaluate the impact of workflow management tools on process performance.
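As a minimal sketch of the row-wise elimination step mentioned above, and assuming a pandas data frame with hypothetical column names, observations missing any metric required for the regression can be dropped as follows.

```python
import pandas as pd

def rowwise_eliminate(raw: pd.DataFrame, required_columns) -> pd.DataFrame:
    """Drop every observation that misses any metric needed for the regression."""
    complete = raw.dropna(subset=list(required_columns))
    print(f"Discarded {len(raw) - len(complete)} of {len(raw)} observations")
    return complete

# Hypothetical usage: the metric names are placeholders, not the study's schema.
# raw = pd.read_csv("work_packets_raw.csv")
# clean = rowwise_eliminate(raw, ["components", "CI", "AI", "TC", "effort"])
```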


Acknowledgements

We would like to thank A. Cimitile for the valuable suggestions on the work described in this paper. A special thanks goes to A. Persico and A. Pannella for helping in collecting and analyzing the data. The work described in this paper is supported by the project "Virtual Software Factory", funded by the Ministero dell'Università e della Ricerca Scientifica e Tecnologica (MURST) and jointly carried out by EDS Italia Software, the University of Sannio, the University of Naples "Federico II", and the University of Bari.

References

Arnold, R.S., Bohner, S.A., 1993. Impact analysis––toward a framework for comparison. In: Proceedings of International Conference on Software Maintenance, Montreal, Canada. IEEE Computer Society Press, Los Alamitos, CA, pp. 292–301.
Aversano, L., Betti, S., De Lucia, A., Stefanucci, S., 2001a. Introducing workflow management in software maintenance processes. In: Proceedings of International Conference on Software Maintenance, Florence, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 441–450.
Aversano, L., Canfora, G., Stefanucci, S., 2001b. Understanding and improving the maintenance process: a method and two case studies. In: Proceedings of the 9th International Workshop on Program Comprehension, Toronto, Canada. IEEE Computer Society Press, Los Alamitos, CA, pp. 199–208.
Basili, V., Briand, L., Condon, S., Kim, Y.M., Melo, W.L., Valen, J.D., 1996. Understanding and predicting the process of software maintenance releases. In: Proceedings of 18th International Conference on Software Engineering. IEEE Computer Society Press, Los Alamitos, CA, pp. 464–474.
Basili, V.R., Rombach, H.D., 1988. The TAME project: towards improvement-oriented software environments. IEEE Transactions on Software Engineering 14 (6), 758–773.
Belady, L., Lehman, M., 1972. An introduction to program growth dynamics. In: Freiberger, W. (Ed.), Proceedings of Conference on Statistical Computer Performance Evaluation. Academic Press, New York, NY, pp. 503–511.
Bianchi, A., Caivano, D., Lanubile, F., Rago, F., Visaggio, G., 2001. Distributed and colocated projects: a comparison. In: Proceedings of 7th IEEE Workshop on Empirical Studies of Software Maintenance, Florence, Italy, pp. 65–69.
Boehm, B., 1981. Software Engineering Economics. Prentice-Hall Inc., Englewood Cliffs, NJ.
Boehm, B., Clark, B., Horowitz, E., Westland, C., Madachy, R., Selby, R., 1987. Cost models for future software life cycle processes: COCOMO 2.0. Annals of Software Engineering 1, 57–94.
Bradley, E., Gong, G., 1983. A leisurely look at the bootstrap, the jack-knife and cross-validation. The American Statistician 37 (1), 836–848.
Briand, L., Basili, V., Thomas, W.M., 1992. A pattern recognition approach for software engineering analysis. IEEE Transactions on Software Engineering 18 (11), 931–942.
Briand, L., Langley, T., Wieczorek, I., 2000. A replicated assessment and comparison of common software cost modeling techniques. In: Proceedings of 22nd International Conference on Software Engineering, Limerick, Ireland. ACM Press, New York, NY, pp. 377–386.

Brodie, M.L., Stonebraker, M., 1995. Migrating Legacy Systems––Gateways, Interfaces & Incremental Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA.
Caivano, D., Lanubile, F., Visaggio, G., 2001. Software renewal process comprehension using dynamic effort estimation. In: Proceedings of International Conference on Software Maintenance, Florence, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 209–218.
De Lucia, A., Fasolino, A.R., Pompella, E., 2001a. A decisional framework for legacy system management. In: Proceedings of International Conference on Software Maintenance, Florence, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 642–651.
De Lucia, A., Persico, A., Pompella, E., Stefanucci, S., 2001b. Improving corrective maintenance effort prediction: an empirical study. In: Proceedings of 7th IEEE Workshop on Empirical Studies of Software Maintenance, Florence, Italy, pp. 97–100.
Di Penta, M., Rago, F., Di Lucca, G., Antoniol, G., Casazza, G., 2001. A queue theory-based approach to staff software maintenance centers. In: Proceedings of International Conference on Software Maintenance, Florence, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 510–519.
Eick, S.G., Graves, T.L., Karr, A.F., Marron, J.S., Mockus, A., 2001. Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering 27 (1), 1–12.
Florac, W.A., Carleton, A.D., 1999. Measuring the Software Process: Statistical Process Control for Software Process Improvement. Addison-Wesley, Reading, MA.
Georgakopoulos, D., Hornick, H., Sheth, A., 1995. An overview of workflow management: from process modelling to workflow automation infrastructure. Distributed and Parallel Databases 3 (2), 119–153.
Gray, A., MacDonnell, D., 1997. A comparison of techniques for developing predictive models of software metrics. Information and Software Technology 39 (6), 425–437.
Hastings, T.E., Sajeev, A.S.M., 2001. A vector-based approach to software size measurement and effort estimation. IEEE Transactions on Software Engineering 27 (4), 337–350.
Hertz, J., Krogh, A., Palmer, G.R., 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
IEEE Std. 1219, 1998. Standard for Software Maintenance. IEEE Computer Society Press, Los Alamitos, CA.
Interesse, M., Dabicco, R., 1999. Beyond year 2000 remediation: the compliance verification––a case study. In: Proceedings of International Conference on Software Maintenance, Oxford, UK. IEEE Computer Society Press, Los Alamitos, CA, pp. 155–160.
Jeffery, R., Ruhe, M., Wieczorek, I., 2000. A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Information and Software Technology 42 (14), 1009–1016.
Jones, C., 1999. Mass-Updates and software project management. Available at .
Jorgensen, M., 1995. Experience with the accuracy of software maintenance task effort prediction models. IEEE Transactions on Software Engineering 21 (8), 674–681.
Joshi, M., Kobayashi, H., 1995. Quantifying design productivity: an effort distribution analysis. In: Proceedings of European Design Automation Conference EURO-DAC, Brighton, UK. IEEE Computer Society Press, Los Alamitos, CA, pp. 476–481.
Klösch, R.R., Eixelsberger, W., 1999. Challenges and experiences in managing major software evolution endeavours such as Euro conversion or Y2000 compliance. In: Proceedings of International Conference on Software Maintenance, Oxford, UK. IEEE Computer Society Press, Los Alamitos, CA, pp. 161–166.

Lehman, M., Belady, L., 1985. Program Evolution: Processes of Software Change. Academic Press, London, UK.
Lindvall, M., 1998. Monitoring and measuring the change-prediction process at different granularity levels: an empirical study. Software Process Improvement and Practice 4 (1), 3–10.
Lynd, E.C., 1997. Living with the 2-digit year: year 2000 maintenance using a procedural solution. In: Proceedings of International Conference on Software Maintenance, Bari, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 206–212.
Mattsson, M., 1999. Effort distribution in a six year industrial application framework project. In: Proceedings of International Conference on Software Maintenance, Oxford, UK. IEEE Computer Society Press, Los Alamitos, CA, pp. 326–333.
Maurer, F., Dellen, B., Bendeck, F., Goldmann, S., Holz, H., Kotting, B., Schaaf, M., 2000. Merging project planning and web-enabled dynamic workflow technologies. IEEE Internet Computing 5 (3), 65–74.
Meyers, R.H., 1986. Classical and Modern Regression with Applications. Duxbury Press, Boston.
Niessink, F., van Vliet, H., 1997. Predicting maintenance effort with function points. In: Proceedings of International Conference on Software Maintenance, Bari, Italy. IEEE Computer Society Press, Los Alamitos, CA, pp. 32–39.
Niessink, F., van Vliet, H., 1998. Two case studies in measuring maintenance effort. In: Proceedings of International Conference on Software Maintenance, Bethesda, Maryland, USA. IEEE Computer Society Press, Los Alamitos, CA, pp. 76–85.
Paulk, M.C., Chrissis, M.B., 2000. The November 1999 high maturity workshop. Special Report CMU/SEI-2000-SR-003, Software Engineering Institute (Carnegie Mellon University). http://www.sei.cmu.edu/publications/documents/00.reports/00sr003/00sr003title.html.
Paulk, M.C., Curtis, B., Averill, E., et al., 1991. Capability Maturity Model for Software. Technical Report CMU/SEI-91-TR-24, ADA240603, Software Engineering Institute (Carnegie Mellon University).
Pigoski, T.M., 1997. Practical Software Maintenance––Best Practices for Managing Your Software Investment. John Wiley & Sons, New York.
Pressman, R., 1997. Software Engineering: A Practitioner's Approach, fourth ed. McGraw-Hill, New York, NY.
Ramil, J.F., 2000. Algorithmic cost estimation for software evolution. In: Proceedings of International Conference on Software Engineering, Limerick, Ireland. ACM Press, New York, NY, pp. 701–703.
Shepperd, M., Schofield, C., Kitchenham, B., 1996. Effort estimation using analogy. In: Proceedings of International Conference on Software Engineering, Berlin, Germany. IEEE Computer Society Press, Los Alamitos, CA, pp. 170–178.
Smith, R.K., Hale, J.E., Parrish, A.S., 2001. An empirical study using task assignment patterns to improve the accuracy of software effort estimation. IEEE Transactions on Software Engineering 27 (3), 264–271.


Stensrud, E., Myrtveit, I., 1999. Human performance estimating with analogy and regression models. IEEE Transactions on Software Engineering 25 (4), 510–525.
Stuart, A., Ord, J.K., 1991. Kendall's Advanced Theory of Statistics, vol. 2, fifth ed. Edward Arnold, London.
Tip, F., 1995. A survey of program slicing techniques. Journal of Programming Languages 3 (3), 121–189.
Vicinanza, S., Mukhopadhyay, T., Prietula, M., 1991. Software effort estimation: an exploratory study of expert performance. Information Systems Research 2 (4), 243–262.
Weiser, M., 1984. Program slicing. IEEE Transactions on Software Engineering 10 (4), 352–357.
Wellman, F., 1992. Software Costing. Prentice-Hall Inc., Englewood Cliffs, NJ.
Wheeler, D.J., Chambers, D.S., 1991. Understanding Statistical Process Control. SPC Press, Knoxville, TN.

Andrea De Lucia received the Laurea degree in Computer Science from the University of Salerno, Italy, in 1991, the M.Sc. degree in Computer Science from the University of Durham, UK, in 1995, and the Ph.D. degree in Electronic Engineering and Computer Science from the University of Naples "Federico II", Italy, in 1996. He is currently an associate professor of Computer Science at the Department of Engineering of the University of Sannio in Benevento, Italy, and a member of the executive committee of the Research Centre on Software Technology (RCOST) of the same University. He serves in the program and organizing committees of several international conferences; he is program co-chair of the 2002 IEEE International Workshop on Source Code Analysis and Manipulation and he was program co-chair of the 2001 IEEE International Workshop on Program Comprehension. His research interests include software maintenance, empirical software engineering, reverse engineering, reuse, reengineering, migration, program comprehension, workflow management, document management, and visual languages. Prof. De Lucia is a member of the IEEE and the IEEE Computer Society.

Eugenio Pompella received the Laurea degree in Engineering from the University of Naples "Federico II", Italy, in 1983. Since 1983 he has worked in the IT field for different companies, where he was involved in several projects with different roles, ranging from programmer to program manager. In 1990 he joined his current company, which was renamed EDS Italia Software S.p.A. after it was acquired by EDS in 1995. Between 1997 and 2000 he was involved as project/program manager in large Y2K and EURO conversion international projects. Since 2000 he has been involved in research projects conducted by EDS Italia Software in collaboration with the University of Sannio, the University of Naples "Federico II", and the University of Bari. His research interests include software maintenance, legacy systems assessment and reengineering, decision models for legacy systems, and cost estimation models for software maintenance. He is a member of the IEEE Computer Society.

Silvio Stefanucci received the Laurea degree in Computer Engineering from the University of Sannio, Italy, in 2000. He is currently a Ph.D. student at the University of Sannio. His research interests include empirical software engineering, software maintenance processes, and workflow management.