Program Evaluation: Define Questions and Methods
Steps five through eight will help you define the questions your evaluation should address, and develop the methods you'll use to conduct your evaluation (learn more about the other steps in general program evaluations):
- Step 5: Decisions and Questions
- Step 6: Develop Research Design
- Step 7: Identify Report Contents and Establish QA
- Step 8: Establish Quality Assurance Review Process
Step 5: Decisions and Questions
This step will:
- Help clarify the reasons why you want to undertake an evaluation, e.g., to make particular decisions about the program
- Identify higher-level, more general questions the evaluation must answer and prioritize them
- Select specific, researchable questions the evaluation must answer.
The evaluation must ask questions whose answers will satisfy the evaluation objectives. This is a very important step: if the answers to the questions do not satisfy the objectives, the evaluation resources will be wasted. To develop relevant questions, it is useful to start with a table that arranges the following from left to right:
What type of evaluation is needed to develop the needed information? What kind of information is needed to inform decisions about the program?
What are the evaluation objectives? The objectives should be appropriate to the decisions under consideration and information needed.
Specify the higher-level general questions that need to be answered to satisfy the evaluation objectives.
Prioritize the general questions. Determine what information is most needed and when. These prioritized general questions are used to begin the process of specifying more detailed questions that will supply the required information. Some questions may not be answered if it is decided that the more important questions require all of the current evaluation resources. A multi-year evaluation strategy is helpful in that it can schedule coverage of all the general questions over a period of time.
Select specific, researchable questions that can be asked to answer the general questions. The most important questions should receive enough resources to develop defensible answers. Ensure that questions about outcomes include both direct and indirect outcomes. Pose all of the questions that you think are relevant. You will screen them later.
Evaluation Type, Evaluation Objectives, General Evaluation Questions, Priorities, and Specific Research Questions
|1: Type of Evaluation||2: Evaluation Objectives||3: General Evaluation Questions||4: Priorities (High, Medium, Low)||5: More Specific Researchable Questions|
|Needs/Market Assessment||Identify how FEMP can accelerate the efficiency with which federal agencies use energy.
Identify which federal agencies most need FEMP assistance.
Create a baseline for future evaluations.
|Q1: What is preventing federal agencies from giving energy efficiency improvement a higher priority for annual funding?
Q2: What do federal agencies need that FEMP can provide to increase the number of efficiency upgrades they implement?
Q3: What are the federal agencies most in need of FEMP assistance that have not accepted it?
Q4: What energy efficiency measures have been installed and/or what is the level of energy use prior to participation in FEMP's program? (Collected prior to participation and evaluation.)
|What are the market and agency barriers to adopting better energy and water management technologies?
How are FEMP actions directly and indirectly meeting specific customers' needs or lowering barriers to action?
What customers is FEMP serving? Which need its services most?
What is the energy use per square foot of an office building prior to participation?
|Process Evaluation||Assess the adequacy of program funding relative to objectives.
Determine if funding is being used as intended.
Determine if populations that can benefit from the program are being served well.
Identify opportunities to improve the effectiveness of activities and outputs.
|Q1: What level of overall investment in energy efficiency are we leveraging with our spending?
Q2: Do the federal agencies perceive that we are helping them meet their energy-efficiency upgrade goals?
Q3: How can we make our services more productive for federal agencies?
|How much does FEMP spend and on what activities?
What is the total investment in energy and water projects?
Are FEMP partnerships leveraging funds and capabilities?
What is the quality of FEMP service and products?
What can FEMP do to improve its services and its service delivery, generally and specifically (e.g., website)?
Is FEMP reaching the right customers and are they satisfied?
|Outcome Evaluation||Quantify the achievements of program outputs and outcomes against planned time frame. Assess whether further outcomes are possible and how to achieve them.||Q1: Have overall energy savings by the federal government increased from year to year?
Q2: How many quads of energy savings are in the pipeline?
Q3: Is progress toward energy-efficiency upgrade goals, by agency, satisfactory? Can they meet these goals?
Q4: Are there any actions possible, by agency, that have not been undertaken?
|Is FEMP making progress, as measured by FY2002-2005 measures such as percent participation in the Procurement Challenge; or savings identified in audits, demonstrations, and projects in the pipeline?
Is the federal government, by agency, on track to meet its goals?
What agency actions/projects (retrofit, procurement) are possible or in the pipeline, demonstrating that those goals are being met?
|Impact Evaluation||Assess the net effect of the program's activities, i.e., the proportion of the outcomes that can be attributed to the program instead of to other influences.||Q1: How much were they achieving before each of FEMP's initiatives?
Q2: How much of the measured outcome can the program claim?
Q3: Which FEMP initiatives helped more than others?
Q4: What would have caused federal agencies to invest in energy efficiency upgrades had FEMP's programs not existed?
|To what degree did a FEMP program cause specific measured benefits?
Which FEMP tools helped more than others to create the benefits?
What external factors would have caused agencies to create savings without FEMP?
|Cost-Benefit Analysis||Determine program cost-effectiveness.
Determine the cost-effectiveness of individual outcomes, outputs, or goals, where possible.
|Q1: Are the benefits from FEMP actions greater than the total of FEMP and customer costs?||Q1 High||What are the energy savings and emissions reductions attributable to FEMP initiatives, as determined by an impact evaluation?
What are the savings to the taxpayer as a result of FEMP initiatives?
What are FEMP's costs associated with the benefits that will be quantified?
Step 6: Develop Research Design
The research design should specify:
- The questions and indicators for which data will be collected
- Inventory of existing data and identification of data gaps
- The method and timing by which the data will be collected
- The populations from which the data will be collected
- The choices of research accuracy, sampling precision and confidence level, and degree of defensibility for the results
- The method of analysis used to produce the evaluation results
- The method of reasoning from the results to answers to the questions.
Development of the research design entails creating the logical scheme for deducing useful answers from the collected data. The design must specify how the answers can be developed from each of the above evaluation activities. This, in turn, requires considering the alternatives available for each of the activities. The next sections discuss these alternatives, their uses, and their resource requirements.
Select Design Type
The research design can vary from simply tabulating the findings of a customer satisfaction survey or count of outcomes to inferring the net outcome of the program from the results of an experiment. Methods such as tabulating descriptive measurements and finding the statistical significance of a relationship between variables are usually not thought of as research designs, but, in fact, the process of going from the results of these analytical procedures to answers to evaluation questions involves logic and, therefore, constitutes a research design. These methods are not discussed further here because the logical process involved in using them to find answers to questions is relatively straightforward. They are mentioned because it is important to stress that the program manager should understand how the evaluation will derive answers to the evaluation questions from the data collected. The rest of this discussion describes the special type of research design required for impact evaluations known as an experimental design.
If you need to determine the proportion of a measured outcome that can be attributed to the program instead of to external influences (i.e., you need to conduct an impact evaluation), then some form of credible research design is required that will enable the study to infer this proportion. This design should be able to forecast what actions participants would have taken (outcomes) had your program not existed. The difference between what participants would have done and what they actually did is the amount of the observed outcome that you can attribute to your program. Evaluation research designs that allow you to make such claims of effect are called "experimental" or "quasi-experimental" designs.
Experimental and quasi-experimental designs are data-collection and analysis strategies that use deductive reasoning to estimate whether a program's outcomes can be attributed to the program's activities and outputs or whether they were likely to have occurred anyway. True experimental designs, especially those using randomly assigned participant and control groups with before-after measurement, are the "gold standard" of evaluation research; however, they are rarely used in energy program evaluations because they require random assignment of the target population to participant and control (non-participant) groups. This is not possible with programs whose success depends upon voluntary participation. When group assignments are made using non-random methods, e.g., by matching non-participants to participants on key characteristics, the term "control group" is sometimes replaced with "comparison group" and the design is considered "quasi-experimental." There are a variety of such quasi-experimental designs. Here are three of the more popular designs for energy program evaluation:
Before-After Comparison Group Design: Compare program-participants and non-program-participants on before-program and after-program measurements. The amount that the program participants changed their behavior compared to the amount the non-participants changed is the amount of change the program caused. For example, how did the before-and-after efficiency-upgrade actions in the school system of a Rebuild America community partner compare to the before-and-after efficiency-upgrade actions taken, if any, in the school system of a non-Rebuild America community with similar characteristics?
After-Only Comparison Group Design: A less defensible variant of the Before-After Comparison Group Design eliminates the before-program measurements and simply compares the two groups at the same point after the participants have taken part in the program. Deduce the program effect by comparing the activities of the two groups.
Before-After Participant Group Time-Series Design: If you do not have a good non-participant comparison group, compare trends in participant behavior before and after the program. For example, if a pattern of little weatherization activity is seen before participation and the amount of activity suddenly jumps at about the time of program participation and continues at a higher level, conclude that the difference is due to the program. This design has less defensibility than the two described above, but costs less to implement and may be all that is feasible with the available data.
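The arithmetic behind the first two designs can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function names and the numbers (counts of efficiency-upgrade actions per site) are hypothetical, and the calculation assumes the comparison group is a fair stand-in for what participants would have done without the program.

```python
from statistics import mean


def before_after_comparison_effect(part_before, part_after, comp_before, comp_after):
    """Before-After Comparison Group Design.

    Program effect = change among participants minus change among the
    comparison group over the same period.
    """
    participant_change = mean(part_after) - mean(part_before)
    comparison_change = mean(comp_after) - mean(comp_before)
    return participant_change - comparison_change


def after_only_effect(part_after, comp_after):
    """After-Only Comparison Group Design.

    Uses only after-program measurements, so any pre-existing difference
    between the groups is (wrongly) counted as program effect.
    """
    return mean(part_after) - mean(comp_after)


# Hypothetical counts of efficiency-upgrade actions per site (illustrative only).
participants_before = [2, 3, 1]
participants_after = [6, 7, 5]
comparison_before = [2, 2, 2]
comparison_after = [3, 3, 3]

print(before_after_comparison_effect(
    participants_before, participants_after, comparison_before, comparison_after
))
print(after_only_effect(participants_after, comparison_after))
```

In this made-up data the two groups start from the same average, so both designs give the same answer. If the participants had started at a higher level than the comparison group, the after-only figure would overstate the effect, which is why that variant is described as less defensible.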
The following two non-experimental research designs are also used for impact evaluations, although they have very weak defensibility. These designs do not use valid or reliable data collection methods to account for any of the possible external influences that might have caused the observed difference. Their only advantage lies in their low cost in comparison to experimental research.
Participant Group Before-After Design: Measure participant behavior one time before (baseline) and one time after participation. Conclude that any difference is a result of the program. The possible effects of external influences may be acknowledged by hypothesizing their existence or by asking the participants whether they would have taken the action without the program (see the Participant Group Self-Report Design below).
Participant Group Self-Report Design: Measure participant behavior one time after participation and ask participants to tell you whether they would have taken the measured actions had the program not existed. Participants may also be able to tell the researcher which external influences affected their actions; however, any effort to quantify these external effects will lack credibility. Similarly, the defensibility of participant claims about what they would have done, or energy they would have used, if they had not participated is weak. Some respondents will have a tendency to give the interviewer what they think is a socially acceptable answer or an answer that will make them look good; therefore, this research design also has weak defensibility. Nevertheless, this design is widely used for energy-program impact evaluations both within and outside of the Federal Government because it is relatively inexpensive and does not depend on a pre-program baseline measurement.
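As a rough sketch of how these two non-experimental designs turn measurements into a claimed effect, consider the Python below. The function names, the survey tuples, and the all-or-nothing treatment of self-reports are assumptions made for illustration; real evaluations may weight responses differently, and the defensibility caveats above still apply.

```python
def before_after_change(baseline, follow_up):
    """Participant Group Before-After Design.

    The entire change from baseline is attributed to the program; external
    influences are not accounted for.
    """
    return follow_up - baseline


def self_report_attributed(responses):
    """Participant Group Self-Report Design.

    Each response is (measured_savings, would_have_acted_anyway). Savings are
    credited to the program only when the participant says they would NOT
    have acted without it. The all-or-nothing rule is an assumption here.
    """
    return sum(savings for savings, would_have_acted in responses
               if not would_have_acted)


# Hypothetical data (illustrative only).
annual_kwh_before = 10_000
annual_kwh_after = 8_500
survey = [(1200, False), (800, True), (500, False)]  # (kWh saved, would have acted anyway?)

print(before_after_change(annual_kwh_before, annual_kwh_after))  # change in annual kWh use
print(self_report_attributed(survey))                            # kWh credited to the program
```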
Select Data to Be Collected and a Data Collection Plan
Data collection is the process of taking measurements on the indicators that will be used to answer the evaluation's specific research questions. The program manager may choose to leave most of the decision making for data collection to an evaluation expert; however, a basic understanding of the commonly used alternatives will help the manager evaluate the recommendations offered.
What to Collect: Choice of Indicators
Indicators are the metrics, or researchable variables, for which the evaluation must collect, or develop, data. Indicators must be selected whose data will provide the answers to specific research questions needed to satisfy the evaluation's objectives. You will want to be sensitive to defensibility and cost when you select indicators.
Sources of Data
Sources of secondary data (data already collected for another purpose) may include:
- EIA energy end-use data.
- Census data.
- Energy savings coefficients, i.e., estimates of energy savings (e.g., kilowatt hours) per unit outcome (e.g., installation of an efficient motor), developed for one EERE program may have relevance to another.
If applicable secondary data are available, it is advisable to use them because secondary data will significantly reduce data-collection costs. However, two very important caveats must be considered. The data must be relevant and their transfer to the program for which they will be used must be defensible. In particular, if energy-savings coefficients or gross or net estimates are available from the evaluation of another program, the program manager must ensure that the circumstances under which they were developed and the method of developing them are appropriate to the purposes of the current program. Among other considerations, an energy-savings estimate would need to fit end-user industry and size profiles, as well as the application profile, to be credibly applied to other end-user populations and technology applications. If the secondary data do not satisfy such considerations, their use will not be defensible, and they should not be used for the current program.
Alternative Data-Collection Methods
A variety of methodological options exist for collecting data on (measuring) the indicators. They include in-person surveys, mail surveys, telephone surveys, website or email surveys, interviews, focus groups, observation, literature review, and program records and reporting.
Another data-collection choice involves whether the evaluation collects data (1) from the entire population of participants (like a census), or (2) from a sample. Either option may be used for any type of evaluation; however, like most of the other choices, the choice has implications for cost and defensibility of the results.
It will be very useful when communicating with evaluation experts to be aware of the difference between "statistical precision" and "accuracy" as used in survey-based data-collection activities. "Statistical precision," also known as "sampling error," applies to samples and consists of two parts: (1) how close (within a plus or minus interval) you want a sample estimate to be to the true population value, and (2) the probability of getting a sample whose results will lie inside the desired interval. The former is the width of the interval within which you want the true value of the variable being estimated to lie in relation to the estimated value, e.g., plus or minus 10%. The probability of getting a result that will lie inside this interval is the "confidence level" that the sample will deliver a result within this interval. Usually, "statistical precision" and "confidence level" together are specified as a "confidence interval," e.g., +/-10% with 90% confidence, or often, 90 +/-10%. If statistical results are desired for any of the specific questions, a program manager may ask the evaluation contractor to recommend the target confidence interval(s) for the findings.
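To make the relationship between precision, confidence level, and sample size concrete, here is a minimal Python sketch. It assumes simple random sampling, an estimated proportion, an absolute plus-or-minus margin, and the conservative choice p = 0.5; the function name and the population figure are hypothetical, and an evaluation contractor may recommend a different formula for the indicator at hand.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence, population=None, p=0.5):
    """Completed responses needed to estimate a proportion within +/- margin.

    margin      half-width of the interval, e.g. 0.10 for plus or minus 10%
    confidence  confidence level, e.g. 0.90
    population  optional population size for the finite population correction
    p           assumed true proportion (0.5 gives the largest sample)
    """
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: small populations need fewer responses.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size_for_proportion(0.10, 0.90))                  # large population
print(sample_size_for_proportion(0.10, 0.90, population=200))  # hypothetical 200 participants
```

Tightening either part of the target (a narrower interval or a higher confidence level) raises the required sample size, which is the cost side of the defensibility trade-off discussed above.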
"Accuracy" refers to the correspondence between the measurement made on an indicator and the true value of the indicator. Accuracy describes the exactness of the measurements made in the sample. In the sampling literature, accuracy is part of the concept of "non-sampling error."
Accuracy should be a concern when the data-measurement instruments are designed. The questionnaires and other data-collection instruments should be pre-tested before deploying them for actual evaluation measurements.
OMB Clearance to Collect Data
If the audience from which you need to collect data does not consist exclusively of federal government employees and the evaluation needs primary data from more than nine members of this audience, including potential customers, then the data collection activity will require clearance from the OMB. Federal government employees are excluded from the OMB clearance requirement only if the questions to be asked of them involve activities associated with their employment; otherwise, surveys of federal employees also require OMB clearance.
The time required to obtain OMB clearance varies:
For customer satisfaction surveys and pretests of other survey instruments, there is an expedited process that, in most cases, takes two to four weeks. The Forms Clearance staff of EIA's Statistics and Methods Group can assist EERE staff with this process.
For surveys other than customer satisfaction surveys, the OMB clearance process takes longer. Currently, the entire clearance process may require five to eight months. EERE clearance applications are submitted to the Records Management Office (IM-11) of DOE's Chief Information Officer.
An OMB clearance is valid for three years.
Data quality assurance (QA) procedures that check the reliability and accuracy of the measurements can include:
Checks that certain answers to a survey are internally consistent, e.g., if a residence does not have access to natural gas, but the respondent says the residence uses gas water heat, the responses are not consistent and should be checked or discarded
Pre-tests of survey questionnaires to verify that respondents understand the question, that respondents are interpreting questions the way you want them to, and that skip patterns are correct
Checks of survey questions to ensure that they ask only a single question
Specification of normal and expected ranges for quantitative responses to measurements so that outliers will be detected
Specification for procedures for imputing missing data within a questionnaire if missing-data imputation is proposed
Procedures for estimating whether non-response to a survey will affect the representativeness of the results
Double key-entry of manually collected data if they will be keyed into an electronic database.
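Three of the checks listed above (internal consistency, expected ranges, and double key-entry) lend themselves to simple automated tests. The Python sketch below is illustrative only: the field names (`has_gas_access`, `water_heat_fuel`, `annual_kwh`) and the range limits are hypothetical and would come from the actual questionnaire.

```python
def inconsistent_gas_answers(record):
    """Internal-consistency check: gas water heat reported without gas access."""
    return record["water_heat_fuel"] == "gas" and not record["has_gas_access"]


def out_of_range(value, low, high):
    """Range check: True when a quantitative response falls outside [low, high]."""
    return not (low <= value <= high)


def double_entry_mismatches(first_pass, second_pass):
    """Double key-entry check: fields whose two independently keyed values differ."""
    fields = first_pass.keys() | second_pass.keys()
    return sorted(f for f in fields if first_pass.get(f) != second_pass.get(f))


# Hypothetical survey record, keyed in twice by different operators.
entry_1 = {"has_gas_access": False, "water_heat_fuel": "gas", "annual_kwh": 250_000}
entry_2 = {"has_gas_access": False, "water_heat_fuel": "gas", "annual_kwh": 25_000}

print(inconsistent_gas_answers(entry_1))                  # flag for follow-up or discard
print(out_of_range(entry_1["annual_kwh"], 1_000, 60_000))  # hypothetical expected range
print(double_entry_mismatches(entry_1, entry_2))          # fields to re-check against paper
```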
Develop a Data Collection Plan
The data collection plan should specify:
- The population from which data will be collected to answer the evaluation questions
- The indicators that will be measured to answer the evaluation questions
- The data-collection method(s) that will be used to make the measurements (may depend on the proposed method of analysis)
- Whether a sample or census of the population will be used
- If a sample will be used, the target confidence interval
- Whether OMB clearance will be required and, if so, an outline of the procedures for obtaining it
- A data quality assurance (QA) plan that provides checks on the reliability and accuracy of the measurements
- Identification of data that may have value for a future evaluation and provision for archiving them
- The schedule for the data-collection task.
Analytical Methods for Answering the Evaluation Questions
Many analytic methods are available for developing findings from data. Some of the more common analytic methods used to develop evaluation results are: Case Study, Content Analysis, Meta Evaluation, Expert Judgment, Cost-Benefit Analysis, Engineering Estimation, Tabulation and Cross-Tabulation, Correlation, Regression, Differences of Means or Proportions, and Survival Analysis. Many of these methods can be used for more than one type of evaluation, and more than one of them may be used in the same evaluation analysis. For example, engineering analysis is sometimes used to create an estimate of the energy saved from installing a particular energy conservation measure. The engineering estimate is then used as a variable in a regression analysis to estimate a regression coefficient that will adjust the engineering estimate to reflect the actual savings observed by, e.g., metering. Refer to Evaluation Guide General Program and also to R&D Evaluation Methods Direction.
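The engineering-estimate example can be sketched as follows. This Python snippet assumes the simplest possible model, a single-coefficient least-squares regression through the origin of metered savings on engineering estimates; the guide does not specify the model form, and the function names and data are hypothetical.

```python
def adjustment_coefficient(engineering_estimates, metered_savings):
    """Least-squares slope through the origin: metered = b * engineering.

    The slope b is the coefficient that scales an engineering estimate to
    the savings actually observed by metering.
    """
    sxy = sum(x * y for x, y in zip(engineering_estimates, metered_savings))
    sxx = sum(x * x for x in engineering_estimates)
    return sxy / sxx


def adjusted_savings(engineering_estimate, coefficient):
    """Apply the regression coefficient to a new engineering estimate."""
    return engineering_estimate * coefficient


# Hypothetical kWh savings for three metered sites (illustrative only).
engineering = [1000, 2000, 3000]
metered = [800, 1700, 2300]

b = adjustment_coefficient(engineering, metered)
print(round(b, 3))                         # share of estimated savings realized
print(round(adjusted_savings(1500, b), 1))  # adjusted estimate for an unmetered site
```

A coefficient below 1 would indicate that the engineering estimates overstate metered savings on average; a fuller analysis would normally add other explanatory variables and report the coefficient's confidence interval.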
Step 7: Identify Report Contents and Establish QA
Before the evaluation begins, specify the types of information that the evaluation report must provide. If high-level decision-makers will read the report, it may also be useful to specify the expected outline.
- Consider topics or themes related to the evaluation that would be of interest to the audience receiving the report.
- If different decision-makers need different information from the evaluation, consider writing multiple reports.
The evaluation report should include:
Answers to all of the questions specified for the evaluation
Recommended improvements to the program, if relevant (indicate which are high priority compared to others)
A description of the research design, assumptions, how the data were collected, and the analysis methods. These descriptions should be brief in the main report where the focus is on the answers to the evaluation questions and recommendations. Put the comprehensive explanations in an appendix.
Recommended improvements to the evaluation process including any lessons learned about data collection and analysis methods that might aid future evaluations. These can be based on the evaluation contractor's experience and observations during the evaluation process.
Step 8: Establish Quality Assurance Review Process
This step is essential to ensure that the evaluation is defensible, given the resources that were available for it. It specifies how the data collection, analysis, and reporting activities will themselves be evaluated.
Establish a standing peer review panel of two to five independent outside experts who provide an oral and/or written review of the Evaluation Plan before the evaluation is conducted and who reconvene to review the Draft Evaluation Report.
Identify an ad hoc panel of external evaluation experts to review and provide written comments only on the Evaluation Plan, the Draft Evaluation Report, and sometimes samples of the research data and the Final Evaluation Reports.
Judgments of the strength of the evaluation are largely subjective; they depend on the reviewer's own training and experience. The objectivity of the process can be aided by creating a list of specific "criteria" that the reviewers must address. The following list includes criteria that have been proposed for peer reviews by other organizations:
The research questions are well formulated and relevant to the objectives of the evaluation.
The indicators are credible as measures of outputs and outcomes being evaluated.
The research design has validity.
For statistical methods, the degree of relationship between indicators, tests of significance, and confidence intervals (statistical precision) for sample estimates were built into the analysis and applied wherever possible.
The research demonstrates understanding of previous related studies.
The data collection and analysis methods are credible.
The data and assumptions about the research design are sound.
All planned data were collected, or if some values are missing, the report explains how they were treated.
If missing data values were inferred, the inference method was appropriate.
If a survey was conducted, non-response is accounted for.
The data collection methods were actually implemented as planned, or if revisions were required by circumstances, they were appropriate and the reasons for the revisions are documented.
Collected data are provided and their layout documented.
The analysis methods were actually implemented as planned, or if revisions were required by circumstances, they were appropriate and the reasons for the revisions are documented.
The documentation of the methodology is accurate, understandable, and temperate in tone.
The report outline draft is appropriate and likely to present the study findings and recommendations well, and to provide documentation of methods used.
The draft findings and recommendations in the evaluation report follow logically from the research results and are explained thoroughly.
The report presents answers to all of the questions asked.
The quality assurance procedures should be included in the Evaluation Plan.