Protocol - Auditory-Perceptual Evaluation of Voice

The Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) indicates salient perceptual vocal attributes; the attributes are: (a) Overall Severity, (b) Roughness, (c) Breathiness, (d) Strain, (e) Pitch, and (f) Loudness. The CAPE-V displays each attribute accompanied by a 100-mm line forming a visual analog scale (VAS). Using a tick mark, the clinician indicates the degree of perceived deviance from normal for each parameter on this scale. For each dimension, scalar extremes are unlabeled.

Judgments may be assisted by referring to general regions indicated below each scale on the CAPE-V: “MI” refers to “mildly deviant,” “MO” refers to “moderately deviant,” and “SE” refers to “severely deviant.” A key issue is that the regions indicate gradations in severity rather than discrete points. The clinician may place tick marks at any location along the line. Ratings are based on the clinician’s direct observations of the patient’s performance during the evaluation rather than on patient report or other sources.

Specific Instructions

The individual should be seated comfortably in a quiet environment. The clinician should audio-record the individual’s performance on three tasks: vowels, sentences, and conversational speech. Standard recording procedures should be used, with a condenser microphone placed at an azimuth of 45° from the front of the mouth and at a microphone-to-mouth distance of 4 cm. It is recommended that audio be recorded onto a computer with a minimum of 16 bits of resolution and a signal-sampling rate of no less than 20 kHz. This protocol applies to pediatric through adult ages; the requirement for participation is the ability to follow instructions and to read or repeat stimuli to produce voicing.
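The recording recommendations above (at least 16 bits of resolution and a sampling rate of no less than 20 kHz) can be checked automatically before a file is archived. The following is a minimal sketch, assuming the session was saved as an uncompressed WAV file; the function name check_recording and the constant names are hypothetical and are not part of the CAPE-V protocol.

```python
import wave

MIN_BITS = 16          # minimum resolution stated in the protocol
MIN_RATE_HZ = 20_000   # minimum sampling rate stated in the protocol


def check_recording(path):
    """Return a list of problems found in a WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        bits = wav.getsampwidth() * 8   # sample width is reported in bytes
        rate = wav.getframerate()       # frames (samples per channel) per second

    problems = []
    if bits < MIN_BITS:
        problems.append(f"resolution is {bits} bits; need at least {MIN_BITS}")
    if rate < MIN_RATE_HZ:
        problems.append(f"sampling rate is {rate} Hz; need at least {MIN_RATE_HZ}")
    return problems
```

An empty list means the file meets both recommendations. The check covers only the file format; microphone type, angle, and distance still have to be verified by the clinician.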


Link to CAPE-V Form

Refer to CAPE-V form as you administer the following steps.

Task 1: Sustained vowels. Two vowels were selected for this task. One is considered a lax vowel (/a/) and the other tense (/i/). In addition, the vowel /i/ is the sustained vowel used during videostroboscopy. Thus, the use of this vowel during this task offers an auditory comparison to that produced during a stroboscopic exam.

The clinician should say to the individual, “The first task is to say the sound, /a/. Hold it as steady as you can, in your typical voice, until I ask you to stop.” (The clinician may provide a model of this task, if necessary.) The individual performs this task three times for 3 to 5 seconds each time. “Next, say the sound, /i/. Hold it as steady as you can, in your typical voice, until I ask you to stop.” The individual performs this task three times for 3 to 5 seconds each time.

Task 2: Sentences. Six sentences were designed to elicit various laryngeal behaviors and clinical signs. The first sentence provides production of every vowel sound in the English language, the second sentence emphasizes easy onset with the /h/, the third sentence is all voiced, the fourth sentence elicits hard glottal attack, the fifth sentence incorporates nasal sounds, and the final sentence is weighted with voiceless plosive sounds.

The clinician should give the person being evaluated flash cards, which progressively show the target sentences (see below) one at a time.

The clinician says, “Please read the following sentences one at a time, as if you were speaking to somebody in a real conversation.” (Individual performs task, producing one exemplar of each sentence.) If the individual has difficulty reading, the clinician may ask him or her to repeat sentences after verbal examples. This should be noted on the CAPE-V form. The sentences are: (a) The blue spot is on the key again; (b) How hard did he hit him? (c) We were away a year ago; (d) We eat eggs every Easter; (e) My mama makes lemon jam; and (f) Peter will keep at the peak.

Task 3: Running speech. The clinician should elicit at least 20 seconds of natural conversational speech using standard interview questions such as “Tell me about your voice problem” or “Tell me how your voice is functioning.”

Data Scoring

Although the PDF scale is accurate, printer configurations vary. Please verify that your paper copy has accurate 100-mm lines before reproducing the CAPE-V form.

The clinician should have the individual perform all voice tasks (vowel prolongation, sentence production, and running speech) before completing the CAPE-V form. If performance is uniform across all tasks, the clinician should mark a single rating on each scale, indicating overall performance. If the clinician notes a discrepancy in performance across tasks, he or she should rate performance on each task separately, on the line for that attribute. Only one CAPE-V form is used per individual being evaluated.

In the case of discrepancies across tasks, tick marks should be labeled with the task number. Tick marks reflecting vowel prolongation (Task 1) should be labeled #1 (see form). Tick marks reflecting sentence reading (Task 2) should be labeled #2. Tick marks reflecting spontaneous running speech (Task 3) should be labeled #3. In the rare event that the clinician perceives discrepancies within a task type (e.g., /a/ vs. /i/), he or she may further label the ratings accordingly, such as 1/a/ versus 1/i/ to reflect the different vowels, or 2(a), 2(b), 2(c), 2(d), 2(e), or 2(f) for the different sentences. Unlabeled tick marks indicate uniform performance. See examples below.

(Note: Using labels to indicate discrepancies/variation across tasks in the severity of an attribute is different from indicating that an attribute is displayed intermittently [I]. If an attribute is judged to have equal severity whenever it appears, but it is not present all the time, “I” should be circled to indicate that the attribute is intermittent, and no additional labeling needs to be done.)
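For electronic record keeping, the labeling convention described above can be captured in a small data structure. This is only an illustrative sketch: the class ScaleRating, its field names, and the use of an empty string for an unlabeled tick mark are assumptions made here, not part of the CAPE-V form. The rule that an unlabeled mark cannot be mixed with labeled marks on the same line is likewise an assumption drawn from the statement that unlabeled tick marks indicate uniform performance.

```python
from dataclasses import dataclass, field

# The six attributes rated on the CAPE-V form.
ATTRIBUTES = ("Overall Severity", "Roughness", "Breathiness",
              "Strain", "Pitch", "Loudness")


@dataclass
class ScaleRating:
    attribute: str
    # Maps a tick-mark label ("1", "2", "3", "1/a/", "2(b)", ...) to its
    # position in mm from the left end of the line.
    # Assumption: the key "" stands for an unlabeled tick mark.
    marks: dict = field(default_factory=dict)
    intermittent: bool = False  # True when "I" is circled on the form

    def __post_init__(self):
        if self.attribute not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {self.attribute}")
        for label, mm in self.marks.items():
            if not 0 <= mm <= 100:
                raise ValueError(f"mark {label!r} is off the 100-mm line")
        # Assumption: an unlabeled mark means uniform performance, so it
        # should be the only mark on the line.
        if "" in self.marks and len(self.marks) > 1:
            raise ValueError("do not mix unlabeled and labeled tick marks")

    def is_uniform(self):
        """True when the line carries a single unlabeled tick mark."""
        return set(self.marks) == {""}


# Strain differed between vowels (#1) and spontaneous speech (#3).
strain = ScaleRating("Strain", {"1": 40, "3": 73})
# Roughness was equally severe whenever present, but only intermittently.
roughness = ScaleRating("Roughness", {"": 25}, intermittent=True)
```

Keeping the intermittent flag separate from the task labels mirrors the note above: task labels record differences in severity, whereas the circled "I" records that an attribute is not always present.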

After the clinician has completed all ratings, he or she should measure ratings from each scale. To do so, he or she should physically measure the distance in millimeters from the left end of the scale. The millimeters score should be written in the blank space to the far right of the scale, thereby relating the results in a proportion to the total 100-mm length of the line. The results can be reported in two possible ways. First, results can indicate distance in millimeters to describe the degree of deviancy; for example, “73/100” on “strain.” Second, results can be reported using descriptive labels that are typically employed clinically to indicate the general amount of deviancy; for example, “moderate-to-severe” on “strain.”
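The two reporting styles can be combined in a single line of text. The sketch below assumes the clinician supplies both the measured distance and the descriptive label; the function name report_rating is hypothetical. No numeric cut-offs for "mild," "moderate," or "severe" are given in this protocol, so the code does not derive the label from the measurement.

```python
def report_rating(attribute, distance_mm, descriptor=None):
    """Format one scale result as 'Attribute: NN/100 (descriptor)'."""
    if not 0 <= distance_mm <= 100:
        raise ValueError("distance must lie on the 100-mm line")
    # Distance is reported relative to the full 100-mm length of the line.
    text = f"{attribute}: {round(distance_mm)}/100"
    if descriptor:
        # The descriptive label is the clinician's judgment, passed in as-is.
        text += f" ({descriptor})"
    return text


print(report_rating("Strain", 73, "moderate-to-severe"))
# prints "Strain: 73/100 (moderate-to-severe)"
```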

We strongly suggest using both forms of reporting. It is strongly recommended that for all rating sessions following the initial one, the clinician have a paper or electronic copy of the previous CAPE-V ratings available for comparison purposes. He or she should also rate subsequent examinations based on direct comparisons between earlier and current audio recordings. Such an approach should optimize the internal consistency/reliability of repeated sequential ratings within a patient, particularly for purposes of assessing treatment outcomes. Although it is difficult, clinicians are encouraged to make every effort to minimize bias in all ratings.



Personnel and Training Required

The protocol should be administered by an individual trained to evaluate speech disorders. A clinician or specialist is required to assess abnormal voice function.

Equipment Needs

Recording the session for later comparison requires a condenser microphone, a pre-amplifier, and a laptop computer.

Requirement Category                                                 Required
Major equipment                                                      No
Specialized training                                                 Yes
Specialized requirements for biospecimen collection                  No
Average time of greater than 15 minutes in an unaffected individual  No

Mode of Administration

Clinical Examination


Child, Adolescent, Adult


All age groups (with the minimum requirement of being able to follow instructions and cooperate)

Selection Rationale

The Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol was developed as a result of a consensus expert meeting in 2002, which was sponsored by the American Speech-Language-Hearing Association (ASHA), and has been used by clinicians and researchers since that time to evaluate voice characteristics.



caDSR Common Data Elements (CDE): Speech Consensus Auditory-Perceptual Evaluation of Voice Assessment Score (6773751)

Derived Variables


Process and Review

The Expert Review Panel #7 (ERP 7) reviewed the measures in the Speech and Hearing domain.

Guidance from ERP 7 includes the following:

  • Added a new measure
  • Created a new Data Dictionary

Protocol Name from Source

Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V)


Kempster, G. B., Gerratt, B. R., Verdolini Abbott, K., Barkmeier-Kraemer, J., & Hillman, R. E. (2009). Consensus auditory-perceptual evaluation of voice: Development of a standardized clinical protocol. American Journal of Speech-Language Pathology, 18(2), 124–132.

Posted with permission from the American Speech-Language-Hearing Association (ASHA).

General References

Clinical Research Examples

Protocol ID


Speech, Language and Hearing
Measure Name

Auditory-Perceptual Evaluation of Voice

Release Date

June 4, 2019


An assessment of voice quality based on observations of the auditory and perceptual features of an individual.   


Clinicians and researchers can utilize this standardized tool and visual analog scale to assess the voice quality of an individual.


Consensus Auditory-Perceptual Evaluation of Voice, CAPE-V, voice, speech and hearing

