
Assessment for Counselors



Bradley T. Erford 






Digitized by the Internet Archive 
in 2012 



http://www.archive.org/details/assessmentforcouOObrad 




Assessment 
for Counselors 



BRADLEY T. ERFORD 

Loyola College in Maryland 



BROOKS/COLE

CENGAGE Learning



Australia • Brazil • Japan • Korea • Mexico • Singapore • Spain • United Kingdom • United States 



Dedication 

This effort is dedicated to The One: the Giver of energy, passion, and 
understanding; who makes life worth living and endeavors worth 
pursuing and accomplishing; the Teacher of love and forgiveness. 






Assessment for Counselors 
Bradley T. Erford 

Publisher: Barry Fetterolf 

Senior Editor: Mary Falcon 

Editorial Assistant: Evangeline Bermas 

Senior Project Editor: Kimberly Gavrilles 

Art and Design Manager: Gary Crespo 

Composition Buyer: Chuck Dutton 

Associate Manufacturing Buyer: Brian Pieragostini

Director of Sales and Marketing: Heather Murray

Cover image © Mark Stephen/theispot.com



© 2007 Brooks/Cole, Cengage Learning 

ALL RIGHTS RESERVED. No part of this work covered by the copyright 
herein may be reproduced, transmitted, stored, or used in any form or by 
any means graphic, electronic, or mechanical, including but not limited to 
photocopying, recording, scanning, digitizing, taping, Web distribution, 
information networks, or information storage and retrieval systems, 
except as permitted under Section 107 or 108 of the 1976 United States 
Copyright Act, without the prior written permission of the publisher. 



For product information and technology assistance, contact us at 
Cengage Learning Customer & Sales Support, 1-800-354-9706. 

For permission to use material from this text or product, submit
all requests online at www.cengage.com/permissions.

Further permissions questions can be emailed to
permissionrequest@cengage.com.



Library of Congress Control Number: 2006923762 

ISBN-13: 978-0-618-49291-6 
ISBN-10: 0-618-49291-7 

Brooks/Cole 

20 Davis Drive 
Belmont, CA 94002-3098 
USA 

Cengage Learning is a leading provider of customized learning solutions 
with office locations around the globe, including Singapore, the United 
Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at: 
www.cengage.com/global. 

Cengage Learning products are represented in Canada by Nelson 
Education, Ltd. 

To learn more about Brooks/Cole, visit www.cengage.com/brookscole. 

Purchase any of our products at your local college store or at our preferred 
online store www.cengagebrain.com. 



Printed in the United States of America 
5 6 7 8 9 10 13 12 11



CONTENTS 



PART I



Preface xiii 
Acknowledgments xiv 
About the Authors xv 



Chapter 1 Basic Assessment Concepts Bradley T. Erford 1 

Assessment and Counseling 1 

What Is Assessment? 2 

The Purpose of Assessment 5 

How Is Assessment Used in Counseling? 8 

Assessment Competence and Professional Counselors 9 

Training Standards for Professional Counselors 10
Professional Counselor Organizations and Assessment 10 
Assessment Training Standards 12

Assessment Terms and Concepts 21 

Standardized (Formal) and Nonstandardized (Informal) Tests 21 

Norm-Referenced and Criterion-Referenced Tests 22 

Individual and Group Tests and Inventories 23 

Objective and Subjective Tests 23 

Speed and Power Tests 23 

Verbal and Nonverbal Tests 24 

Cognitive and Affective Tests 26 

Maximum and Typical Performance Measurement 27 

Behavioral Observations 28 

Basals, Starting Points, and Ceilings 28 

Reliability 32 

Validity 33 

Formative Versus Summative Evaluation 34 

Pencil-and-Paper Tests and Performance (Authentic) Assessment 34

Portfolio Assessment 36 

Environmental Assessment 38 

Computer-Managed, Assisted, and Adapted Assessment 38 

Summary/Conclusion 42 

Key Terms 42 




Chapter 2 Foundations of Assessment: 

Historical, Legal, Ethical, and Diversity Perspectives 

Bradley T. Erford, Cheryl Moore-Thomas, and Lynn Linde 45 

The History of Assessment 45 

Ancient Times 48 
Measurement in the Laboratory 49 

Modern Clinical Applications of Assessment: Decision Making 
and Determination of Individual Differences 50 

Public and Professional Concerns About Assessment 62 

Decisions About People's Lives Should Not Be Made on the Basis
of a Single High-Stakes Test Score 64
Tests Are Biased and Unfair to Minorities and Women 64 
Tests Create Anxiety and Stress 65 
Tests Label and Categorize 65 

Test Developers Dictate What Students Must Know or Learn 66 
"Teaching to the Test" Inflates Scores 67 
Multiple-Choice Questions Punish Intelligent, Creative Thinkers;
Trivialize the Complexities of the Learning Process;
and Reward Good Guessers 67
Learning From Past Mistakes and Criticisms 68 

Ethics and Assessment 69 

Ethical Decision Making 78 

Legal Issues in Assessment 80 

The Family Educational Rights and Privacy Act of 1974 (FERPA)
and Related Legislation 81
Minimal Competency Assessment and the No Child Left Behind Act
of 2001 83
The Individuals With Disabilities Education Improvement Act
of 2004 (IDEIA) and Related Legislation 84
The Health Insurance Portability and Accountability Act
of 1996 (HIPAA) 86
Guidelines of the Equal Employment Opportunity Commission 

(EEOC) 87 
The Americans With Disabilities Act of 1990 (ADA) 88
Court Decisions Related to Diversity in Assessment 88 

Diversity Issues in Assessment 90 

Understanding Diversity 90 

Standards for Multicultural Assessment 91 

Diversity Factors Involved in Assessment 91 

Bias in Assessment 94 

Content Bias 94 




Internal Structure Bias 95 

Predictive Bias 95 

Interpreting Test Scores With Caution 95 

Ensuring Fairness in Assessment 96 

Summary/Conclusion 97 

Key Terms 97 

Chapter 3 Reliability Dimiter Dimitrov 99 

What Is Reliability? 99 

The Classical Model of Reliability 101 

True Score 101 

The Classical Definition of Reliability 102 

Standard Error of Measurement (SEM) 102 

Types of Reliability 105 

Internal Consistency 105 

Test-Retest Reliability 108

Alternate Forms Reliability (Equivalent Forms Reliability) 109 

Reliability of Criterion-Referenced Tests 110 

Interscorer and Interrater Reliability 113 

The Importance of Reliability 114 

Reliability in Validation 114 

Attenuation 114 

Reliability of Composite Scores 116 

Reliability of Sum of Scores 116 
Reliability of Difference Scores 118 
Reliability of Weighted Sums 119 

Summary/Conclusion 120 

Key Terms 121 

Chapter 4 Validity Alan Basham and Bradley T. Erford 123 

Validity Defined 123 

Face Validity 124 

Content-Related Validity 125 

Criterion-Related Validity 126 

Standard Error of Estimate 128 

Construct Validity 131 






The Interaction of Reliability and Validity 133 

Validity and Testing Practice 133 

The Application of Validity: Decision Making 

Using Test Scores 134 

Decision Making Using a Single Score 134
Decision Making Using Multiple Tests 141

Summary/Conclusion 157 

Key Terms 157

Chapter 5 Selecting, Administering, Scoring, and Interpreting 
Assessment Instruments and Techniques 

R. Anthony Doggett, Carl J. Sheperis, Susan Eaves, 

Michael D. Mong, and Bradley T. Erford 159 

Test Selection 159

Test Administration 160 

Administrator Requirements 160 
Examinee Preparation 162 
Environmental Concerns 163 
Testing Procedures 163 
Factors Affecting Test Scores 164 

Test Scoring 165 

Professional Standards in Testing 166 

Norm-Referenced Interpretation 168 

Developmental Equivalents 168 

Scores of Relative Standing 170 

Percentile Ranks 172 

Applying Standard Error of Measurement (SEM) to Test Scores 173 

Criterion-Referenced Interpretation 180 

Single-Skill Scores 180 
Multiple-Skill Scores 180 

Sources of Information About Tests 181 

Published Resources 182 

PRO-ED 183 

Publisher Catalogs 184 

Professional Journals and Textbooks 184 

Electronic Resources 184

Common Errors 185 

Summary/Conclusion 187 

Key Terms 188 




Chapter 6 How Tests Are Constructed 

Carl J. Sheperis, Carey Davis, and R. Anthony Doggett 189 

Purpose of the Test 190 

Examinees 191

Goals and Theory 191 

Norm Referenced or Criterion Referenced 191 

Objectives 192

Scaling 192 

Approaches to Test Construction 194

A Test Development Example 194

Observables 196 

Defining Observables 197 

An Example of Observables 198 

Item Generation 198 

Allocating Proportionate Numbers of Items 199 
Selecting an Item Format 199 
Descriptions of Item Formats 199 
An Example of Item Generation 201

Technical Analyses 201 

Item Difficulty 202 
Item Discrimination 203 
Norms 204 

Summary/Conclusion 204 

Key Terms 206 



PART II 



Chapter 7 Clinical Assessment Bradley T. Erford, Carol Salisbury, 
Kathleen McNinch, Carl J. Sheperis, R. Anthony Doggett, 
and Masanori Ota 207

What Is Clinical Assessment? 207 

Cautions Within Clinical Assessment 209 

Clinical Judgment Versus Statistical Models 213 

Clinical Interviewing 214 

Three Types of Interviews: Unstructured, Semi-Structured, 

and Structured 214 
The Intake Interview 216 
Mental Status Exam 217 
Strengths and Limitations of Interviewing 219






Counseling, Diagnosis, and the DSM-IV-TR 221 

Using the DSM-IV-TR — Multiaxial Diagnosis 223

Axis I Disorders — Clinical Disorders and Other Conditions That May Be 

a Focus of Clinical Attention 226 
Axis II Disorders — Personality Disorders and Mental Retardation 229 
Axis III — Current Medical Conditions 229 
Axis IV — Psychosocial and Environmental Problems 230 
Axis V — Global Assessment of Functioning (GAF) 230 
Diagnostic Decision Making Using the DSM-IV-TR 231

Using Clinical Inventories and Tests in Counseling 234 

Information Sources for Clinical and Personality Assessment 234 
How Clinical and Personality Test Content Is Developed 235 

Some Commonly Used Clinical Assessment Inventories 237 

Minnesota Multiphasic Personality Inventory — Second Edition 

(MMPI-2) 237 
Minnesota Multiphasic Personality Inventory — Adolescent (MMPI-A) 241
Millon Clinical Multiaxial Inventory — III (MCMI-III) 246 
Millon Adolescent Clinical Inventory (MACI) 253 
Achenbach System of Empirically Based Assessment (ASEBA) 254 
Personality Inventory for Children — Second Edition (PIC-2) 257 
Devereux Scales of Mental Disorders (DSMD) 258 
Children's Depression Inventory (CDI) 258 

Reynolds Adolescent Depression Scale — Second Edition (RADS-2) 259 
Symptom Checklist-90-Revised (SCL-90-R) 260
Beck Depression Inventory — Second Edition (BDI-II) 260 
Beck Anxiety Inventory (BAI) 261
Beck Scale for Suicide Ideation (BSSI) 262 
Substance Abuse Subtle Screening Inventory — 3 (SASSI-3) 263 
Eating Disorder Inventory — 3 (EDI-3) 264 

Summary/Conclusion 265 

Key Terms 265 



Chapter 8 Personality Assessment 

Bradley T. Erford, Kathleen McNinch, and Carol Salisbury 267 

What Is Personality? 267 

The Purpose of Personality Assessment 268 

Trait Approaches to Personality Assessment 269 

Strengths and Limitations of the Trait Approach 271 

Some Commonly Used Structured Personality 

Assessment Inventories 273 




Revised NEO Personality Inventory (NEO-PI-R) 273 
16 Personality Factors (16PF) Questionnaire 275
Myers-Briggs Type Indicator — Form M (MBTI) 279 
Millon Index of Personality Styles Revised (MIPS Revised) 281
Personality Assessment Inventory (PAI) 281
California Psychological Inventory (CPI) 282 
Jackson Personality Inventory — Revised (JPI-R) 283 
Piers-Harris Children's Self Concept Scale — Second Edition 

(Piers-Harris-2) 286 
Coopersmith Self Esteem Inventories 287 
Tennessee Self Concept Scale — Second Edition (TSCS-2) 287

Projective Approaches to Assessment 288 

Strengths and Weaknesses of Projective Techniques 295 

Some Commonly Used Projective Techniques 296 

Rorschach Inkblot Test 296 

Thematic Apperception Test (TAT) 297

Children's Apperception Test — 1991 Revision (CAT) 297

Roberts Apperception Test for Children — Second Edition (Roberts-2) 298

House-Tree-Person (H-T-P) Projective Drawing Technique 298

Kinetic Drawing System for Family and School (KDS) 300

Forer Structured Sentence Completion Test (FSSCT) 300

Summary/Conclusion 302 

Key Terms 302 

Chapter 9 Behavioral Assessment Carl J. Sheperis, R. Anthony Doggett,

Masanori Ota, Bradley T. Erford, and Carol Salisbury 303 

What Is Behavioral Assessment? 303 

Defining Behavior 304 

Guidelines for Conducting Behavioral Assessment 305 

Methods of Behavioral Assessment 306 

Direct Assessment 306 
Indirect Assessment 309 

Behavioral Rating Scales and Inventories Used in Counseling 311 

Conners' Rating Scales — Revised (CRS-R) 311

Attention Deficit Disorders Evaluation Scale — Third Edition (ADDES-3) 312

Behavior Assessment System for Children (BASC) 313

Disruptive Behavior Rating Scale (DBRS) 314

Coping Inventory for Stressful Situations (CISS) 315

Summary/Conclusion 317 

Key Terms 317 






Chapter 10 Assessment of Intelligence 

Bradley T. Erford, Lauren Klein, and Kathleen McNinch 319 

What Is Intelligence? 319 

Nature and Theories of Intelligence 321 

Historical Conceptualizations of Intelligence 321 

Multiple-Factor Models 325 

Guilford's Structure-of-Intellect Model 327 

Hierarchical Models 328 

Sternberg's Triarchic Theory: An Information Processing Approach 329 

Gardner's Multiple Intelligences 330 

Some Final Thoughts on the (Practical) Nature of Intelligence 334 

Commonly Used Tests of Intelligence 335 

Group-Administered Tests of Intelligence and School Ability 335 
Individual Screening Tests of Intelligence 338 
Individual Diagnostic Tests of Intelligence 340 

Assessing Mental Retardation 350 

Assessing Giftedness 352 

Summary/Conclusion 354 

Key Terms 354 

Chapter 11 Assessment of Other Aptitudes

Bradley T. Erford and Kathleen McNinch 357 

Aptitude Tests Designed for Admission Decisions 358 

Commonly Used Admission Tests 359 

Tests of General and Specific Aptitude 366 

Multiaptitude Batteries 366 
Measures of Special Abilities 375 

Summary/Conclusion 384 

Key Terms 384 

Chapter 12 Assessment of Achievement 

Bradley T. Erford and Kathleen Hall 385 

Why Assess Achievement? 385 

Uses of Achievement Tests in Counseling 387 

Achievement Testing and Individuals With Special Needs 388 




The Individuals With Disabilities Education Improvement Act 

(IDEIA) 388 
Section 504 of the U.S. Rehabilitation Act of 1973 392 

Categorizing Achievement Tests 393 

Group-Administered Multi-Skill Achievement Test Batteries 395 
Individual Achievement Multi-Skill Test Batteries 406 
Individual and Group-Administered Single-Skill Achievement Tests 

for Reading 416 
Individual and Group-Administered Single-Skill Achievement Tests 

for Mathematics 422 
Individual and Group-Administered Single-Skill Achievement Tests 

for Written Expression 424 
Tests of English Language Proficiency 429 

Summary/Conclusion 432 

Key Terms 432 

Chapter 13 Assessment in Career Counseling

Deborah Newsome, Bradley T. Erford, and Kathleen McNinch 435 

Purposes of Career Assessment 435 

Assessing Interests 437 

Tests Measuring Interests 440 

Other Interest and Skill Inventories 454 

Assessing Values and Life Role Salience 456 

Commonly Used Tests Assessing Values and Life Role Salience 457 
Other Measures of Career Values and Life Role Salience 458 

Assessing Career Development and Career Maturity 460 

Tests Used to Assess Career Development and Career Maturity 461 

Summary/Conclusion 463 

Key Terms 463 

Chapter 14 Assessing Couples and Families 

Debbie W. Newsome, Jon-Michael Brasfield, and Catherine Flemming 465 

Purposes of Couple and Family Counseling 465 

Rationale for Family Assessment 466 
What Is Assessed? 467 
Methods of Assessment 470 

Formalized Assessment Instruments 470 






Assessment of Couples 471 

Other Instruments Used in Assessing Couples 481 

Assessment of Families 482 

Other Measures of Family Assessment 489 

Qualitative Assessment of Family Relationships 490 

Characteristics of Qualitative Assessment 490 
Qualitative Assessment Methods 491
Mapping Activities 493 
Sculpting Activities 498 
Other Qualitative Methods 500 

Summary/Conclusion 501 

Key Terms 501 



Appendix Responsibilities of Users of Standardized Tests 
(RUST) (3rd Edition) Association for Assessment
in Counseling (AAC) 502



References 509 

Name Index 554 

Subject Index 560 






PREFACE 



Assessment is counseling and counseling is assessment! The evolving profession of
counseling has entered the age of accountability, regardless of specialization or
practice venue. Managed care and school reform have become important forces driving
decision making in contemporary society. Given this context, the more a professional
counselor knows about formal and informal assessment procedures, the more
informed, effective, and efficient the professional counselor's treatment of clients and
students can be.

A second driving force comes from within the counseling profession itself. After
many years of identity exploration and discussion, the counseling profession has
agreed to a basic core of education and training standards that all professional
counselors should meet. This book is designed to address the core curricular assessment
requirements of the Council for Accreditation of Counseling and Related
Educational Programs (CACREP), thereby providing state-of-the-art information
on assessment and tests that professional counselors need to know. But what makes
Assessment for Counselors different from other books is that it is written by
professional counselors for professional counselors.

The first half of Assessment for Counselors provides important general information
about assessment, including basic concepts, historical developments, ethical and
legal implications, diversity issues, reliability, validity, test construction, and the
selection, administration, scoring, and interpretation of assessment instruments. The
second half of this book provides in-depth explorations of the major areas of assessment
that professional counselors either provide or of which they must be aware.
Embedded within these domains of counseling specialty, this text includes reviews of
more than 100 commonly used tests in the areas of clinical, personality, behavioral,
intelligence, aptitude, achievement, career, and couples and family assessment. In
short, Assessment for Counselors is the most comprehensive introductory assessment
text ever written specifically for professional counselors.






ACKNOWLEDGMENTS 



The editor would like to thank Kami McNinch, Lauren Klein, Katie Hall, and
Megan Earl for their tireless assistance in the preparation of the original manuscript.
All of the contributing authors are to be commended for lending their expertise in
the various topical areas or on the various tests reviewed in this volume. As always,
Barry Fetterolf, publisher, and Mary Falcon, senior editor of Lahaska Press, have
been wonderfully responsive and supportive. Finally, special thanks go to three
outside accuracy reviewers who carefully scrutinized the entire manuscript and whose
comments led to substantive improvement in the final product: Gerald Chandler,
University of Central Oklahoma; Darcy Haag Granello, The Ohio State University;
and Joshua C. Watson, Mississippi State University, Meridian.






ABOUT THE AUTHORS 



THE EDITOR 



Bradley T. Erford, Ph.D., is director of the School Counseling Program and a
professor in the Education Department at Loyola College in Maryland. He is the
recipient of the American Counseling Association's (ACA) Professional Development
Award, ACA Research Award, and the ACA Carl Perkins Government Relations
Award, and is an ACA Fellow. He has received the Association for Counselor
Education and Supervision's Robert O. Stripling Award for Excellence in Standards,
the Association for Assessment in Counseling and Education/Measurement and
Evaluation in Counseling and Development Research Award, and the Maryland
Association for Counseling and Development's Maryland Counselor of the Year,
Professional Development, Counselor Visibility, and Counselor Advocacy Awards.
His research specialization is primarily in development and technical analysis of
psychoeducational tests and has resulted in the publication of numerous books, journal
articles, book chapters, and psychoeducational tests.

He is past chair of the American Counseling Association-Southern (U.S.)
Region; past president of the Association for Assessment in Counseling and
Education; past president of the Maryland Association for Counseling and
Development; past president of the Maryland Association for Counselor Education
and Supervision; past president of the Maryland Association for Mental Health
Counselors; and president of the Maryland Association for Measurement and
Evaluation. Dr. Erford is the past chair of ACA's Task Force on High Stakes Testing;
past chair of ACA's Task Force on Standards for Test Users; past chair of ACA's Public
Awareness and Support Committee; and past chair of ACA's Interprofessional
Committee. Dr. Erford is a licensed clinical professional counselor, licensed
professional counselor, nationally certified counselor, licensed psychologist, and licensed
school psychologist. He teaches courses primarily in the areas of assessment, human
development, school counseling, and stress management.



THE CONTRIBUTING AUTHORS 



Alan Basham, M.A., is a counselor educator at Eastern Washington University, 
where he teaches (among other subjects) advanced appraisal for CACREP programs 
in school counseling and mental health counseling. He is past president of the 
Association for Spiritual, Ethical and Religious Values in Counseling and of the 
Washington Counseling Association. He drafted ACA's Code of Leadership and
served on the task forces that wrote ACA's position papers on test user qualifications
and high-stakes testing. He also provides leadership and teamwork training for 
Washington State's Critical Incident Management teams. He lives near, and often 
roams with his dog Chinook through, the woods surrounding the Spokane River. 

Jon-Michael Brasfield, M.A., NCC, is a recent graduate of Wake Forest
University's counseling program. He is a professional school counselor at R.J.
Reynolds High School in Winston-Salem, North Carolina. Jon plans to pursue
further training in educational research methods and statistics in the near future.

Carey Davis is obtaining her educational specialist degree in school psychology 
from Mississippi State University. Her areas of interest include academic assessment 
and intervention and group contingencies. 

Dimiter Dimitrov has a Ph.D. degree in mathematics education from the
University of Sofia, Bulgaria, and a Ph.D. degree in educational psychology from
Southern Illinois University, Carbondale. Currently, he is an associate professor of
educational measurement and statistics in the Graduate School of Education at
George Mason University, Fairfax, Virginia. He is also editor of the professional
journal Measurement and Evaluation in Counseling and Development. Dr. Dimitrov's areas
of expertise and teaching experience include classical and modern measurement
theory, generalizability theory, and quantitative research methods. His recent research
interests focus on validations of cognitive operations and processes using tools of
item response theory and structural equation modeling, and on latent trait
modeling for measurement of change.

R. Anthony Doggett, Ph.D., is an assistant professor in the school psychology
program at Mississippi State University. Dr. Doggett received his doctorate in school
psychology from the University of Southern Mississippi. He completed a predoctoral
internship and a postdoctoral fellowship in behavioral pediatrics at the
Munroe-Meyer Institute for Genetics and Rehabilitation in Omaha, Nebraska. His
professional interests include applied behavior analysis, functional behavioral assessment,
behavioral consultation, parent training, instructional interventions, and behavioral
pediatrics.

Susan H. Eaves is a doctoral student in counselor education at Mississippi State
University. Her research interests center around Borderline Personality Disorder,
Conduct Disorder, and marital infidelity. She holds national certification and is a
licensed professional counselor.

Catherine Flemming, M.A., NCC, is the director of Lay Ministry at Centenary
United Methodist Church in Winston-Salem, North Carolina. As part of her church
ministry, she places members in service opportunities appropriate for their gifts and
interests. In addition, she provides individual, marital, premarital, and group
counseling. She is a trained PREPARE/ENRICH administrator.

Kathleen Hall completed her master's degree in the School Counseling Program
of the Education Department at Loyola College in Maryland. She is currently a
professional school counselor in Florida.

Lauren Klein completed her master's degree in the School Counseling Program 
of the Education Department at Loyola College in Maryland. She is currently a high 
school counselor in Harford County Public Schools, Maryland. 




Lynn Linde is an assistant professor of education and the director of Clinical
Programs in the School Counseling Program at Loyola College in Maryland. She
received a master's degree in school counseling and a doctorate in counseling from
George Washington University. Dr. Linde was previously chief of the Student
Services and Alternative Programs Branch at the Maryland State Department of
Education, the Maryland State specialist for school counseling, a local school system
counseling supervisor, a middle and high school counselor, and a special education
teacher. She has made numerous presentations on ethics and legal issues for
counselors, and public policy and legislation over the span of her career. Dr. Linde is the
recipient of the ACA Carl Perkins Award, the Association for Counselor Education
and Supervision's Program Supervisor Award, and the Southern Association for
Counselor Education and Supervision's Program Supervisor Award, as well as
numerous awards from the Maryland Association for Counseling and Development
and from the state of Maryland for her work in student services and youth suicide
prevention.

Kathleen McNinch completed her master's degree in the School Counseling
Program of the Education Department at Loyola College in Maryland. She is
currently a high school counselor in Howard County Public Schools, Maryland.

Michael D. Mong received a B.S. degree in psychology from Louisiana State
University and is currently a Ph.D. student in school psychology at Mississippi State
University. His research interests include language acquisition, behavior disorders,
standardized versus nonstandardized testing procedures, and selective mutism. He is
currently employed as a behavioral specialist with Head Start programs and is
primarily responsible for student observations and assessments of both academic and
behavioral concerns.

Cheryl Moore-Thomas received her Ph.D. degree in counselor education from
the University of Maryland. She is a national certified counselor. Currently, Dr.
Moore-Thomas is an assistant professor of education in the school counseling
program at Loyola College in Maryland. Over her professional career, she has published
and presented in the areas of multicultural counseling competence, racial identity
development of children and adolescents, and accountability in school counseling
programs.

Deborah Newsome, Ph.D., LPC, NCC, is an assistant professor of counseling
at Wake Forest University, North Carolina, where she teaches courses in career
counseling, appraisal procedures, and statistics and supervises master's degree students in
their field experiences. In addition to teaching and supervising, Dr. Newsome
counsels children, adolescents, and families at a nonprofit mental health organization in
Winston-Salem, North Carolina.

Masanori Ota is a graduate student pursuing an educational specialist degree in
school psychology at Mississippi State University and is from Tokyo, Japan. Her
research interests are functional behavioral assessment, functional behavioral analysis,
and behavioral consultation in schools.

Carol Salisbury is a doctoral student in the Pastoral Counseling Department at
Loyola College in Maryland. Her research interests include exploring the positive
aspects of anger as a recuperative and useful emotion.






Carl J. Sheperis, Ph.D., NCC, LPC, is an assistant professor in the Department
of Counseling, Educational Psychology, and Special Education at Mississippi State
University. Dr. Sheperis's areas of specialization include assessment and treatment of
behavioral disorders and psychopathology. He is co-owner of Behavioral Research,
Assessment, and Training Services LLC, a psychological corporation primarily
serving Head Start organizations.




CHAPTER 1



Basic Assessment Concepts 

by Bradley T. Erford 



This initial chapter provides a whirlwind tour through the critical terminology,
purposes, and standards related to assessment. Assessment is sometimes
viewed as having a language all its own, so professional counselors are well
advised to learn this language in order to communicate with other professionals, and
to advocate for, and make decisions in the best interests of, the clients and students
they serve.



ASSESSMENT AND COUNSELING 



Welcome to the world of counseling: a world of wonder, mystery, and fulfillment; a
world where highly trained professional counselors attempt to understand and help
people encountering trauma and challenges or adjusting to life circumstances; a
world of clients and students (i.e., clients served by professional school counselors or
college counselors) trying to get back on track. By nature, human beings are complex
creatures made up of unique genetic structures and even more unique personal and
psychosocial experiences. In the clinical sense, these factors combine to create clients
and students who think, feel, and behave in individualistic ways, so individualistic
that no clinician, no matter how skilled, can ever predict the client's actions with
100% accuracy. In this sense, people are somewhat like puzzles: some are simpler to
understand and solve than others, but all have pieces that never quite seem to fit or
are even missing. Nevertheless, the more professional counselors know about a client
or student, the better they can understand and predict how the individual will react
under certain circumstances.






This is what assessment is all about. It is integral to the counseling process; the
professional counselor is always assessing. When a professional counselor first meets
a student or client, the process of assessment for understanding begins. This process
may be informal, formal, or somewhere in between; it may be structured,
unstructured, or somewhere in between. The point is, assessment begins from the moment
the professional counselor meets the student or client: Data are collected,
impressions are formed, and pieces of the puzzle are collected, analyzed, and fitted.
Assessment continues as the professional counselor helps the student or client to
select therapeutic objectives and treatments. Assessment culminates in an evaluation of
treatment outcomes to determine therapeutic success, or to obtain feedback
indicating that other treatment methods are needed. Assessment is counseling, and
counseling is assessment. Indeed, assessment is integral to every stage in the counseling
process (Whiston, 2005).

We emphasize the interrelationship of assessment and counseling on the very 
first pages of this book because students new to the profession often show little ex- 
citement for a course in measurement or assessment. Unfortunately, counselor-edu- 
cators who teach counseling assessment sometimes report that counseling students 
rate it low (close to research and statistics courses) on the "exciting courses scale." So 
please make the connection between assessment and counseling early in the course 
and your career: Assessment is the quickest way to understand students and clients. 
The better one understands clients or students, the better and faster one will be able 
to help them. Assessment saves the client time, money, and (most importantly) so- 
cial and emotional pain. The more efficient a professional counselor becomes in 
knowing a student or client, the more effective and respected the counselor will 
become. 

The purpose of this book is to help professional counselors to understand the 
most efficient and effective means for discovering, analyzing, and fitting the puzzle 
pieces together to understand and help students and clients. The reader will no 
doubt discover that some of the methods described are faster, more effective, technically
superior, and personally more appealing than others. There is wonderful
diversity in how the puzzle pieces can be acquired and configured. Indeed, many cli- 
nicians assessing the same client through different methods may arrive at varying 
conclusions because of personal perspectives. Thus, in many ways this course, at its 
core, is about who you will become as a professional counselor. How will you discern 
the pieces of your developing professional identity, your strengths and weaknesses? 
How will you cope with the challenging coursework and its applications to clinical 
settings? What cognitive abilities, behavioral patterns, and personality dispositions 
will become barriers? Which will provide the resiliency needed to succeed? Let the as- 
sessment begin! 



WHAT IS ASSESSMENT? 



For all intents and purposes, and especially from a professional point of view, the 
terms assessment and appraisal are synonymous. In this book, we use the term assessment
(or psychological assessment) consistently throughout. Assessment was defined in
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999, p. 3)
as "a process that integrates test information with information from other sources 
(e.g., information from the individual's social, educational, employment, or psycho- 
logical history)." Note that the preceding definition distinguishes assessment from test, 
instrument, or inventory in that assessment includes testing as only part of its process. 
Many authoritative sources differ slightly in their definitions of what comprises a psy- 
chological test. An often-cited definition of a psychological test is that provided by 
Anastasi and Urbina (1997, p. 4): "an objective and standardized measure of a sample 
of behavior." Assessment integrates tests in a way that helps a professional counselor 
to better understand clients and make decisions in their best interests. 

Often overlooked, but implicit in the foregoing definition of a psychological test 
is the word measure. Measure implies that a quantity of some construct or concept 
will be determined: how much anxiety, intelligence, math skill, introversion, suici- 
dal ideation, alcohol use, artistic interest, antisocial tendency, etc. The purpose of an 
assessment is to give the professional counselor valuable information regarding "how 
much" of a given characteristic the student or client possesses. Knowing how much 
helps to predict client behaviors, strengths, and weaknesses, thus facilitating impor- 
tant treatment or life decisions. 

Second, assessments measure a sample of behavior. Behavior is what humans do, 
whether the "doing" be overt physical acts, emotional or affective displays, or cogni- 
tions that are conveyed to others. Sampling is key to understanding any psycholog- 
ical phenomenon. If a professional school counselor observes a student's activity level 
during different activities (e.g., physical education class, independent in-seat class 
work time, lecture presentation, lunchtime), these different samples of behavior will 
often lead to different observable data and subsequent conclusions. Likewise, in a 
clinical setting, professional counselors usually see clients only under fairly specific 
conditions (i.e., in an office), again leading to a specific sample of behavior. Samples 
of behavior assessed under various conditions are critical to understanding the stu- 
dent or client. These measures and observations allow professional counselors to 
make inferences about how clients will behave or perform under normal and unusual 
circumstances. Such inferences are indispensable to the client's insight and self-un- 
derstanding, as well as to the insight of the professional counselor charged with the 
responsibility of helping the client to develop goals and an effective treatment plan. 

When assessing a sample of behavior, it is important that the sample faithfully 
represent the total domain of behavior under study. For example, when assessing sin- 
gle-plus-single-digit addition without regrouping (i.e., 4 + 3, not 8 + 7), the test de- 
veloper needs to determine how many problems of this type are required to assess a 
student's mastery of the behavioral domain — that is, how many of the 57 possible 
single-plus-single-digit-addition-without-regrouping problems would a child need 
to successfully perform before the examiner could have confidence the student had 
mastered this type of addition? One? Two? Five? All 57? Efficient sampling of behav- 
ior is crucial to effective assessment. 

Sometimes the professional counselor is also interested in the perspectives of 
others (e.g., teacher, parent, spouse) who have observed a sample of the client's behaviors
under various conditions. These more indirect methods give professional counselors
insight into student or client behavior in other environments not
easily accessed by the clinician. The common factor here is that the data collection, 
analysis, and judgment of professional counselors are influenced by tangible obser- 
vations of behavioral samples. But what if two professional counselors observe the 
same sample of behavior only to reach different conclusions? 

As a final piece of the definition of a psychological test, the terms standardized 
and objective are meant to work hand in hand to address counselor judgment as a 
potential source of error. Standardization refers to the systematic collection and 
analysis of data. Cronbach (1984) provided a comprehensive definition of standard- 
ization when he referred to a standardized test as one in which exact devices, mate- 
rials, verbal (or nonverbal) prompts, and scoring procedures have been fixed so that 
scores collected at various places and times and by different examiners are fully 
equivalent. Objective tests have scoring or observation criteria structured to such an 
extent that different examiners (e.g., trained judges, interviewers) have a very high 
likelihood of independently agreeing on a client's score on a given sample of per- 
formance behavior. To be sure, psychological assessments have varying degrees of 
standardization and objectivity. For example, on a multiple-choice test of written ex- 
pression for a 5th-grade student, different examiners may easily agree that answer 
choice b is correct, but when asked to determine the maturity of written expression 
in this student's essay, less agreement is likely, because the scoring of essays often in- 
volves more subjective (less objective) scoring criteria. Of course, the more standard- 
ized the written-expression assessment procedures and the more objective the scor- 
ing procedures, the greater is the likelihood of examiner agreement. 
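One simple way to put a number on examiner agreement is the proportion of responses that two examiners score identically. The sketch below is illustrative only: the function name percent_agreement and all of the scores are hypothetical, and simple percent agreement is just one of several possible agreement indices, not a procedure taken from any test manual.

```python
def percent_agreement(scores_a, scores_b):
    """Proportion of items on which two examiners assigned the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Hypothetical data: two examiners score the same five multiple-choice items
# (1 = correct, 0 = incorrect) and rate the same five essays on a 1-4 rubric.
multiple_choice_a = [1, 0, 1, 1, 0]
multiple_choice_b = [1, 0, 1, 1, 0]
essay_a = [3, 2, 4, 1, 3]
essay_b = [3, 3, 4, 2, 2]

print(percent_agreement(multiple_choice_a, multiple_choice_b))  # 1.0
print(percent_agreement(essay_a, essay_b))                      # 0.4
```

In this made-up example the objectively keyed items produce complete agreement, while the essay ratings agree on only two of five papers, which mirrors the contrast drawn in the paragraph above.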

Test developers strive to develop high-quality, accurate standardized and objec- 
tive tests (samples of behavior), and professional counselors strive to administer these 
instruments according to standardized procedures and to score each according to ob- 
jective criteria. Sounds like a perfect way to collect information about, and under- 
stand, a client, right? Unfortunately, even the best standardized and objective psy- 
chological assessments can lead to inaccurate conclusions. For example, the Reynolds 
Adolescent Depression Scale — Second Edition (RADS-2) (Reynolds, 2002), which will 
be discussed in Chapter 7, is incredibly easy to administer and score using the stan- 
dardized procedures, and is very objective. However, if students or clients do not 
want a professional counselor to think they are depressed, they need only to "fake 
good" on their test responses, and the test score will not indicate significant levels of 
depression. Thus, an unsuspecting professional counselor may not reach the appro- 
priate conclusion and may therefore not develop the most effective treatment plan 
for the client. 

Test developers and assessment specialists have developed countermeasures to 
help detect dishonesty and inaccurate responses — for example, some clinical, per- 
sonality, and behavioral inventories include validity scales. Also, professional coun- 
selors are trained to understand that all clients present information from their own 
point of view, and thus the counselor will seek validation of client perceptions from 
various sources of information (e.g., tests, inventories, rating scales, observations, interviews,
questionnaires) and respondents (e.g., spouses, parents, teachers, peers) as
possible and appropriate. These issues involve the reliability and validity of scores
and the decisions based on those scores and will be addressed throughout the re- 
mainder of this book. But prior to entering that realm, one must understand the 
multiple purposes for which professional counselors use assessment. 



The Purpose of Assessment 



At least four purposes of assessment have been identified in the extant literature 
(Erford, 2006; Gregory, 1999; Sattler, 2001): screening, diagnosis, treatment plan- 
ning and goal identification, and progress evaluation. 

Screening 

Screening is a quick procedure, usually involving a single measure, done for the pur- 
pose of determining whether deeper diagnostic assessment is necessary or warranted. 
A screening process is by no means comprehensive, and the instruments used for this 
purpose are sometimes held to lower standards of psychometric accuracy, although 
this is not always a desirable practice. Accuracy in screening is just as critical as ac- 
curacy in diagnosis because both procedures, done correctly, save students and clients 
emotional pain, time, and money. In all instances, professional counselors strive to 
use procedures that will maximize accurate decisions and minimize inaccurate deci- 
sions. For example, when conducting a screening procedure for depression, a profes- 
sional counselor will frequently use a self-report inventory of depression with a pre- 
determined cutoff to determine clinical significance. A client scoring above that 
cutoff score would be referred for further (diagnostic) assessment. Or, when a pro- 
fessional school counselor conducts a screening to determine which students are at 
risk for reading difficulties, students scoring below the predetermined level (perhaps 
< 25th percentile) will subsequently be referred for deeper-level assessments to fur- 
ther diagnose any reading difficulties and develop an effective treatment plan. 
Screening is an efficient first step in an assessment process because not every student 
or client needs diagnostic assessment. Diagnostic assessment tends to be more ex- 
pensive and more time consuming than screening and requires a greater level of skill 
to conduct, but there is a worthwhile trade-off in terms of efficiency and accuracy. 
Anastasi and Urbina (1997) referred to accurate identification decisions (some- 
times called hits) as true positives (clients who have a condition are identified by the 
screening test as having the condition) and true negatives (clients who do not have 
the condition are identified by the screening test as not having the condition). 
Inaccurate decisions (sometimes called misses) were referred to as false positives 
(clients who do not really have the condition are identified as having it) and false neg- 
atives (clients who really do have the condition are not identified as having it). (A 
graphic of these concepts can be found in Figure 4.2.) In screening procedures, pro- 
fessional counselors are most concerned with maximizing hits and minimizing 
misses, particularly false negatives, because these clients have the problem of concern 
but do not receive further diagnostic assessment to address the problem. They "slip 
through the cracks." 
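The four screening outcomes described above can be tallied mechanically once each client's screening score and eventual diagnostic status are known. The following is a minimal sketch under stated assumptions: the client data, the cutoff of 20, and the function name classify are all hypothetical, and a score strictly above the cutoff is treated as a positive screen, following the wording of the depression example.

```python
from collections import Counter


def classify(score, has_condition, cutoff):
    """Label one screening result; scores above the cutoff count as flagged."""
    flagged = score > cutoff
    if flagged and has_condition:
        return "true positive"
    if flagged and not has_condition:
        return "false positive"
    if not flagged and has_condition:
        return "false negative"
    return "true negative"


# Hypothetical clients: (screening score, condition confirmed by later diagnosis)
clients = [(31, True), (12, False), (25, False), (18, True), (9, False), (27, True)]
CUTOFF = 20  # hypothetical cutoff, not taken from any published inventory

outcomes = Counter(classify(score, has, CUTOFF) for score, has in clients)
hits = outcomes["true positive"] + outcomes["true negative"]
misses = outcomes["false positive"] + outcomes["false negative"]

print(dict(outcomes))
print("hits:", hits, "misses:", misses)  # hits: 4 misses: 2
```

The client who scored 18 but does have the condition is the false negative in this sketch: the one who would slip through the cracks. Lowering the cutoff would catch that client at the cost of more false positives.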






Diagnosis 

Diagnosis entails "a detailed analysis of an individual's strengths and weaknesses, 
with the general goal of arriving at a classification decision" (Erford, 2006, p. 2). 
Diagnosis always involves more than one measure and often includes a battery of 
tests. Such a battery is usually composed of a series of tests that are integrated to yield 
specific information or identification decisions. For example, the Wechsler Intelligence 
Scale for Children — Fourth Edition (WISC-IV) (Wechsler, 2001a) and the
Woodcock-Johnson: Tests of Achievement — Third Edition (WJ-III ACH) (Woodcock, Mather, &
McGrew, 2001) are frequently used in conjunction to determine the existence and 
extent of learning disabilities in school-aged children. In some cases, diagnostic as- 
sessment can be used to enhance normal development, as when a client presents for 
career counseling and the professional counselor wants to assess the individual's in- 
terests, competencies, values, and interpersonal strengths and weaknesses to help the 
person to arrive at an acceptable career goal, educational plan, or vocational strategy. 
Similarly, in premarital counseling, which is currently becoming more popular, mar- 
riage and family counselors use diagnostic assessments to aid in leading couples to in- 
terpersonal and intrapersonal insights that will strengthen the bonds of the relation- 
ship and help the couple to predict and navigate the challenges of marriage and 
family life. 

In general, diagnosis in counseling can be construed as trying to understand 
what is happening with a client, what the problem is, what causes or maintains the 
problem, and what strengths the client may harness to overcome the problem. 
However, in clinical contexts, diagnostic assessment has classification or diagnosis as 
its goal. This process generally requires the use of a classification system, and most 
professional counselors in clinical practice use the Diagnostic and Statistical Manual 
of Mental Disorders — Fourth Edition — Text Revision (DSM-IV-TR) (APA, 2000). The
DSM-IV-TR provides clinicians from all mental health professions (e.g., professional
counselors, psychiatrists, psychologists, social workers) with a standardized set of cri- 
teria upon which to base a diagnosis (i.e., a clinical description) of a client's present- 
ing condition. Such a system facilitates accurate, reliable decisions and helps to in- 
form the professional counselor of appropriate treatment strategies. The DSM is to 
mental health practitioners what the International Classification of Diseases (ICD) is
to physicians and to mental health workers in most other countries that do not use 
the DSM. However, there is disagreement in the counseling profession regarding the 
helpfulness of diagnosis to clients, as it frequently results in labeling of a client that 
may lead to a plethora of unintended and undesirable consequences (see Sattler, 
2001). 

Treatment Planning and Goal Identification

Helping clients and students is what counseling is all about. Assessment helps clients 
and students to understand where they are and where they want to go, a key facet of 
developing a client's goals and objectives for counseling. A counseling process that 
does not have well-defined and measurable goals has no focus or direction, nor does 
it allow the client and professional counselor to know when the goals of counseling
have been achieved. Thus, a primary purpose of assessment in counseling is to help
establish counseling goals, often through a combination of assessment methods, in- 
cluding interviewing and standardized testing. 

In addition, the information garnered from an initial assessment can be help- 
ful in planning a client's treatment. Frequently, student or client strengths, weak- 
nesses, challenges, and resiliency factors and resources are confirmed or better un- 
derstood through assessment procedures. "Treatment planning must flow logically 
from assessment results, fit the given environmental context of the client, and be 
individualized to mesh with the client's strengths and weaknesses" (Erford, 2006, 
p. 3). After the client and professional counselor agree on the goals and objectives 
to be pursued through counseling, the counselor must consider the most effective 
treatment options to obtain the desired outcomes. Thus a primary focus of the ini- 
tial assessment is to uncover student or client strengths and resources in order to 
plan for the most effective treatment. Of course, counseling would be incredibly 
simplified if specific test scores or client responses directly implied specific treat- 
ments or interventions. Unfortunately, the complexity of client problems rarely 
leads to such simplistic remedies. Important sources of information to help pro- 
fessional counselors with treatment planning are the outcomes research literature 
found in professional journals and compendiums of this research (e.g., Sexton, 
Whiston, Bleuer, & Walz, 1997; Whiston, 2003a). As a final note, treatment plan- 
ning usually gets easier with experience and, to some, may be more akin to art than 
science. In some employment settings, professional counselors often approach 
treatment of client problems from a theoretical paradigm that they are proficient 
in or comfortable with. When it comes to treatment planning, assessment often 
informs the professional counselor's practice. 

Progress Evaluation 

Once goals for counseling have been agreed on and treatment has begun, it is a pro- 
fessional counselor's responsibility to ensure that the treatment is helpful to a client 
(and, even more important, not harmful). This process is referred to as progress eval- 
uation or outcomes evaluation and, unfortunately, is frequently minimized in, or elim- 
inated from, a treatment regimen. Failure to periodically evaluate treatment progress 
is unethical and unprofessional, not to mention inefficient. If a treatment is having 
no positive effects and a professional counselor is not assessing its impact, the client 
is wasting time and money while continuing to experience the discomfort and emo- 
tional pain that brought the client to counseling. Tests and inventories can be very 
helpful aids in assessing treatment outcomes. 

The first step in evaluating progress is to establish a baseline measure of the stu- 
dent's or client's condition. This evaluation is generally done during an intake inter- 
view and initial assessment but can also be done at the time a counseling goal is es- 
tablished. Progress evaluation can be done formally or informally, subjectively or 
objectively. For example, an informal, subjective method would be to ask clients to 
rate their own feelings of anxiety (disorganization, depression, distractibility, etc.) on 
a scale from 0 to 10, with 0 being the total absence of anxiety and 10 being intense
anxiety. If the client self-rates as a 9, this score becomes a baseline for comparison in
future similar assessments, perhaps conducted at the beginning of each session over 
the following weeks. A more formal, objective method might involve a test such as 
the Beck Anxiety Inventory (BAI) (Beck, 1993). The client's initial score would serve
as the baseline, and the counselor would periodically readminister the BAI to assess 
whether the client's anxious symptoms have declined. Furthermore, given the client's 
baseline score, it is possible to establish a goal of a certain score on the BAI as a tar- 
get to determine when the anxiety has subsided to a substantial enough degree that 
termination of counseling can be considered. 
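The baseline-and-target logic described above can be sketched in a few lines. This is an illustration only: the weekly ratings, the target of 3, and the function name progress_report are hypothetical, and the example uses the informal 0-to-10 self-rating rather than scores from any published instrument.

```python
def progress_report(ratings, target):
    """Compare the latest self-rating with the baseline (first) rating.

    Lower ratings mean less anxiety on the 0-to-10 scale.
    """
    if not ratings:
        raise ValueError("at least one rating (the baseline) is required")
    baseline, latest = ratings[0], ratings[-1]
    return {
        "baseline": baseline,
        "latest": latest,
        "change": latest - baseline,      # negative values indicate improvement
        "target_met": latest <= target,
    }


# Hypothetical self-ratings collected at the start of each session;
# the first value is the baseline rating of 9.
weekly_ratings = [9, 8, 8, 6, 5, 3]
report = progress_report(weekly_ratings, target=3)
print(report)
# {'baseline': 9, 'latest': 3, 'change': -6, 'target_met': True}
```

Reaching the target rating would not by itself end counseling; as the text notes, it only signals that termination can be considered.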

The four purposes reviewed above provide a framework for the general use of 
assessment, but assessment is best applied to the practical aspects of counseling when 
fully integrated into the counseling process. The next section presents this fully in- 
tegrated model. 



HOW IS ASSESSMENT USED IN COUNSELING? 



As mentioned previously, assessment is counseling, and counseling is assessment. 
Assessment is totally integrated into the counseling process. Whiston (2005) re- 
ported that most counseling processes delineate at least the following four steps: 
(1) assessing client problems, (2) conceptualizing and defining client problems, 
(3) selecting and implementing effective treatments, and (4) evaluating counsel- 
ing effectiveness. 

In the first stage, professional counselors engage in screening and diagnostic assess- 
ment procedures to understand student or client concerns, issues, and problems. It is par- 
ticularly important that professional counselors conduct a comprehensive interview 
and administer appropriate tests and inventories to assess for broad functioning in 
the interest of "leaving no stone unturned." Incomplete assessments lead to incom- 
plete and ineffective treatment plans. It is best practice to ask these broad questions 
and conduct formalized assessments in the beginning of counseling rather than not 
ask, thus risking an underestimation of the scope of a problem or missing it alto- 
gether. The type of formal assessment used is often dependent on the nature of the 
setting and on the training and experience of the professional counselor. Elmore, 
Ekstrom, Diamond, and Whittaker (1993) reported that nearly three-quarters of the 
professional counselors surveyed indicated that assessments and tests were either im- 
portant or very important in their work setting. Predictably, the work of professional 
school counselors most frequently involved contact with achievement, intelligence, 
aptitude, and career or vocational measures (Elmore et al.; Giordano & Schweibert, 
1997), while the work of community and mental health counselors most frequently 
involved contact with clinical diagnostic, personality, intelligence, and vocational in- 
ventories (Bubenzer, Zimpfer, & Mahrle, 1990; Frauenhoffer, Ross, Gfeller, 
Searight, & Piotrowski, 1998). 

During the second stage of the counseling process, conceptualizing and defining 
problems, incomplete information will again limit a professional counselor's effectiveness
(Mohr, 1995). Professional counselors must continuously assess their understanding
of client concerns during the process of constructing a working definition
of a client's problem. Counselors at this point must reciprocally rule in and rule out
diagnostic categorizations and determine the frequency and severity of client con- 
cerns. Again, attention to comprehensiveness and detail at this stage will lead to a 
more effective treatment outcome. 

Treatment selection and implementation relies on an analysis of the results of as- 
sessments conducted during the first two stages of the counseling process. Again, the 
professional counselor questions the comprehensiveness of previous assessments and 
conducts additional assessment as required. Most importantly, process evaluation begins
at this time; it is the duty of the professional counselor to continuously assess the
impact of the treatment strategies implemented. In evaluation parlance, this is re- 
ferred to as formative assessment and allows for midcourse adjustments in treatment 
implementation to provide the most effective treatment possible. Formative assess- 
ment helps determine whether or not progress is being made toward treatment goals. 

Finally, during the fourth stage of counseling, evaluation, determinations must 
be made regarding the overall effectiveness of treatment — a process that evaluation 
specialists refer to as summative evaluation or outcomes assessment. One of the reasons 
a baseline measurement is so highly recommended in counseling is that it provides 
a starting point for treatment and evaluation. Evaluation at the end of counseling 
provides another point of comparison that allows professional counselors to demonstrate
to clients, students, and other stakeholders (e.g., employers, parents, insurance
companies) that substantive, measurable gains have been noted, counseling goals 
have been met, and counseling services have been effective. 

By now the meaning of the statement "assessment is counseling, and counseling 
is assessment" should be amply clear. Indeed, there was a time, during the 1930s and
1940s, when assessment and counseling were viewed synonymously (Hood & 
Johnson, 2002). Assessment is an essential, integrated part of an effective counseling 
process. 



ASSESSMENT COMPETENCE 

AND PROFESSIONAL COUNSELORS 



Professional counselors have a professional responsibility to become competent in 
the effective use of assessment procedures. A number of professional associations, 
scholars, and accreditation organizations have taken the lead in specifying what pro- 
fessional counselors need to know and be able to do in order to demonstrate assess- 
ment competence, while others have focused on the question of why assessment 
competence is intrinsic to effective counseling. This section explores the why, while 
the section that follows focuses on the what (i.e., the training standards for profes- 
sional counselors). 

Whiston (2005) provided six reasons why professional counselors must become 
proficient in the use of assessment procedures. Assessment proficiency is a profes- 
sional expectation. The American Counseling Association's Code of Ethics (ACA, 
2005a) dedicated an entire section to an explanation of ethical uses of tests, and the 
Council for Accreditation of Counseling and Related Educational Programs 
(CACREP), an organization that accredits university counselor education programs,
dedicated one of its eight core curricular areas to the study of assessment. As a result,
the public expects professional counselors to be proficient in the use and interpreta- 
tion of tests. In fact, the use of formalized assessment can frequently lead to a per- 
ception of enhanced credibility on the part of clients (Goodyear, 1990; Sexton et al., 
1997). Efficient identification of problems usually results from the competent use of 
tests (Anastasi & Urbina, 1997; Duckworth, 1990), and this efficiency is normally 
increased when professional counselors use multimethod assessment batteries (Meyer 
et al., 2001) rather than general interviewing procedures. Likewise, multimethod 
and multirespondent assessment methods usually help professional counselors un- 
cover diverse, even unique, client information (Meyer et al.) and even lead to client or 
student insight and learning (Campbell, 2000; Sax, 1997). In addition, assessment 
helps identify strengths and weaknesses of clients and students, and professional coun- 
selors use this information to facilitate decision making (Drummond, 2000; Sax). 
Frequently, clients who "see" objective testing results documenting their interper- 
sonal and intrapersonal strengths and weaknesses develop the motivation to make 
life decisions and to adjust their life course accordingly. Insightful realizations and 
details of conversations that occur during the course of counseling are sometimes 
forgotten or minimized as time goes on. Assessment results provide a concrete, visual 
record that can be referred to time and again to bring the counseling back on course 
and to show measurable progress. Now that we have addressed the why of assessment 
in counseling, let us turn our attention to the "what." 



Training Standards for Professional Counselors 



The Council for Accreditation of Counseling and Related Educational Programs 
(CACREP) is the national organization, affiliated with the American Counseling 
Association, that accredits universities with counseling programs meeting rigorous 
professional and curricular standards. CACREP offers accreditation for master's-level
specialty counseling programs in the areas of career counseling; college counseling; 
community counseling; marital, couple, and family counseling and therapy; mental 
health counseling; school counseling; student affairs counseling; and doctoral pro- 
grams in counselor education and supervision. The specific standard addressing the 
curricular requirements for assessment is Section II.K.7, found in Table 1.1. The 
reader will note that these standards align very well with the content of this book. 



Professional Counseling Organizations and Assessment 



Numerous professional counseling organizations and licensing or certification 
boards exist to promote best practices and develop policies and procedures that ad- 
vocate for client or student needs and protect the public from harm. The American 
Counseling Association (www.counseling.org) serves as the parent or umbrella or- 
ganization for all professional counselors and various professional counselor special- 
ties in the United States. In this context, counseling specialties (called divisions 
within ACA's structure) are defined as counselor practitioner entities that have a
guild or occupational presence in the counseling profession and job market.



Table 1.1 Assessment curriculum standard from Section II.K.7
of the CACREP 2001 Accreditation Manual

7. ASSESSMENT - studies that provide an understanding of individual and group
approaches to assessment and evaluation, including all of the following:

a. historical perspectives concerning the nature and meaning of assessment;

b. basic concepts of standardized and nonstandardized testing and other assessment
techniques including norm-referenced and criterion-referenced assessment,
environmental assessment, performance assessment, individual and group test and
inventory methods, behavioral observations, and computer-managed and computer-
assisted methods;

c. statistical concepts, including scales of measurement, measures of central tendency,
indices of variability, shapes and types of distributions, and correlations;

d. reliability (i.e., theory of measurement error, models of reliability, and the use of
reliability information);

e. validity (i.e., evidence of validity, types of validity, and the relationship between
reliability and validity);

f. age, gender, sexual orientation, ethnicity, language, disability, culture, spirituality, and
other factors related to the assessment and evaluation of individuals, groups, and
specific populations;

g. strategies for selecting, administering, and interpreting assessment and evaluation
instruments and techniques in counseling;

h. an understanding of general principles and methods of case conceptualization,
assessment, and/or diagnoses of mental and emotional status; and

i. ethical and legal considerations.



The following are among the current 19 ACA divisions (specialty areas) with special
interests in the professional practice of assessment:

■ American College Counseling Association (ACCA; www.collegecounseling.org) 

■ American Mental Health Counselors Association (AMHCA; www.amhca.org) 

■ American Rehabilitation Counseling Association (ARCA; www.arcaweb.org) 

■ American School Counselor Association (ASCA; www.schoolcounselor.org) 

■ Association for Assessment in Counseling and Education (AACE; 
http://aace.ncat.edu) 

■ Association for Counselor Education and Supervision (ACES; 
www.acesonline.net) 

■ International Association of Addiction and Offender Counselors (IAAOC; 
www.iaaoc.org) 

■ International Association of Marriage and Family Counselors (IAMFC; 
www.iamfc.com) 

■ National Career Development Association (NCDA; www.ncda.org) 

All of these organizations' websites, mailing addresses, and phone numbers 
can be found through ACA's main website, www.counseling.org. 






Think About It 1.1 Visit the ACA website at www.counseling.org or 
link to any of the websites individually listed above. Which professional or- 
ganizations offer services and products helpful to your development as a pro- 
fessional counselor? Which are you interested in joining? 



Another major influence in the counseling world is the American Psychological 
Association (APA; www.apa.org). APA serves as an umbrella organization for many 
other divisions dedicated to serving the public and the agenda of practitioner psy- 
chologists, some of whom are referred to as counseling psychologists. APA divisions 
serving specialties similar to ACA divisions include: 

■ Division 17 — Society of Counseling Psychology (www.div17.org) 

■ Division 22 — Rehabilitation Psychology (www.apa.org/divisions/div22) 

■ Division 28 — Psychopharmacology and Substance Abuse 
(www.apa.org/divisions/div28) 

■ Division 29 — Psychotherapy (www.divisionofpsychotherapy.org) 

■ Division 42 — Psychologists in Independent Practice (www.division42.org) 

■ Division 43 — Family Psychology (www.apa.org/divisions/div43) 

■ Division 50 — Addictions (www.apa.org/divisions/div50) 

A number of additional national associations exist that are not affiliated with 
ACA or APA, but which have substantial counselor and therapist memberships and 
legislative agendas, including: 

■ American Association for Marriage and Family Therapy (AAMFT; 
www.aamft.org) 

■ Association for Addiction Professionals (NAADAC; www.naadac.org) 

■ National Association of Social Workers (NASW; www.NASWDC.org) 

Finally, all states have licensing boards that regulate the practice of psychology 
and/or counseling within their borders. Because laws and regulations vary substan- 
tially from state to state, necessary qualifications and what professional counselors 
can do when practicing within these states also vary. Add to this the turf wars be- 
tween psychologist and professional counselor licensing boards and professional as- 
sociations that flare up in various states around the country, and the whole issue of 
which assessments and tests professional counselors can administer and interpret, 
where, and when can become quite confusing. It is unlikely that this situation will 
change anytime soon. It is incumbent upon professional counselors to stay abreast of 
practice developments within their state. 



Assessment Training Standards 



The area of psychological assessment is perhaps among the most contentious and 
hard-fought battlegrounds in counseling. As this book goes to press, battles between 
psychologists and professional counselors over the right to use psychological tests in 
clinical practice are being fought in California, Indiana, Illinois, Louisiana, and 
Maryland. Organizations, including the ACA, AACE, Association of Test Publishers 
(ATP; www.testpublishers.org), and Fair Access Coalition on Testing (FACT; 
www.fairaccess.org), are leading a national effort to allow qualified psychologists and 
counselors access to psychological tests in clinical practice. An ongoing stumbling 
block to access has been forging agreement on the term qualified. ACA recently de- 
veloped a position statement on test user qualifications with the goal that the docu- 
ment would serve as a consensus-building device (see Box 1.1). 



Box 1.1 ACA Policy Statement on Test User Qualifications 

Standards for Qualifications of Test Users 

American Counseling Association 

The professional qualifications essential to the use of tests in counseling arise 
from a synthesis of knowledge, skills, and ethics. While some professional 
groups are seeking to control and restrict the use of psychological tests,* the 
American Counseling Association believes firmly that one's right to use tests 
in counseling practice is directly related to competence. This competence is 
achieved through education, training, and experience in the field of testing. 
Thus, professional counselors with a master's degree or higher and appropri- 
ate coursework in appraisal/assessment, supervision, and experience are qual- 
ified to use objective tests. With additional training and experience, profes- 
sional counselors are also able to administer projective tests, individual 
intelligence tests, and clinical diagnostic tests. This training may occur in 
graduate school, in post-grad professional development instruction, or in su- 
pervised training in use of the test. Professional counselors are qualified to 
use tests and assessments in counseling practice to the degree that they pos- 
sess the appropriate knowledge and skills, including the following areas: 

1. Skill in practice and knowledge of theory relevant to the testing context 
and type of counseling specialty. 

Assessment and testing must be integrated into the context of the theory and 
knowledge of a specialty area, not as a separate act, role, or entity. In addi- 
tion, professional counselors should be skilled in treatment practice with the 
population being served. 

2. A thorough understanding of testing theory, techniques of test construc- 
tion, and test reliability and validity. 

Included in this knowledge base are methods of item selection, theories of 
human nature that underlie a given test, reliability, and validity. Knowledge 
of reliability includes, at a minimum: methods by which it is determined, 
such as domain sampling, test-retest, parallel forms, split-half, and inter-item 
consistency; the strengths and limitations of each of these methods; the 
standard error of measurement, which indicates how accurately a person's test 
score reflects their true score of the trait being measured; and true score 
theory, which defines a test score as an estimate of what is true. Knowledge of 
validity includes, at a minimum: types of validity, including content, 
criterion-related (both predictive and concurrent), and construct; methods of 
assessing each type of validity, including the use of correlation; and the 
meaning and significance of standard error of estimate. 

*For the purpose of this document, terms such as inventory, instrument, measure, and scale are 
encompassed by the terms test or assessment. 

3. A working knowledge of sampling techniques, norms, and descriptive, 
correlational, and predictive statistics. 

Important topics in sampling include sample size, sampling techniques, and 
the relationship between sampling and test accuracy. A working knowledge of 
descriptive statistics includes, at a minimum: probability theory; measures of 
central tendency; multi-modal and skewed distributions; measures of variabil- 
ity, including variance and standard deviation; and standard scores, including 
deviation IQ's, z-scores, T scores, percentile ranks, stanines/stens, normal 
curve equivalents, grade- and age-equivalents. Knowledge of correlation and 
prediction includes, at a minimum: the principle of least squares; the direc- 
tion and magnitude of relationship between two sets of scores; deriving a re- 
gression equation; the relationship between regression and correlation; and 
the most common procedures and formulas used to calculate correlations. 

4. Ability to review, select, and administer tests appropriate for clients or 
students and the context of the counseling practice. 

Professional counselors using tests should be able to describe the purpose 
and use of different types of tests, including the most widely used tests for 
their setting and purposes. Professional counselors use their understanding of 
sampling, norms, test construction, validity, and reliability to accurately as- 
sess the strengths, limitations, and appropriate applications of a test for the 
clients being served. Professional counselors using tests also should be aware 
of the potential for error when relying on computer printouts of test inter- 
pretation. For accuracy of interpretation, technological resources must be 
augmented by a counselor's firsthand knowledge of the client and the test- 
taking context. 

5. Skill in administration of tests and interpretation of test scores. 

Competent test users implement appropriate and standardized administration 
procedures. This requirement enables professional counselors to provide 
consultation and training to others who assist with test administration and 
scoring. In addition to standardized procedures, test users provide testing 
environments that are comfortable and free of distraction. Skilled interpretation 
requires a strong working knowledge of the theory underlying the test, 
test's purpose, statistical meaning of test scores, and norms used in test 
construction. Skilled interpretation also requires an understanding of the 
similarities and differences between the client or student and the norm samples 
used in test construction. Finally, it is essential that clear and accurate 
communication of test score meaning in oral or written form to clients, students, 
or appropriate others be provided. 

6. Knowledge of the impact of diversity on testing accuracy, including age, 
gender, ethnicity, race, disability, and linguistic differences. 

Professional counselors using tests should be committed to fairness in every 
aspect of testing. Information gained and decisions made about the client or 
student are valid only to the degree that the test accurately and fairly assesses 
the client's or student's characteristics. Test selection and interpretation are 
done with an awareness of the degree to which items may be culturally bi- 
ased or the norming sample not reflective or inclusive of the client's or stu- 
dent's diversity. Test users understand that age and physical disability differ- 
ences may impact the client's ability to perceive and respond to test items. 
Test scores are interpreted in light of the cultural, ethnic, disability, or lin- 
guistic factors that may impact an individual's score. These include visual, 
auditory, and mobility disabilities that may require appropriate accommoda- 
tion in test administration and scoring. Test users understand that certain 
types of norms and test score interpretation may be inappropriate, depend- 
ing on the nature and purpose of the testing. 

7. Knowledge and skill in the professionally responsible use of assessment 
and evaluation practice. 

Professional counselors who use tests act in accordance with the ACA's Code 
of Ethics and Standards of Practice (2005a); Responsibilities of Users of 
Standardized Tests — Third Edition (RUST-3) (AACE, 2003a); Code of Fair 
Testing Practices in Education (JCTP, 2002); Rights and Responsibilities of Test 
Takers: Guidelines and Expectations (JCTP, 2000); and Standards for 
Educational and Psychological Testing (AERA/APA/NCME, 1999). In addi- 
tion, professional school counselors act in accordance with the American 
School Counselor Association's (ASCA's) Ethical Standards for School 
Counselors (ASCA, 1992). Test users should understand the legal and ethical 
principles and practices regarding test security, using copyrighted materials, 
and unsupervised use of assessment instruments that are not intended for 
self-administration. When using and supervising the use of tests, qualified 
test users demonstrate an acute understanding of the paramount importance 
of the well-being of clients and the confidentiality of test scores. Test users 
seek on-going educational and training opportunities to maintain compe- 
tence and acquire new skills in assessment and evaluation. 


References 

American Counseling Association. (2005a). Code of Ethics and Standards of 
    Practice. Alexandria, VA: Author. 

American Educational Research Association, American Psychological 
    Association, National Council on Measurement in Education. (1999). 
    Standards for Educational and Psychological Testing. Washington, DC: 
    American Educational Research Association. 

American School Counselor Association. (1992). Ethical Standards for School 
    Counselors. Alexandria, VA: Author. 

Association for Assessment in Counseling. (2003a). Responsibilities of Users of 
    Standardized Tests (RUST). Alexandria, VA: Author. 

Joint Committee on Testing Practices. (2000). Rights and Responsibilities of 
    Test Takers: Guidelines and Expectations. Washington, DC: Author. 

Joint Committee on Testing Practices. (2002). Code of Fair Testing Practices 
    in Education. Washington, DC: Author. 

Note: Reprinted with permission from the American Counseling Association. No further reproduc- 
tion authorized without written permission from the American Counseling Association. 
Note: Approved by the American Counseling Association (ACA) Governing Council in March 2003, 
Anaheim, CA. The Standards for Test Use Task Force was an ad hoc committee of the American 
Counseling Association. The following counseling and education assessment professionals con- 
tributed to the drafting of this document: Dr. Bradley T. Erford (Chair), Mr. Alan Basham, Dr. Janet 
Wall, Dr. Craig S. Cashwell, and Dr. Gerald Juhnke. 
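
Several of the statistical ideas named in Box 1.1 (the standard error of measurement in item 2 and the standard scores in item 3) can be made concrete with a short computation. The following is a minimal Python sketch, not part of the ACA statement. It assumes the classical test theory formula SEM = SD x sqrt(1 - reliability), a normally distributed trait, and the conventional scalings for T scores (mean 50, SD 10), deviation IQs (mean 100, SD 15), and stanines (mean 5, SD 2, limited to 1 through 9). The function names and the example values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(observed, sem, level=0.95):
    """Band around an observed score, assuming normally distributed error."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return observed - z_crit * sem, observed + z_crit * sem


def standard_scores(raw, mean, sd):
    """Convert a raw score to common standard scores (normal model assumed)."""
    z = (raw - mean) / sd
    return {
        "z": z,
        "T": 50 + 10 * z,
        "deviation_iq": 100 + 15 * z,
        "percentile_rank": 100 * NormalDist().cdf(z),
        # Stanines are half-SD bands centered on 5, limited to 1..9.
        "stanine": min(9, max(1, math.floor(2 * z + 5.5))),
    }


if __name__ == "__main__":
    # Hypothetical test: norm mean 100, SD 15, reliability .91.
    sem = standard_error_of_measurement(15.0, 0.91)
    low, high = confidence_band(115.0, sem)
    print(f"SEM = {sem:.2f}; 95% band = {low:.1f} to {high:.1f}")
    print(standard_scores(115.0, 100.0, 15.0))
```

With these hypothetical values the SEM is about 4.5 points, so an observed score of 115 is better reported as a band of roughly 106 to 124 than as a single number. This is the kind of interpretation the standards above expect a qualified test user to be able to make and explain.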



The Association for Assessment in Counseling and Education's (AACE) 
Responsibilities of Users of Standardized Tests — Third Edition (RUST-3) (AACE, 
2003a) statement is one of the most important documents speaking to standards for 
test users. The RUST-3 statement addresses the issues of test user qualifications, tech- 
nical knowledge, test selection, test administration, test scoring, interpreting test re- 
sults, and communicating test results. 

AACE is a division of ACA and has been collaborating with the practitioner 
divisions of ACA (i.e., divisions that serve employment groups, such as school, 
mental health, substance abuse, and marriage and family counselors) to develop 
training standards for each specialty area. The goal of this initiative is to standard- 
ize the assessment training within various counseling specialty areas so that all pro- 
fessional counselors emerging from a counselor education program will have the 
knowledge, skill, and training to use psychological tests relevant to their clinical 
practice. The documents shown in Exhibits 1.a and 1.b, obtained from the 
AACE/International Association of Addiction and Offender Counselors 
(IAAOC) and the AACE/American School Counselor Association (ASCA), 
contain current assessment training standards for the specialty areas of substance 
abuse counseling and school counseling. Assessment standards for mental health 
counselors, career counselors, and marriage and family counselors are still under 







ASSOCIATION FOR ASSESSMENT 
IN COUNSELING AND EDUCATION 

Standards for Assessment in Substance Abuse Counseling 

These training standards provide a description of the knowledge and skills needed by substance abuse counselors in 
the areas of assessment and evaluation. Because effectiveness in assessment and evaluation is critical to effective 
counseling, these training standards are important for substance abuse counselor education and practice. Consistent with 
existing Council for Accreditation of Counseling and Related Educational Programs (CACREP) standards for preparing 
counselors, they focus on standards for individual counselors and the content of counselor education programs. The 
standards, which represent aspirations for competent professional practice, can be used by counselor and assessment 
educators as a guide in the development and evaluation of substance abuse counselor preparation programs, workshops, 
in-services, and other continuing education opportunities. They may also be used by substance abuse counselors to 
evaluate their own professional development and continuing education needs. 

During training, substance abuse counselors should meet each of the following assessment standards and have the 
specific skills listed under each standard. 

Standard I. Substance abuse counselors are able to assess the effects and withdrawal symptoms of 
commonly abused drugs. Substance abuse counselors can: 

1. Assess for and recognize acute intoxication syndromes for commonly abused chemicals (i.e., alcohol, benzodiaz- 
epines, marijuana, cocaine). 

2. Assess for and recognize withdrawal complications (i.e., seizures, delirium tremens, hallucinations). 

3. Assess for and recognize the effects of cross-addiction and dual addiction disorders. 

4. Assess for and recognize symptoms of inhalant use (e.g. the smell of fuel on clothes, red eyes, runny nose, 
cough). 

Standard II. Substance abuse counselors can assess the broad spectrum of concomitant disorders. Substance 
abuse counselors can: 

1. Assess for other addictive disorders (i.e., gambling, food, sex). 

2. Determine if a psychological disorder (i.e., anxiety, depression, panic, Post Traumatic Stress Disorder) was present 
prior to, or the result of, clients' substance use. 

3. Assess for Attention-Deficit/Hyperactive Disorder (AD/HD). 

4. Assess for suicidal or homicidal ideation. 

5. Assess for the presence or possibility of domestic violence. 

6. Use and interpret the results of adult and adolescent intelligence instruments. 

Standard III. Substance abuse counselors are skilled in evaluating the technical quality and appropriateness 
of testing instruments. Substance abuse counselors can: 

1. Identify acceptable reliability levels for instruments. 

2. Identify appropriate types of validity for commonly-used instruments. 

3. Evaluate the procedures used to validate commonly-used instruments. 

4. Locate testing instruments and information about instruments for special populations (e.g. visually impaired, 
nonreaders). 

5. Use computerized assessment instruments. 

6. Articulate the limitations of commonly-used instruments within the substance abuse counseling field. 

Standard IV. Substance abuse counselors are knowledgeable regarding qualitative assessment procedures 
including structured and semi-structured clinical interviews. Substance abuse counselors: 

1. Are familiar with the advantages and disadvantages of structured and semi-structured clinical interviews. 

2. Are familiar with qualitative assessment procedures (e.g. role playing, life line assessments, direct and indirect 
observations). 

3. Understand the advantages and disadvantages of qualitative assessment procedures. 

4. Understand the concepts of continuous assessment and wraparound services. 



Exhibit 1.a Standards for Assessment in Substance Abuse Counseling 

Source: Reprinted by permission of the Association for Assessment in Counseling/American Counseling Association. 



Standard V. Substance abuse counselors employ multiple methods when assessing clients and monitoring 
the efficacy of treatment. Substance abuse counselors: 

1. Use paper and pencil or computerized instruments and structured interviews, as appropriate. 

2. Whenever possible, consult with and interview family, friends, and other corroborating sources of information, 
while always obtaining written consent to gather information from sources other than the client. 

3. Monitor client progress throughout the counseling process. 

Standard VI. Substance abuse counselors are skilled in interpreting assessment results with clients. 
Substance abuse counselors can: 

1. Interpret assessment results in a helpful manner that emphasizes clients' strengths as well as possible problem 
areas. 

2. Explain to clients the steps that are necessary to share testing results with others (e.g. informed consent). 

Standard VII. Substance abuse counselors are skilled in using assessment results to develop and evaluate 
effective treatment interventions. Substance abuse counselors can: 

1. Accurately score, analyze, and interpret the results of testing. 

2. Create specific treatment plans based upon the results of testing. 

Standard VIII. Substance abuse counselors are aware of the need for professional development within the 
assessment area. Substance abuse counselors: 

1. Participate in training needed to keep abreast of new assessment instruments, procedures, and issues. 

2. Keep up to date with advancements in the field of assessment by reading the appropriate professional journals, 
test manuals, and reports. 

3. Join professional associations that provide relevant assessment and substance abuse information. 

Standard IX. Substance abuse counselors are aware of the appropriate use of assessment instruments in 
research. Substance abuse counselors use assessment instruments: 

1. To determine the efficacy of their interventions. 

2. Appropriate for the intended population/clients. 

3. In accordance with the American Counseling Association's Ethical Standards, Code of Fair Testing Practices, 
Standards for Educational and Psychological Testing, Responsibilities of Users of Standardized Tests, and Test 
Takers' Rights and Responsibilities. 

Standard XI. Counselor educators and supervisors of substance abuse counselors-in-training are able 
to effectively train counselors in the area of substance abuse assessment. Counselor educators and 
supervisors: 

1. Keep current with scholarship related to how to teach counselors-in-training how to best use assessment 
instruments in their work with clients. 

2. Are knowledgeable in the selection, use, evaluation, and interpretation of assessment instruments. 

Definitions of Terms 

Assessment: active collection of information about individuals, populations, or treatment programs. 

Instruments: standardized or nonstandardized tests, interviews, rating scales, inventories, or checklists used by mental health counselors 
to better understand the client; the client's past history; the client's current social, employment, physical or interpersonal 
environment; the client's intellectual functioning; the client's personality; or the client's presenting concerns. 

Standards: minimal levels of skill, knowledge, or training. 

Structured clinical interviews: clinical interviews with individuals, couples, families, or groups in which the mental health counselor asks 
questions precisely as directed by the instrument's author(s). Questions are posed in the order defined by the authors, and 
responses are recorded according to specific directions. 

Unstructured clinical interviews: clinical interview in which the mental health counselor is free to pursue related lines of inquiry to gain 
needed or pertinent information. 

Source: Reprinted with permission from the Association for Assessment in Counseling and Education. No further reproduction authorized without 
written permission from the Association for Assessment in Counseling and Education. 

Note. These standards were developed as a joint effort between the Association for Assessment in Counseling and Education (AACE) and the 
International Association of Addictions and Offenders Counselors (IAAOC). The joint committee included Dr. Bradley T. Erford (Chair), Dr. Gerald 
Juhnke, Dr. Russell Curtis, Mr. Joe Jordan, Dr. Kenneth Coll. 







COMPETENCIES IN ASSESSMENT AND EVALUATION FOR SCHOOL COUNSELORS 

Approved by the American School Counselor Association on September 21, 1998, 
and by the Association for Assessment in Counseling on September 10, 1998¹ 

The purpose of these competencies is to provide a description of the knowledge and skills that school counselors need in the areas 
of assessment and evaluation. Because effectiveness in assessment and evaluation is critical to effective counseling, these competencies 
are important for school counselor education and practice. Although consistent with existing Council for Accreditation of Counseling and 
Related Educational Programs (CACREP) and National Association of State Directors of Teacher Education and Certification (NASDTEC) 
standards for preparing counselors, they focus on competencies of individual counselors rather than content of counselor education 
programs. 

The competencies can be used by counselor and assessment educators as a guide in the development and evaluation of school 
counselor preparation programs, workshops, inservice, and other continuing education opportunities. They may also be used by school 
counselors to evaluate their own professional development and continuing education needs. 

School counselors should meet each of the nine numbered competencies and have the specific skills listed under each competency. 

Competency 1. School counselors are skilled in choosing assessment strategies. 

a. They can describe the nature and use of different types of formal and informal assessments, including questionnaires, checklists, 
interviews, inventories, tests, observations, surveys, and performance assessments, and work with individuals skilled in clinical 
assessment. 

b. They can specify the types of information most readily obtained from different assessment approaches. 

c. They are familiar with resources for critically evaluating each type of assessment and can use them in choosing appropriate 
assessment strategies. 

d. They are able to advise and assist others (e.g., a school district) in choosing appropriate assessment strategies. 

Competency 2. School counselors can identify, access, and evaluate the most commonly used assessment instruments. 

a. They know which assessment instruments are most commonly used in school settings to assess intelligence, 
aptitude, achievement, personality, work values, and interests, including computer-assisted versions and other 
alternate formats. 

b. They know the dimensions along which assessment instruments should be evaluated, including purpose, validity, 
utility, norms, reliability and measurement error, score reporting method, and consequences of use. 

c. They can obtain and evaluate information about the quality of those assessment instruments. 

Competency 3. School counselors are skilled in the techniques of administration and methods of scoring assessment 
instruments. 

a. They can implement appropriate administration procedures, including administration using computers. 

b. They can standardize administration of assessments when interpretation is in relation to external norms. 

c. They can modify administration of assessments to accommodate individual differences consistent with publisher 
recommendations and current statements of professional practice. 

d. They can provide consultation, information, and training to others who assist with administration and scoring. 

e. They know when it is necessary to obtain informed consent from parents or guardians before administering an 
assessment. 

Competency 4. School counselors are skilled in interpreting and reporting assessment results. 

a. They can explain scores that are commonly reported, such as percentile ranks, standard scores, and grade 
equivalents. They can interpret a confidence interval for an individual score based on a standard error of 
measurement. 

b. They can evaluate the appropriateness of a norm group when interpreting the scores of an individual or a group. 

c. They are skilled in communicating assessment information to others, including teachers, administrators, students, 
parents, and the community. They are aware of the rights students and parents have to know assessment results 
and decisions made as a consequence of any assessment. 

d. They can evaluate their own strengths and limitations in the use of assessment instruments and in assessing 
students with disabilities or linguistic or cultural differences. They know how to identify professionals with 
appropriate training and experience for consultation. 

e. They know the legal and ethical principles about confidentiality and disclosure of assessment information and 
recognize the need to abide by district policy on retention and use of assessment information. 

Source: Reprinted with permission from the Association for Assessment in Counseling and Education. No further reproduction authorized without written 
permission from the Association for Assessment in Counseling and Education. 

¹A joint committee of the American School Counselor Association (ASCA) and the Association for Assessment in Counseling (AAC) was appointed by the 
respective presidents in 1993 with the charge to draft a statement about school counselor preparation in assessment and evaluation. Committee 
members were Ruth Ekstrom (AAC), Patricia Elmore (AAC, Chair, 1997-1999), Daren Hutchinson (ASCA), Marjorie Mastie (AAC), Kathy O'Rourke (ASCA), 
William Schafer (AAC, Chair, 1993-1997), Thomas Trotter (ASCA), and Barbara Webster (ASCA). 



Exhibit 1.b Competencies in Assessment and Evaluation for School Counselors 






Competency 5. School counselors are skilled in using assessment results in decision making. 

a. They recognize the limitations of using a single score in making an educational decision and know how to obtain multiple 
sources of information to improve such decisions. 

b. They can evaluate their own expertise for making decisions based on assessment results. They also can evaluate the limitations of 
conclusions provided by others, including the reliability and validity of computer-assisted assessment interpretations. 

c. They can evaluate whether the available evidence is adequate to support the intended use of an assessment result for decision 
making, particularly when that use has not been recommended by the developer of the assessment instrument. 

d. They can evaluate the rationale underlying the use of qualifying scores for placement in educational programs or courses of 
study. 

e. They can evaluate the consequences of assessment-related decisions and avoid actions that would have unintended negative 
consequences. 

Competency 6. School counselors are skilled in producing, interpreting, and presenting statistical information about 
assessment results. 

a. They can describe data (e.g., test scores, grades, demographic information) by forming frequency distributions, preparing tables, 
drawing graphs, and calculating descriptive indices of central tendency, variability, and relationship. 

b. They can compare a score from an assessment instrument with an existing distribution, describe the placement of a score within 
a normal distribution, and draw appropriate inferences. 

c. They can interpret statistics used to describe characteristics of assessment instruments, including difficulty and discrimination 
indices, reliability and validity coefficients, and standard errors of measurement. 

d. They can identify and interpret inferential statistics when comparing groups, making predictions, and drawing conclusions 
needed for educational planning and decisions. 

e. They can use computers for data management, statistical analysis, and production of tables and graphs for reporting and 
interpreting results. 

Competency 7. School counselors are skilled in conducting and interpreting evaluations of school counseling programs and counseling-related interventions.

a. They understand and appreciate the role that evaluation plays in the program development process throughout the life of a 
program. 

b. They can describe the purposes of an evaluation and the types of decisions to be based on evaluation information. 

c. They can evaluate the degree to which information can justify conclusions and decisions about a program. 

d. They can evaluate the extent to which student outcome measures match program goals. 

e. They can identify and evaluate possibilities for unintended outcomes and possible impacts of one program on other programs. 

f. They can recognize potential conflicts of interest and other factors that may bias the results of evaluations. 

Competency 8. School counselors are skilled in adapting and using questionnaires, surveys, and other assessments to meet local needs.

a. They can write specifications and questions for local assessments. 

b. They can assemble an assessment into a usable format and provide directions for its use. 

c. They can design and implement scoring processes and procedures for information feedback. 

Competency 9. School counselors know how to engage in professionally responsible assessment and evaluation practices.

a. They understand how to act in accordance with ACA's Code of Ethics and Standards of Practice and ASCA's Ethical Standards for 
School Counselors. 

b. They can use professional codes and standards, including the Code of Fair Testing Practices in Education, Code of Professional 
Responsibilities in Educational Measurement, Responsibilities of Users of Standardized Tests, and Standards for Educational and 
Psychological Testing, to evaluate counseling practices using assessments. 

c. They understand test fairness and can avoid the selection of biased assessment instruments and biased uses of assessment 
instruments. They can evaluate the potential for unfairness when tests are used incorrectly and for possible bias in the
interpretation of assessment results.

d. They understand the legal and ethical principles and practices regarding test security, copying copyrighted materials, and 
unsupervised use of assessment instruments that are not intended for self-administration. 

e. They can obtain and maintain available credentialing that demonstrates their skills in assessment and evaluation. 

f. They know how to identify and participate in educational and training opportunities to maintain competence and acquire new 
skills in assessment and evaluation. 

Definitions of Terms 

Competencies describe skills or understandings that a school counselor should possess to perform assessment and evaluation activities 
effectively. 

Assessment is the gathering of information for decision making about individuals, groups, programs, or processes. Assessment targets 
include abilities, achievements, personality variables, aptitudes, attitudes, preferences, interests, values, demographics, and other 
characteristics. Assessment procedures include but are not limited to standardized and unstandardized tests, questionnaires, inventories, 
checklists, observations, portfolios, performance assessments, rating scales, surveys, interviews, and other clinical measures.

Evaluation is the collection and interpretation of information to make judgments about individuals, programs, or processes that lead to
decisions and future actions.



Exhibit 1.b continued




development. Efforts such as these have the goal of standardizing and formalizing 
the education and training required for professional counselors in various specialty 
areas to effectively use psychological tests. 

The right and responsibility to administer, score, and interpret psychological 
and educational tests involve the concerted efforts of professional counselors, legis- 
lators, state counseling board members, government bureaucrats, test publishers, ad- 
vocates, professional associations and affiliates, and the public. Protection of this 
right to test must occur continuously on several fronts, including laws, regulations, 
ethics, professional training, and professional practice. Professional counselors are 
encouraged to join professional associations and become actively engaged in legisla- 
tive and regulatory advocacy to benefit and protect the public safety and right to ac- 
cess quality, affordable counseling services. 



ASSESSMENT TERMS AND CONCEPTS 



The field of assessment contains many concepts that are essential to understand and 
remember. These concepts vary in degree of simplicity, familiarity, and abstractness. 
The list of terms and concepts presented in this section also serves as a way of clas- 
sifying and describing most tests that professional counselors will encounter and use. 
One of the things that makes assessment such a challenging area of study is its new 
and unusual terminology, causing some professional counselors to suggest that as- 
sessment is a language unto itself. In that spirit, the reader is well advised to spend 
the time needed to master the concepts in the remainder of this chapter. These con- 
cepts are the building blocks for understanding the field of assessment and for com- 
prehending the content in the remainder of this book and in the published test man- 
uals one will encounter. 



Standardized (Formal) and Nonstandardized (Informal) Tests 



Standardized tests have specific conditions for administration, timing, and scoring. 
This systematic process ensures that no matter who the examiner or examinee, the 
test will be administered under strict, replicable conditions. Standardized procedures 
allow comparability of scores and interpretations across different examinees and for 
the same examinee across administration times. Nonstandardized tests and other in- 
formal measures do not provide systematic measurements, nor are the administra- 
tion and scoring criteria fixed. Thus nonstandardized tests do not allow for compa- 
rability across examinees or administration times. In addition, standardized tests 
attempt to conform to rigorous test construction guidelines for establishing the re- 
liability and validity of scores, whereas nonstandardized tests may not. 

It is essential to understand that each method has advantages and disadvantages. 
For example, when interviewing, the professional counselor can use a structured in- 
terview (standardized), an unstructured interview (nonstandardized), or a semi-struc- 
tured interview (standardized format with leeway for unstructured questioning). The 
advantage of the structured interview is that different professional counselors inter- 
viewing the same client will likely reach the same conclusion because they ask the 
same questions and will probably get the same answers. This enhances the reliability
(and probably the validity) of the procedure. On the other hand, different profes- 
sional counselors interviewing the same client using an unstructured interview will 
ask different questions, will likely get different results, and will possibly reach differ- 
ent conclusions. The use of nonstandardized procedures more frequently leads to 
variable results because of a lack of systematic methodology. 



Norm-Referenced and Criterion-Referenced Tests 



In most cases, standardized tests are administered to a representative sample of par- 
ticipants, called a standardization sample, to determine average performances for 
various subgroups of interest (e.g., age, grade, male, female). Each such subgroup is
often called a norm group. A client's score on this norm-referenced test can then
be compared to the standardization sample results to determine where the client's 
score falls within that distribution of scores (i.e., Average, Above Average, Below 
Average). Thus norm-referenced tests allow comparison of a person's score to the 
scores of a comparison group with like characteristics (e.g., sex, age) that has al- 
ready taken the test. Norm-referenced tests are commonly used to assess intelli- 
gence, achievement, perceptual skills, personality, and behavior. Often the raw 
score obtained by a client is transformed into some type of standard score or per- 
centile rank. Note that the client's score simply indicates the individual's position 
relative to others in the sample, not whether the client "passed" or "failed" the test 
or is diagnosed with some mental disorder. Such judgments require the use of a 
criterion. 
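
As a rough illustration of the raw-score transformation described above, the following Python sketch converts a raw score to a z score, a T score (M = 50, SD = 10), and a percentile rank. The function name, the norm-group mean of 40, and the standard deviation of 8 are invented for illustration, and the percentile rank assumes normally distributed norm-group scores; published tests supply their own norm tables.

```python
from statistics import NormalDist

def standard_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to z, T, and percentile rank (normality assumed)."""
    z = (raw - norm_mean) / norm_sd          # distance from the norm-group mean in SD units
    t = 50 + 10 * z                          # T score metric: M = 50, SD = 10
    percentile = NormalDist().cdf(z) * 100   # percent of the norm group expected to score below
    return z, t, percentile

# Hypothetical norm group: mean raw score 40, SD 8 (not taken from any real test)
z, t, pr = standard_scores(raw=48, norm_mean=40, norm_sd=8)
print(f"z = {z:.2f}, T = {t:.0f}, percentile rank = {pr:.0f}")
```

The result locates the client within the comparison group only; it says nothing about passing, failing, or diagnosis.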

Criterion-referenced tests compare a person's score to a predetermined standard 
or level of performance — a criterion. Often a criterion-referenced test is administered 
to a standardization sample to help establish the criterion scores. Criterion-refer- 
enced tests are common in education because most teacher-made tests and perform- 
ance-based assessments have a standard for determining successful performance. For 
example, on a high-stakes state achievement test, a criterion for passing may be es- 
tablished at a cutoff score of 79; thus any student scoring at 79 or higher has "passed" 
the test; those below 79 did not. Likewise, on a depression screening test, a clinician 
may determine that scores of 20 and higher require further diagnostic evaluation, so 
a client receiving a score of 16 on the screening test would not meet the minimum 
criterion. Many DSM-IV-TR diagnostic checklists are set up to facilitate criterion- 
referenced decision making. For example, a diagnosis of Generalized Anxiety 
Disorder requires the documentation of three or more of the six specific listed diag- 
nostic criteria to a significant degree. 
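
The criterion-referenced decisions in this paragraph all reduce to comparing a score or a count with a predetermined cutoff. A minimal Python sketch follows; the cutoffs of 79, 20, and 3 come from the examples above, while the function name and the symptom record are hypothetical.

```python
def meets_criterion(score, cutoff):
    """Return True when the score is at or above the predetermined cutoff."""
    return score >= cutoff

# Cutoffs taken from the examples in the text
print(meets_criterion(79, cutoff=79))   # achievement test, student scoring exactly 79
print(meets_criterion(16, cutoff=20))   # depression screen, client scoring 16

# Checklist-style criterion: count documented symptoms against a required minimum.
# The six True/False entries are hypothetical client data.
symptoms_documented = [True, False, True, True, False, False]
print(meets_criterion(sum(symptoms_documented), cutoff=3))
```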

While most tests are designed to be norm referenced or criterion referenced, 
some diagnostic, clinical, and research decisions are made by applying criterion- 
referenced standards to norm-referenced results. For example, it is widely believed 
that the prevalence of Attention-Deficit/Hyperactivity Disorder (AD/HD) in the 
childhood population is about 5%. The Conners' Teacher Rating Scale — Revised 
(CTRS-R) (Conners, 1997) is a norm-referenced behavior rating scale commonly 
used in assessing AD/HD. The CTRS-R yields a T score (M = 50; SD = 10). 
Applying the principles of the normal curve, it can be determined that a T score 
of 67 or higher would represent approximately the highest 5% (most hyperactive, most
distractible) of a school-aged population. Thus, even though the CTRS-R is a norm-
referenced test, a clinician or researcher could use a criterion cut-score of T ≥ 67
to identify children with AD/HD.
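
The normal-curve reasoning behind this cut-score can be checked with a short Python sketch. It assumes that T scores in the population are normally distributed with M = 50 and SD = 10; the variable names are illustrative only.

```python
from statistics import NormalDist

t_scores = NormalDist(mu=50, sigma=10)   # T-score metric: M = 50, SD = 10

# Proportion of a normally distributed population at or above T = 67
proportion_at_or_above_67 = 1 - t_scores.cdf(67)

# T score that cuts off exactly the top 5% of the distribution
cutoff_top_5_percent = t_scores.inv_cdf(0.95)

print(f"Proportion at or above T = 67: {proportion_at_or_above_67:.3f}")
print(f"T score marking the top 5%: {cutoff_top_5_percent:.1f}")
```

Under these assumptions the exact 5% point falls between T = 66 and T = 67, so a whole-number cut-score of 67 captures slightly fewer than 5% of scores.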



Individual and Group Tests and Inventories 



Some tests and inventories are designed to be administered to only a single exami- 
nee at a time; others are designed for administration to groups of participants simul- 
taneously. The advantages of group tests are speed and efficiency. At the same time, 
there are limitations in the type of group administration formats available, usually 
involving paper-and-pencil and response booklet or Scantron (bubble) formats. 
Professional school counselors most frequently use or encounter group assessments 
involving achievement, aptitude, and ability within large-scale testing programs 
(Gibson & Mitchell, 1999). A major drawback of group-administered assessment is 
the inability to observe all examinees and control the factors that sometimes influ- 
ence student performance, the most important of which is student motivation. 

Individual tests are often used for diagnostic decision making and generally re- 
quire some interaction between the examiner and examinee. They allow the exam- 
iner to establish rapport, reduce anxiety, observe verbal and nonverbal behaviors, and 
pace the evaluation by providing breaks to decrease fatigue. Often the tasks admin- 
istered in an individual test require special training, expertise, materials, and timing 
or scoring procedures that require individual attention. The individual administra- 
tion format also gives the student or client the opportunity to demonstrate a deeper 
mastery of skill by allowing the examiner to query responses and provide instruction, 
and the examinee to clarify questions and task demands. 



Objective and Subjective Tests 



The terms objective test and subjective test refer to the method of scoring used in a 
given testing procedure. Objective tests leave no doubt as to the correctness of a 
given answer; correct answers are predetermined and require no judgment on the 
part of the examiner. As a result, regardless of who scores the test, the result will be 
the same. Multiple-choice and true-false items are examples of objectively scored questions.
Subjective tests require the examiner to make a judgment on the quality of
the response in scoring an item. Essay, constructed-response, and open-ended ques- 
tions ordinarily require some judgment. Objective items help to control subjective 
bias in scoring procedures (i.e., help to improve interscorer reliability). Many client 
characteristics assessed by professional counselors can be determined by objective 
methods; other characteristics or issues in the lives of clients are more easily assessed 
through subjective methods. 



Speed and Power Tests 



Different tests have differing classifications of item difficulty and response rates. 
Speeded tests generally include a large number of simple items. The task is to measure
how many of the simple items a person can complete within a certain amount
of time. The test is structured so that very few, if any, examinees complete all of the
items, and the score is simply the number of (correct) items completed within the 
time limit (i.e., a person's response rate). Tests of fluency and processing speed commonly
use speeded procedures. For example, the Math Fluency subtest of the WJ-III
(Woodcock, McGrew, & Mather, 2001) presents the examinee with 160 simple calculation
problems (e.g., 2 + 4 = ?, 1 × 4 = ?) within a three-minute time limit. The
examinee writes the numerical answer for each problem. The items are so simple that
very few errors are made, and the person's raw score is the number of items correct.
Obviously, the faster the examinee can compute and respond to simple math calculation
problems, the higher the score.
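
Speeded scoring as described here amounts to counting the correct responses completed before time expires. The Python sketch below illustrates that rule; the function name and the response record are hypothetical, and the 180-second limit mirrors the three-minute example.

```python
def speeded_raw_score(responses, time_limit_seconds):
    """Count correct responses completed within the time limit.

    `responses` is a list of (seconds_elapsed, is_correct) pairs, one per item attempted.
    """
    return sum(
        1
        for seconds_elapsed, is_correct in responses
        if is_correct and seconds_elapsed <= time_limit_seconds
    )

# Hypothetical record: the last item was answered after the 180-second limit
responses = [(5, True), (11, True), (19, False), (170, True), (185, True)]
print(speeded_raw_score(responses, time_limit_seconds=180))
```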

A power test generally has fewer items, but they are of varying levels of diffi- 
culty, and there are no time limits. The examinee can take as much time as needed 
to work each problem, and the score is the number of items responded to correctly. 
In some instances, more difficult items may be worth more points than less difficult 
items. This kind of examination is called a power test because the score is an indica- 
tor of the skills or abilities possessed by the examinee, without the pressure of time 
limits. Generally, some items are so difficult that perfect scores are rare. When measuring
math computation skill, the Math Calculation subtest from the WJ-III may
be used. This subtest presents math calculation problems of varying difficulty levels
(e.g., 3 + 4 = ?; 420 × 24 = ?; 3/4 − 1/4 = ?; 2x + 1 = 13, therefore x = ?), and the examinee's
raw score is an indicator of the amount of math skill possessed. The items vary
in difficulty, and most examinees eventually miss many items in a row (i.e., reach the 
ceiling level), at which time administration of the subtest ceases. The more proficient 
an examinee is in math calculation, the higher the person's score. 

Interestingly, even though some tests are classified as pure speeded tests or pure 
power tests, many tests include both facets — that is, they are designed as power tests 
with varying item difficulties but are administered under time limits. Usually, these 
time limits are sufficient for the majority of test takers to complete the examination. 
However, slower (for whatever reason) test takers often run out of time. For example,
the Scholastic Assessment Test (SAT), commonly used for college admissions decisions
by American universities, is designed as a power test with items of widely
varying difficulties, but it is administered under time-limited conditions. 
Importantly, time limit constraints frequently put disabled examinees at a distinct 
disadvantage, which is why many students and adults with documented learning dis- 
abilities or who receive accommodations under Section 504 of the U.S. 
Rehabilitation Act of 1973 petition for and receive extended time accommodations. 



Verbal and Nonverbal Tests 



Some verbal tests rely heavily on language usage, particularly oral or written re- 
sponses. These verbal responses require an examinee first to understand or compre- 
hend instructions, questions, and other task demands; then to verbally mediate and 
construct an appropriate response; and finally to deliver an oral or written response 
that passes the scoring criterion for the item. Even if a task does not require a verbal 
response, if the instructions are given orally, some verbal skill is required. Over the
past several decades, professional counselors have become acutely aware of the impact
of culture on language development and usage, particularly with persons for
whom English is not their primary language.

Figure 1.1 Matrix Design

On the other hand, nonverbal tests require students and clients to solve and re- 
spond to problems without the use of language. Sometimes these tests are called non- 
language tests, or performance tests. (Note: The use of the term performance in the con- 
text of nonverbal assessment here differs somewhat from its use in the section on 
performance assessment later in this chapter.) For example, on a typical matrix anal- 
ogy test, an examinee may be asked to look at several related designs and to select 
from among several choices the design that would either complete the pattern or pre- 
dict which design would appear next in the sequence (see Figure 1.1). Or, with block 
pattern items, such as those found on the Slosson Intelligence Test — Primary (SIT-P) 
(Erford, Vitali, & Slosson, 1999) (see Figure 1.2), a client may be given several cubes 
(all black on two sides, all white on two sides, and half black-half white on the other 
two sides), shown a picture of the blocks making a certain design, then asked to put 
the blocks together so they look just like the picture. Such tasks minimize verbal 
input and require spatial, figural, or visual processing skills — all nonverbal intellec- 
tual processes. 

Figure 1.2 Pattern Design

It is easy to assume that someone who is very intelligent would excel at both verbal
and nonverbal tasks, that someone with average intelligence would perform in an
average capacity on verbal and nonverbal tasks, and that someone who is not very in- 
telligent at all would do poorly on both types of tasks. Indeed, this is very frequently 
the case, though by no means always. An intelligent non-English-speaking client or 
a learning-disabled student may struggle tremendously on verbally laden tasks (as 
would be expected) while performing in an outstanding manner on the nonverbal 
tasks. Because culture influences language, examiners must take extra measures to
ensure that the examination is fair (i.e., unbiased). On the other hand, individuals
with some degree of brain damage or a visual processing disorder, or those
who have an accelerated learning environment, may demonstrate verbal capabilities 
far superior to their nonverbal capabilities. Most tests have some verbal component, 
even if it is only some brief verbal or written instructions. It is the examiner's legal, 
ethical, and professional responsibility to ensure that examinees receive a fair, unbi- 
ased assessment that reflects the examinee's abilities to the greatest extent possible. In 
all instances, professional counselors must take into account the extent to which lan- 
guage and cultural influences may affect student or client results. 



Cognitive and Affective Tests 



Cognitive ability tests generally fall into three categories: intelligence, aptitude, and 
achievement. They all measure, to various degrees, perceptual, processing, memory, 
and reasoning capabilities. Intelligence tests measure a person's ability to learn, solve 
problems, and understand increasingly complex or abstract information. Commonly 
used tests of intelligence include the Wechsler Adult Intelligence Scale — Third Edition 
(WAIS-III) (Wechsler, 1997) and the Stanford-Binet Intelligence Scale — Fifth Edition 
(SBIS-5) (Roid, 2003). Aptitude tests, in general, predict a person's capacity to perform 
some skill or task in the future (e.g., college, a training program). Aptitude tests have 
broad educational and vocational applications. For example, the SAT has been used
for decades by university admissions personnel to determine which college applicants 
are likely to do well in college (actually, the freshman year of college). Also, the 
Differential Aptitude Tests (DAT) (The Psychological Corporation, 1991a) are com- 
monly used as part of a vocational assessment battery to help high school students un- 
derstand the potential vocational strengths and weaknesses each possesses. 
Achievement tests are commonly used in education to measure knowledge students 
have acquired through instruction or training up to a certain point in their academic
career. Achievement tests can be norm referenced (comparing the examinee with other 
students) or criterion referenced (comparing the examinee with a standard of mas- 
tery). Nearly all teacher-made, classroom-administered tests are achievement tests and 
are usually criterion referenced. However, many individually administered diagnostic-
and screening-level tests have been developed, including the WJ-III (Woodcock,
McGrew, & Mather, 2001); the Wechsler Individual Achievement Test — Second Edition
(WIAT-II) (Wechsler, 2001b); and the Peabody Individual Achievement Test — Revised
(PIAT-R) (Markwardt, 1998). Also, most states have mandated high-stakes achievement
testing programs and contract with test publishers to develop standardized
achievement tests that align with specific state educational standards. 

Affective assessment is a broad category that, in general, assesses all noncogni- 
tive features of an individual, including temperament, clinical disposition, personal- 
ity, attitudes, values, and interests. Both structured and unstructured assessments are 
commonly used in affective assessment. Professional counselors frequently use struc- 
tured (formal) personality inventories for diagnostic purposes, hypothesis testing, treat- 
ment planning, and progress evaluation. Commonly used structured inventories in- 
clude the Minnesota Multiphasic Personality Inventory — II (MMPI-2) (Butcher et al., 
1992); the Millon Clinical Multiaxial Inventory — III (MCMI-III) (Millon, Davis, & 
Millon, 1997); and the Strong Interest Inventory (Harmon, Hansen, Borgen, & 
Hammer, 1994). Unstructured (informal) assessment often involves the use of projec- 
tive techniques and qualitative methods. Projective techniques are based on psycho- 
analytic theory and normally present the client with unstructured, ambiguous stim- 
uli, allowing the client to "project" thoughts and feelings onto the stimulus. 
Examples of ambiguous stimuli include inkblots, pictures, incomplete sentences, or 
even a single word. Such unstructured tasks give the client great latitude in how to 
respond or as to the content of the response, and it is incumbent upon the professional
counselor to analyze and interpret the responses to yield insights into a client's
motivation, personality, values, and so forth. An advantage of a projective technique
is that because it is ambiguous and there are no right or wrong answers, it is difficult 
for clients to fake responses. Their responses are simply based on what comes into
their minds at the time they respond to the task. A disadvantage of projective techniques
is that some of the tests require extensive education and training. Examples
of projective tests include inkblot techniques such as the Rorschach Inkblot Test 
(Rorschach, 1969); picture-story techniques such as the Thematic Apperception Test 
(TAT) (Murray & Bellak, 1973) and Roberts Apperception Test for Children
(McArthur & Roberts, 1994); drawing and query techniques such as the House-Tree- 
Person (H-T-P) (Van Hutton, 1994) and Kinetic Drawing System for Family and 
School (Knoff & Prout, 1985); and completion techniques such as incomplete sen- 
tences or word association. 



Maximum and Typical Performance Measurement 



In maximum performance measurement, the professional counselor strives to assess 
the best performance of which the examinee is capable. In this way, the examiner has 
a good estimate of the upper level of achievement or ability at which the client could 
be expected to perform. When conducting diagnostic assessment for the determination
of a learning disability, the examiner strives to obtain maximum ability and
achievement estimates because such decisions have important long-term implications.

In typical performance measurement, the professional counselor seeks to ob- 
tain a sample of the client's performance under normal circumstances, or on a "typ- 
ical day." Professional counselors conducting clinical, personality, or vocational as- 
sessments often strive for typical performance estimates to understand the client's 
performance under normal circumstances. In this way, the professional counselor 
gets to know the client's habitual thoughts, feelings, interests, and behaviors. 



Behavioral Observations 



Unfortunately, many people view assessment only as the administration of tests. But 
assessment of any kind relies heavily on behavioral observations, observations that 
begin at the moment the professional counselor speaks to or meets the client or stu- 
dent for the first time. Observations can be conducted through either direct or indi- 
rect means. One common form of direct observation is direct behavioral assessment, in 
which the professional counselor is actually physically present in the same environ- 
ment with the client and uses a data collection procedure to assess the frequency, du- 
ration, and/or magnitude of one or more target behaviors. For example, a professional 
school counselor may observe a 2nd-grade student referred for overactivity by using a 
time-on-task observation system. Briefly, such a procedure allows the counselor to ob- 
serve the frequency of the target student's (the student suspected of being hyperac- 
tive) and of one or two control students' (students of the same sex, but not suspected 
to be substantially hyperactive) motor on-task behavior during classroom activities. 
Such observations allow the counselor to determine whether the target student is sub- 
stantially more overactive than other children of the same age. Anecdotal observations 
are also commonly used and allow the observer to document in a narrative format 
what was observed during an observation period. The purpose of an anecdotal report 
is to describe client behaviors in some detail so that, over time, a rich understanding 
of the factors surrounding the behavior can be obtained. Often special training is re- 
quired of observers to minimize bias and enhance inter-observer reliability (i.e., agree- 
ment between the observations or ratings of two or more observers). 
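
Inter-observer agreement for an interval-based time-on-task record can be summarized as the share of intervals on which two observers recorded the same code. The Python sketch below uses invented data and a hypothetical function name. Simple percent agreement does not correct for agreement expected by chance, so it is only a first check.

```python
def percent_agreement(observer_a, observer_b):
    """Share of observation intervals on which two observers recorded the same code."""
    if len(observer_a) != len(observer_b):
        raise ValueError("Both observers must rate the same intervals")
    matches = sum(a == b for a, b in zip(observer_a, observer_b))
    return matches / len(observer_a)

# Hypothetical time-on-task record: 1 = on task, 0 = off task, ten intervals
observer_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
observer_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

on_task_rate = sum(observer_a) / len(observer_a)
print(f"On-task rate (observer A): {on_task_rate:.0%}")
print(f"Inter-observer agreement: {percent_agreement(observer_a, observer_b):.0%}")
```

The same on-task rate can be computed for a control student so that the two rates can be compared directly.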

Behaviors can also be assessed through indirect observation, usually using behav- 
ior rating scales or checklists. These instruments ask questions of people (e.g., spouse, 
parent, teacher, peer) in a good position to observe the typical behavior of a student 
or client and provide responses that give the professional counselor multiple perspec- 
tives and valuable clinical insights. Some behavioral disorders (e.g., AD/HD) require 
that problematic behaviors be observed in more than one setting, and behavior rat- 
ing scales completed by parents or teachers help to verify student or client difficul- 
ties in a time-efficient manner. 



Basals, Starting Points, and Ceilings 



Many intelligence, aptitude, and achievement tests present items in order of increas- 
ing difficulty. For example, most subtests on the Woodcock-Johnson: Tests of 
Achievement — Third Edition (WJ-III ACH) (Woodcock, McGrew, & Mather, 2001)
present items in approximate order from least difficult to most difficult. Likewise, 
the Slosson Intelligence Test — Revised (SIT-R) (Nicholson & Hibpshman, 1990) presents
187 verbal ability items in an order that approximates least to most difficult.
This hierarchical ordering allows administration procedures that substantially enhance
efficiency and speed. Because the items are in approximate order from least
difficult to most difficult, it is logical to assume that if a student gets item 11 correct,
the odds are good that the student would also get items 1-10 correct, because each
is easier than item 11. One can easily see how much faster administration would be
if any examinee getting item 11 correct would not need to answer items 1-10. Of 
course, this is only an assumption, and exceptions do occur on a frequent basis. 
However, test developers have determined that the probability of violating this as- 
sumption diminishes tremendously when a series of consecutive items is used. A 
basal series is a predetermined number of consecutive, correct items that must be 
obtained by an examinee in order to eliminate the need to administer numerous eas- 
ier items on the same test or subtest. For example, the SIT-R requires a basal series 
of 10 in a row correct, while many subtests on the WJ-III ACH require a basal of 6 
in a row correct. Establishing a basal series gives the examiner confidence that, if the 
examinee were administered all the items preceding the basal, the examinee would 
get them all correct. Again this is an assumption, but one backed up by substantial 
statistical probability. The assumption is generally true in 95% or more of the cases, 
and when it is not true, the examinee almost never misses more than one or two of
the easier items. Thus the examinees' scores ordinarily are not substantially inflated. 

Of course, there is no need to establish a basal series if all examinees begin administration
with item 1. That is why many test developers establish starting points
for administration based on the age or grade of the examinee. For example, an 8-year-old
being administered the 187-item SIT-R would ordinarily begin with item
55, a 13-year-old at item 105. These starting points are usually designated by determining
the point at which nearly all (i.e., 95%) of 8-year-olds or 13-year-olds will
get the first item correct and go on to obtain the required basal series of 10 in a row 
correct. For example, on the SIT-R, the examiner would begin administration to an 
8-year-old with item 55, then continue until the basal series has been obtained. If the 
student gets item 55 correct and responds correctly to items 56-64, the basal series 
requirement has been met, and administration of the test items continues. 

Different tests vary as to the proper procedure to follow if one of the items is 
missed during the attempt to establish the basal series. Some require the examiner to 
stop forward administration and administer the items in reverse order until the basal 
has been established. As an example, imagine that an 8-year-old student being
administered the SIT-R answers items 55-60 correctly, then misses item 61. Because
the SIT-R requires a basal of 10 items and the student has only 6 in a row correct, 
the professional counselor is required to return to item 54 and administer the items 
in reverse order until the student responds to 10 items correctly (i.e., 54, 53, 52, 51). 
At that point, having established the required basal series, the professional counselor 
returns to item 62 and administers the remaining items until a ceiling is reached. 
This example is provided in Figure 1.3. 
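
The reverse-administration rule in this example can be sketched in code. The Python sketch below is illustrative only: the function name `establish_basal`, the `is_correct` callback, and the handling of a miss during reverse administration (the run simply restarts below the missed item) are assumptions made for the illustration, not rules taken from the SIT-R manual.

```python
def establish_basal(is_correct, start, basal_len=10, last_item=187):
    """Return (lowest, highest, administered) for the basal series.

    is_correct(item) -> bool is a hypothetical stand-in for the examinee's
    response to a numbered item; it is not part of any published test.
    """
    administered = []   # items in the order they were given
    run = []            # current unbroken run of correct items

    # Forward phase: begin at the age-based starting point.
    item = start
    while len(run) < basal_len and item <= last_item:
        administered.append(item)
        if not is_correct(item):
            break                      # a miss stops forward administration
        run.append(item)
        item += 1

    # Reverse phase: work downward from the item just below the start.
    item = start - 1
    while len(run) < basal_len and item >= 1:
        administered.append(item)
        if is_correct(item):
            run.insert(0, item)        # extend the run downward
        else:
            run = []                   # assumption: restart below the miss
        item -= 1

    if len(run) < basal_len:
        raise ValueError("basal not established; consult the test manual")
    return run[0], run[-1], administered


# The text's example: every item is answered correctly except item 61.
low, high, given = establish_basal(lambda item: item != 61, start=55)
print(low, high)   # 51 60
print(given)       # [55, 56, 57, 58, 59, 60, 61, 54, 53, 52, 51]
```

With the example's responses, the sketch administers items 55-61 forward, then 54, 53, 52, and 51 in reverse, and reports items 51-60 as the basal series.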

[Figure 1.3: SIT-R individual test form (image not reproduced). The form directs the examiner to mark each question with a (1) for passing or a (0) for failing, to begin testing where the examinee can pass "10 in a row" without making a mistake, and to continue testing until the examinee misses "10 in a row." Its scoring box lists Basal Item, Questions passed after basal item, Raw Score (total of above), and Ceiling Item. The continuation page shows sample item stimuli, including #14 "Which of these squares is smaller?" and #109 "Illustrate latitude and longitude."]

Figure 1.3 Protocol for Slosson Intelligence Test-Revised

Source: Copyright 1991, Slosson Educational Publications, Inc. All rights reserved. Reprinted with permission from Slosson Educational
Publications, Inc. No further reproduction authorized without permission from Slosson Educational Publications, Inc.

A ceiling series is the number of consecutive incorrect items an examinee must obtain
before test administration can be halted. The concept of a ceiling is based on the same
assumptions as those underlying a basal; that is, because the items continue on in
order of increasing difficulty, if a student misses item 60, there is a statistical likeli- 
hood that the individual would miss items 61 and above because these items are even 
more difficult. As with the basal series, the accuracy of that assumption is bolstered 
when the test developer specifies a certain number of items in a row that must be 
missed before administration can stop. The WJ-III ACH generally specifies a ceiling
of 6 incorrect items to cease administration of a subtest. The SIT-R specifies a
ceiling of 10 errors in a row. Continuing with our SIT-R example (see earlier discussion
and Figure 1.3), suppose the student responds correctly to items 62 and 63, misses
64, gets 65 correct, but then misses items 66-75. Missing the last 10 items in a row
fulfills the requirement of the ceiling series, so administration of the test ceases, and
the student receives no points for any items above the ceiling series. The professional
counselor can then complete the scoring of the SIT-R protocol and transform the 
raw score into standard scores and percentile ranks for interpretation. Note that the 
assumption is that missing 10 items in a row means the student is very unlikely to 
get any of the even more difficult items correct. Again, this assumption is almost al- 
ways valid, but there is a negligible statistical probability that some examinees may 
get one or more additional items correct should the administration of items con- 
tinue. Just as before, denying an examinee 1 or 2 additional raw score points proba- 
bly will not substantially suppress an examinee's overall score. 

One can easily see the timesaving efficiency and benefits of using basal and
ceiling series. In our SIT-R example, the professional counselor administered only items
51-75. This means that a student's score was determined by administering only 25
of the 187 total items (about 13%). Basal and ceiling series can thus be tremendously
time saving without compromising on accuracy and meaningfulness. In addition,
these procedures save clients and students from having to endure the tedious admin- 
istration of numerous items that are far too simple, and the emotional frustration of 
having to deal with numerous items that are far too difficult. 
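
The ceiling rule and the arithmetic behind the timesaving claim can be sketched in a few lines of Python. This is an illustration under stated assumptions: `administer_to_ceiling` and `is_correct` are hypothetical names, the basal series is taken to be items 51-60 from the worked example (11 items administered, counting the missed item 61), and the raw score follows the lines printed on the SIT-R protocol (basal item plus questions passed after the basal item), with "basal item" read here as the highest item of the basal series.

```python
def administer_to_ceiling(is_correct, resume_at, ceiling_len=10, last_item=187):
    """Give items upward until ceiling_len consecutive misses occur."""
    passed, administered = [], []
    misses_in_a_row = 0
    item = resume_at
    while item <= last_item and misses_in_a_row < ceiling_len:
        administered.append(item)
        if is_correct(item):
            passed.append(item)
            misses_in_a_row = 0        # a correct answer resets the count
        else:
            misses_in_a_row += 1
        item += 1
    return passed, administered


# Responses from the worked example: 62, 63, and 65 correct; 64 and 66-75 missed.
passed, given = administer_to_ceiling(lambda item: item in {62, 63, 65}, resume_at=62)

basal_item = 60            # assumption: highest item of the basal series 51-60
basal_phase_items = 11     # items 55-61 forward plus 54-51 in reverse
raw_score = basal_item + len(passed)
total_given = basal_phase_items + len(given)

print(given[-1])                                      # 75
print(raw_score)                                      # 63
print(total_given, round(100 * total_given / 187))    # 25 13
```

Counting every item the example touches (51 through 75) gives 25 of 187 items, or roughly 13% of the test.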



Reliability



Reliability is discussed in detail in Chapter 3. For now, it is important to know that
reliability means consistency. If a client receives an IQ score of 70 one day and 130 
the next, what helpful decision could a professional counselor make about a client's 
life? If a client's score cannot be consistently measured, it is of little use. 

Reliability of scores can be determined through a variety of means, each of 
which assesses a different type of score error. For example, test-retest reliability in- 
volves determining the relationship (correlation) between scores on the administra- 
tion of the same test to the same participants on two different occasions (e.g., one 
hour, two weeks, one month, one year apart). The resulting coefficient is a measure 
of the test scores' stability over time — essential information when trying to consis- 
tently track a client's or student's performance or response to treatment over a given 
period of time. 
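
As a rough illustration of a test-retest coefficient, the Python sketch below correlates two administrations of the same test. The scores are invented for the example, and the Pearson correlation is used as the usual choice of coefficient; Chapter 3 of the text, not this sketch, is the authority on which statistic suits a given situation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical standard scores for six clients tested two weeks apart.
first_testing = [95, 102, 88, 110, 120, 99]
second_testing = [97, 100, 90, 108, 123, 101]

test_retest_r = pearson_r(first_testing, second_testing)
print(round(test_retest_r, 2))
```

A coefficient near 1.0 indicates that clients kept nearly the same rank order across the two occasions; a client scoring 70 one day and 130 the next would pull the coefficient down.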

It is important to understand from the start that no test is reliable or unreliable.
It is test scores that possess the characteristic of consistency. Most importantly, the
reliability of test scores varies across samples of participants. For example, it is likely
that the reliability coefficients derived from scores on a substance abuse inventory
will vary substantially depending on whether clients who abuse substances or clients
who do not abuse substances are used in the sample.



Validity



Validity means usefulness. Validity of scores can be determined through a variety of 
means, each of which provides evidence of a different type of usefulness. Content- 
related validity is a systematic examination of the items making up a test to deter- 
mine the comprehensiveness of content coverage. This type of validity is particu- 
larly relevant for academic achievement tests because academic areas generally have 
a well-established domain of behavior, and sampling is critical to deriving useful 
generalizations from any derived score. For example, if a mathematics test is com- 
posed only of addition problems, the scores may be valid indicators of a person's 
addition skills but may be substantially less useful in predicting a person's overall 
mathematical abilities. 

Criterion-related validity involves a test's ability to predict some criterion, either 
at the present time (criterion concurrent) or at some point in the future (criterion
predictive). Many criteria are commonly used for comparison, and score validity is
generally expressed as a specialized correlation coefficient known as a validity coefficient.
If one is attempting to validate scores from a new anxiety inventory, one may choose 
criteria such as previously existing anxiety scales, behavioral observations, or diag- 
nostic categorizations (i.e., previously diagnosed or currently diagnosable). 
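
When the criterion is a diagnostic categorization, one common way to express the validity coefficient is the point-biserial correlation between inventory scores and the yes/no diagnosis. The Python sketch below is a hedged illustration: the choice of the point-biserial statistic, the function name, and the data are assumptions for the example rather than the procedure described in Chapter 4.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(scores, diagnosed):
    """Correlate continuous scores with a 0/1 criterion (1 = diagnosed)."""
    group1 = [s for s, d in zip(scores, diagnosed) if d]
    group0 = [s for s, d in zip(scores, diagnosed) if not d]
    if not group1 or not group0:
        raise ValueError("both criterion groups must be represented")
    p = len(group1) / len(scores)          # proportion diagnosed
    spread = pstdev(scores)                # population SD of all scores
    if spread == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return (mean(group1) - mean(group0)) / spread * sqrt(p * (1 - p))


# Hypothetical anxiety inventory scores and prior diagnoses for eight clients.
inventory_scores = [12, 15, 9, 22, 25, 19, 28, 11]
previously_diagnosed = [0, 0, 0, 1, 1, 0, 1, 0]

validity_coefficient = point_biserial(inventory_scores, previously_diagnosed)
print(round(validity_coefficient, 2))
```

The larger the gap between the mean scores of diagnosed and nondiagnosed clients, relative to the overall spread of scores, the higher the coefficient.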

Construct validity helps determine what a test measures (the idea or construct) 
and how well it measures it. A construct is a relatively abstract idea that cannot be 
measured directly, but which can be inferred. Intelligence, depression, introversion, 
self-esteem, and locus of control are all examples of constructs. Constructs can be 
validated through a variety of methods, including factor analysis, correlations with 
other tests, and convergent or discriminant techniques. Chapter 4 covers in detail 
each of these classical methods for determining validity of scores, as well as decision- 
making strategies using these scores. 

As with the concept of reliability, no test is valid or invalid. It is test scores that 
possess the characteristic of usefulness, and the validity of test scores varies across 
samples of participants and according to the various purposes that a test is intended 
to address. For example, it is likely that the validity of scores on a measure of
self-esteem will vary substantially depending on the characteristics of the clients being
assessed, such as when a more homogeneous sample of adolescent Hispanic females with
eating disorders is studied, as opposed to a more heterogeneous sample of culturally 
diverse males and females without diagnosable pathology. Likewise, that same self- 
esteem scale may provide excellent, accurate predictions of academic self-esteem, but 
only moderately accurate predictions of academic performance (i.e., grades, test 
scores) and poor predictions of a client's degree of depressive symptoms. Different 
tests are designed for different purposes and for use with different populations. 
Validity is the study of these uses and populations. 






Formative Versus Summative Evaluation 



Tests are often used to evaluate curricular and treatment programs. When a test is 
administered during the course of treatment or instruction with the purpose of in- 
forming the evaluator as to the intervention's effectiveness, it is called a formative 
evaluation. Such a practice allows for midcourse adjustments and modifications 
to more effectively meet the final goal or objective. If an assessment is adminis- 
tered on completion of instruction or treatment, it is referred to as a summative 
evaluation. The purpose of summative evaluation is to determine whether a goal 
or objective has been met. How effective the assessment is in making this determi- 
nation depends on the preciseness of the goal or objective, the alignment of the 
treatment with the goal, and the alignment of the assessment with both the goal 
and the treatment. For example, many counseling programs administer the 
Counselor Preparation Comprehensive Examination (CPCE) (administered by the
National Board of Certified Counselors [NBCC]) near the end of the program of 
study as a summative evaluation. The test is predicated on core educational areas 
(assessment, the topic of this text, is one of the more challenging core areas) and 
well-defined educational standards. The test is composed of items selected to ac- 
curately reflect the various domains of knowledge and the importance of each do- 
main. Thus the CPCE is very well aligned with the standards it was designed to 
measure. Professional counselors participate in a graduate counseling program of 
study that prepares them for professional practice. Success on the exam is very 
much related to how closely the graduate program curriculum aligns with the test's 
standards, the skill of the instructors, factors listed in Table 7.1 (factors that affect 
student or client test performance), and the tenacity with which the students pur- 
sue and master the course contents. Stated another way, success on the CPCE, used 
as a summative evaluation, is enhanced by well-designed programs with good in- 
struction and, even more importantly, motivated, competent students. Study hard! 



Paper-and-Pencil Tests and Performance (Authentic) Assessments 



A paper-and-pencil test requires examinees to mark an answer choice, either through 
the historically literal practice of using a pencil or through more recent computer- 
based innovations such as clicking the correct answer displayed on a computer 
screen. These tasks frequently rely heavily on verbal capabilities because they require 
reading and verbal comprehension. 

Performance assessments, sometimes called authentic or alternative assessments, 
minimize verbal task demands but require the student or client to manipulate ma- 
terials or to select visual stimuli without using language, or at least by substantially 
minimizing the use of language. There is a big difference between completing a 
multiple-choice test on how to rebuild car engines (a paper-and-pencil test) and 
actually rebuilding a car engine (a performance test). Performance assessment in- 
volves the evaluation of an examinee's product, action, or behavior. A strength of 
performance assessment is that it allows the individual to demonstrate a more com- 
prehensive, real-life, hands-on understanding of a topic or dilemma. Performance 
assessments have been used for years in vocational training and gifted education
programs, not to mention physical education, woodshop, metal shop, and home
economics classes. Some state departments of education have implemented high- 
stakes performance assessment systems to assess students' depth of understanding 
by presenting them with a dilemma to be solved and the materials and time to solve 
it. Such procedures are expensive and time consuming but allow examiners to de- 
termine whether students develop necessary insights and follow desired procedures 
en route to solving complex problems. Performance assessment is sometimes done 
with less of an emphasis on reading and writing, thus minimizing the effects of 
verbal and linguistic capabilities. But this is not always the case. Some states use 
the manipulation of physical objects and props to solve a problem and then require 
the student to write a summary composition describing the various components 
of the performance task. 

Professional training programs frequently use performance assessments. For ex- 
ample, counselors-in-training frequently present videotapes of counseling sessions 
for analysis and evaluation, and interns and practicum students are sometimes ob- 
served and evaluated in live counseling, consultation, or classroom sessions. 
Instructors or supervisors then observe the demonstrations, evaluate and judge each 
performance according to some scoring scheme (usually involving a scoring rubric), 
and provide feedback regarding the student's or intern's performance. A scoring 
rubric provides the rules to be followed when assessing the quality of a performance. 
Generally, the rubric is a rating scale or checklist of essential elements that must be 
included in the product. Point values are assigned according to the quality of each 
component. 
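
A scoring rubric of this kind maps naturally onto a small data structure. The Python sketch below is illustrative only: the rubric elements, the point values, and the function name are hypothetical and are not drawn from any published counseling rubric.

```python
# Hypothetical rubric for a counseling-session tape: element -> maximum points.
RUBRIC = {
    "opens session and explains confidentiality": 4,
    "uses reflective listening": 4,
    "sets goals collaboratively": 4,
    "summarizes and closes session": 4,
}


def score_performance(ratings, rubric=RUBRIC):
    """Return (total, maximum) after checking each element was rated in range."""
    missing = set(rubric) - set(ratings)
    if missing:
        raise ValueError(f"unrated elements: {sorted(missing)}")
    for element, points in ratings.items():
        if element not in rubric:
            raise ValueError(f"element not in rubric: {element}")
        if not 0 <= points <= rubric[element]:
            raise ValueError(f"points out of range for: {element}")
    return sum(ratings.values()), sum(rubric.values())


# One supervisor's ratings of a single taped session.
ratings = {
    "opens session and explains confidentiality": 3,
    "uses reflective listening": 4,
    "sets goals collaboratively": 2,
    "summarizes and closes session": 3,
}
print(score_performance(ratings))   # (12, 16)
```

Requiring a rating for every element, and defining the elements before any performance is judged, keeps different evaluators working from the same rules.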

Popham (1999) indicated that three components must underlie authentic per- 
formance assessments: (1) Multiple evaluative criteria must be used; (2) each of the 
evaluative criteria must be clearly articulated and defined prior to judging the per- 
formance; and (3) human judgments are necessary to determine the acceptability of 
performance responses. It is this final component that critics of performance assess- 
ment take issue with. The acknowledged weakness of performance assessment is the 
difficulty of establishing the reliability and validity of scores — which are critical re- 
gardless of the type of assessment undertaken. Because performance assessment is 
time consuming, it may be possible to complete only one or several problems (i.e., 
authentic science problems to be solved, perhaps even a single "experiment") over 
the course of a two-hour examination, whereas a student may be able to complete 
more than 100 multiple-choice problems during the same period. An important sta- 
tistical concept within test development is that, all else being equal, the more items 
a test possesses, the more reliable the scores on that test (Anastasi & Urbina, 1997). 
Because human judgment (i.e., subjectivity) is required in performance assessment, 
interscorer reliability becomes an important issue. In nearly all circumstances, the 
multiple-choice test will be more reliable than the performance test, and test scores 
can be no more valid (useful) than they are reliable (consistent). Thus there is a
trade-off in using paper-and-pencil and performance tests. Paper-and-pencil tests may be
more efficient and psychometrically superior (i.e., have a higher reliability of scores), 
but performance assessments may get closer to the real-life circumstances for which 
a student is being prepared. These dilemmas are explored in detail in the chapter on 
high-stakes testing, which is available on the companion website for this text. 
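
Two classical test theory results sit behind the claims about test length and about validity being limited by reliability. The text does not name them, so they are offered here only as a hedged illustration: the Spearman-Brown formula, which projects score reliability when a test is lengthened by a factor k using comparable items, and the bound that a validity coefficient cannot exceed the square root of the product of the test and criterion score reliabilities. The function names and input values are assumptions for the example.

```python
from math import sqrt


def spearman_brown(r, k):
    """Projected reliability when a test with reliability r is made k times longer."""
    return k * r / (1 + (k - 1) * r)


def max_validity(r_test, r_criterion=1.0):
    """Upper bound on a validity coefficient given the two score reliabilities."""
    return sqrt(r_test * r_criterion)


# A short performance task with reliability .60, lengthened fourfold.
print(round(spearman_brown(0.60, 4), 3))   # 0.857

# Test scores with reliability .64 and a perfectly reliable criterion.
print(round(max_validity(0.64), 2))        # 0.8
```

The first function shows why a 100-item multiple-choice test tends to yield more reliable scores than a one-problem performance task; the second shows why unreliable scores cap how useful those scores can be.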






Practically speaking, as the owner of a car with a blown engine, who would 
you rather have working on your car: the mechanic who got more multiple-choice 
questions right or the one who rebuilt the engine in the quicker, more proficient 
manner? Perhaps a bit closer to home, who might a client prefer as a professional 
counselor: the one who received the higher score on the National Counselor 
Examination (NCE) or the one who performed better on the videotapes? If you 
said, "The one who did better (or well) on both," you can count yourself among 
a growing segment of professionals who see the benefits of both approaches. 
Breadth and depth are both critical elements of comprehensive assessment. 



Portfolio Assessment 



Portfolio assessment is a specific, and currently popular, type of performance assess- 
ment espoused by proponents of the philosophy that instruction and assessment are 
one and the same. A portfolio is a systematic and well-organized collection of work 
produced by an individual with the purpose of demonstrating that individual's skills 
and achievements. Portfolios have been used in the professions of art, architecture, 
modeling, journalism, and photography for years. In these professions, the individual 
selects exemplary works that demonstrate competence, style, talent, and versatility. In 
many counseling programs, counselors-in-training are required to develop a portfolio 
of exemplary works (e.g., counseling tapes or analyses, course papers or projects, 
events or lessons implemented, ancillary products developed). Portfolios are a wonder- 
ful way for interns to demonstrate for program faculty members the depth of their 
learning and understanding, and for potential employers the likely quality one could 
expect of the applicant if hired as an employee. However, portfolio assessment presents 
examiners with a couple of challenging problems: How does one go about evaluating 
the quality of a portfolio? Will the assessment lead to reliable and valid results? 

By now, this problem should sound familiar, and the reader should have some 
ideas as to how to solve the dilemma. Because portfolio assessment is a type of per- 
formance assessment, rubrics and other issues discussed in the performance assess- 
ment section also apply here. What is critical is that evaluators of portfolios acknowl- 
edge that the assessment system devised must conform to the highest level of 
technical adequacy possible. If it does not, students and evaluators will waste much 
time and effort on an assessment process that is difficult (perhaps impossible) to eval- 
uate. Such an assessment system could be perceived as burdensome, worthless, un- 
fair, and even biased. 

It is widely agreed that assessment of portfolios should involve both a self-assess- 
ment and an external assessment (Farr & Tone, 1994; Popham, 1999). In a self-as- 
sessment, the student provides evaluative commentary of the included works and how 
each meets certain requisite standards or demonstrates required mastery. The encour- 
agement of self-evaluation is an important developmental skill in its own right and 
is a strength of the portfolio process. External assessment involves the process of ob- 
taining judgments from professionals not related to the situation in which the works 
were created, but in a good position to evaluate those works. For example, in the
case of a counselor-in-training's portfolio, it is likely that program faculty would
be somewhat biased in their evaluation of student works. Indeed, studies have shown
that teachers tend to be biased toward their own students' work (Popham, 1999).
Thus it would be best to solicit volunteers from the professional community
unrelated to the program or students.



Table 1.2 Advantages and disadvantages of portfolio (and performance) assessment

Advantages

1. Focuses on "doing."
2. Allows for demonstration of examinee strengths, flexibility, and adaptability.
3. Highlights improvements rather than comparisons.
4. Focuses on processes and products.
5. Provides self-assessment and analysis.
6. Assesses depth of understanding and application of instruction.
7. Integrates knowledge, skills, and abilities.
8. Allows diagnosis of strengths and weaknesses.
9. Provides concrete examples of application of skills.
10. Facilitates performance-based instruction.

Disadvantages

1. Evaluation process is time-intensive for students and evaluators.
2. Useful and accurate rubrics are difficult to create.
3. Interscorer reliability is low.
4. Judges require a lot of training.
5. Stakeholders often have difficulty understanding the results.
6. Performance tasks must be well crafted and meaningful.
7. Performance on one task is often unrelated to performance on other tasks.
8. Students are frequently unsure which products to include and why.
9. Performance tasks are difficult and frustrating for low-ability students.
10. Some cultural or socioeconomic groups may underperform on certain types of performance tasks (i.e., bias).

Rubrics established for portfolio assessment must be specifically written and dis- 
tributed to students well in advance so they can prepare showcase or best-work port- 
folios that will address the portfolio standards. Alternatively, students can be encour- 
aged to develop portfolios that demonstrate growth and learning over time. 
Unfortunately, compared to most other types of assessments, portfolio assessment 
tends to be time consuming, expensive, and lacking in technical rigor (i.e., reliabil- 
ity and validity). All in all, the portfolio assessment process presents numerous dif- 
ficult challenges, and "to date, the results of efforts to employ portfolios for account- 
ability purposes have not been encouraging" (Popham, 1999). Table 1.2 presents a 
number of advantages and disadvantages of portfolio assessment. 



Think About It 1.2 Imagine that you are preparing for an employment
interview. What kinds of "products" from your courses and clinical experi- 
ences would you include in your portfolio to demonstrate your effectiveness 
as a professional counselor? 






Environmental Assessment 



Environmental assessment moves the focus of assessment and evaluation from the 
individual to the environment in which the individual functions. In workplaces, re- 
lationships, and other social situations, clients often complain that they "don't fit in." 
Normally, the focus of counseling is on how clients can change to better adapt to 
their environment and circumstances. But what if the environment could be altered, 
or changed altogether? For example, clients with an alcohol dependency may bene- 
fit from an analysis of the "who," "where," and "what" related to their social activi- 
ties. Such an analysis may point to factors that are actually barriers to recovery and 
abstinence. In schools, the contingencies in a classroom may become the primary 
focus of a behavioral observation and assessment to determine what environmental 
factors may account for the difficulties a student may be encountering. In families, 
professional counselors are keenly interested in the family environment so that sys- 
temic changes can be made to get a family moving in a more positive direction. 

As a more specific example from the career realm, employees sometimes com- 
plain about workplace conditions, stress, and burnout but continue to work in such 
environments for years and years. Some researchers (e.g., Holland, Gottfredson) are 
addressing this problem by designing measures that assess the environmental context 
and the individual, using Holland's model, featured in the Self-Directed Search (SDS) 
(Holland, Fritzsche & Powell, 1994), to determine whether the client's interests and 
competencies actually match the demands of the work environment. In a simplistic 
extension, individuals who need to be physically active but who have a job that re- 
quires a lot of desk work may experience a "disconnect" and unhappiness. Likewise, a 
"people person" may be unhappy slogging away in a cubicle all day, or at least trying 
to survive until lunch or quitting time. In both of these circumstances, altering or 
changing the environment or work tasks within an environment may be the solution 
to client concerns. In some form or another, environmental assessment has been 
around for decades, but it appears currently to be experiencing a resurgence. 



Computer-Managed, Assisted, and Adapted Assessment 



Computers are in the process of revolutionizing psychological and educational as- 
sessment. As far back as the 1930s, technological innovations have helped make the 
process of assessment more efficient and accurate, usually by speeding up scoring 
procedures for large-scale test administrations. With the widespread availability of 
personal computers since the 1970s, test publishers have actively pursued the pro- 
duction of computer software that allows a clinician's computer to administer, score, 
and even interpret a client's protocol in the comfort and convenience of the clini- 
cian's own office. With easy access to the Internet through home and office com- 
puters and public venues such as libraries, the possibilities for computer-assisted 
assessment have become nearly unlimited. Of course, these wonderful access oppor- 
tunities have arrived with a plethora of ethical and legal dilemmas. 

Computer-managed assessment, also known as computer-assisted assessment, involves
the harnessing of computers to administer, score, and interpret tests. Some
software packages or Internet sites allow an integration of all three of these functions,
while others may allow only one or two. Integrated functions are becoming more 
and more the standard. Today, computers can even store and accumulate test results 
for a single client or an entire school system in order to manage and compile sum- 
mary reports. The implications of such computer-managed systems are phenomenal 
because such databases can facilitate everything from individual treatment plans to 
outcome assessment of an agency's clientele or the evaluation of an entire school sys- 
tem's curriculum. 

Many historically paper-and-pencil tests are now available in online or per- 
sonal computer versions. Using individualized computer-assisted assessment, the 
student or client generally completes the assessment at a personal computer on 
which specialized software has been installed, or on a computer linked to an 
Internet website offering the service. Responses are easily made by clicking the 
mouse on an appropriate answer space, using a touch screen, or typing a response. 
Frequently, tests can be automatically scored and an interpretive report printed out 
within seconds after completing the test. The comprehensiveness and quality of 
these reports vary substantially. For example, the computerized packages offered 
by Pearson to administer, score, and interpret the MCMI-III (Millon, Davis, & 
Millon, 1997) and the MMPI-2 (Butcher et al., 1992) include comprehensive, de- 
tailed narratives of likely examinee characteristics and behaviors as well as diagnos- 
tic and treatment implications — for about $20 a client. On the other hand, the 
WISC-IV and WJ-III ACH computer scoring programs, which come with the stan- 
dard test kit package, provide only basic scoring and storage functions. When con- 
sidering the purchase of assessment software or other scoring services, it is a good 
idea to ask the publisher for samples of reports to determine whether they will meet 
one's professional needs — at a reasonable price. 

There are several advantages to computer-assisted assessment. Depending on the
program, cost savings can be substantial, particularly given the speed of scoring and 
interpretive reports. Some tests and inventories may require hours to comprehen- 
sively score and interpret, whereas the computer program for the same test or inven- 
tory may require only seconds. Clients and students also have much greater control 
over the rate of response and interaction; thus individuals who desire a quicker or 
slower pace are accommodated by the computer. In addition, clients with special 
needs can sometimes be better accommodated by computer administration. Clients 
with visual handicaps or reading problems and who need to have items read orally 
may find auditory computer administration more user-friendly. Clients with writ- 
ing disabilities may find the mouse or keyboard easier to manage than a pen or pen- 
cil. Students with visual processing disorders may find the auditory instruction ca- 
pabilities and larger graphical displays of computers easier to adjust to, as opposed 
to a bubble form that may look like a jumbled mess. Clients with attentional prob- 
lems may find the computer administration more engaging than a response booklet. 
The possibilities for accommodating clients with disabilities are substantial. 

Computer-adapted assessment involves an interactive process between the ex- 
aminee and the computerized assessment device. Computer-adapted assessment 
usually entails varying administration formats depending on the responses of the
examinee to previous questions. For example, when taking the computer-adapted
version of the Graduate Record Exam (GRE) administered by the Educational 
Testing Service (ETS), two examinees may be administered very different item sets 
depending on their abilities. On the paper-and-pencil administration of the GRE, 
all examinees respond to (virtually) the same set of questions, whether the student 
is of high or low ability. This leads to high-ability students answering questions 
that are mostly far too easy and lower-ability students answering questions that are 
mostly far too difficult. Computer-adapted testing solves this dilemma by estab- 
lishing a bank of items for which the item difficulties and other technical item 
characteristics are already known. An examinee with strong ability will be admin- 
istered an item of moderate difficulty, respond to it correctly, and receive an even 
harder question. The computer automatically scores the item, tracks performance, 
and is programmed to administer subsequent items until a very good estimate of 
the student's performance is obtained. Generally, students who respond correctly 
to an item continue to receive more and more difficult items until a plateau in per- 
formance occurs. At this point, the administration stops, and a final score is deter- 
mined. Note that a higher-performing student never receives the easier items in 
the item bank, but continues to be administered items of ability-appropriate dif- 
ficulty. For a student with lower ability, an incorrect response to the first moder- 
ately difficult item will be followed by an easier item. If this second item is missed, 
an easier item follows; if the response to this item is correct, a more difficult item 
follows. The process continues again until a plateau in performance is reached. 
Note that the lower-performing student is never administered the more difficult 
items that a higher-performing student receives, but continues to receive the more 
appropriate, less difficult items. 
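
The item-selection logic described above can be illustrated with a short sketch. This is a deliberately simplified "up-and-down" procedure written for illustration only; it is not the scoring model used by ETS or any published test, which typically rely on more sophisticated item statistics. The function name, the numeric difficulty scale, the `respond` callback, and the stopping rule (stop once no unused item lies between the hardest item passed and the easiest item missed) are all assumptions introduced here.

```python
from typing import Callable, List, Optional, Tuple


def run_adaptive_test(
    item_difficulties: List[float],
    respond: Callable[[float], bool],
    max_items: int = 20,
) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
    """Hypothetical sketch of adaptive item selection from a calibrated bank.

    item_difficulties: known difficulty of each item in the bank.
    respond: returns True if the examinee answers an item of the
             given difficulty correctly (stands in for a real examinee).
    Returns (ability estimate, list of (difficulty, correct) pairs).
    """
    unused = sorted(item_difficulties)
    position = len(unused) // 2          # begin with a moderately difficult item
    history: List[Tuple[float, bool]] = []
    hardest_correct: Optional[float] = None
    easiest_missed: Optional[float] = None

    while unused and len(history) < max_items:
        difficulty = unused.pop(position)
        correct = respond(difficulty)
        history.append((difficulty, correct))

        if correct:
            if hardest_correct is None or difficulty > hardest_correct:
                hardest_correct = difficulty
        else:
            if easiest_missed is None or difficulty < easiest_missed:
                easiest_missed = difficulty

        # Plateau: no unused item remains between the hardest item passed
        # and the easiest item missed, so further items add little.
        if hardest_correct is not None and easiest_missed is not None:
            if not any(hardest_correct < d < easiest_missed for d in unused):
                break

        if not unused:
            break
        if correct:
            # After pop(), the next harder unused item sits at the same index.
            position = min(position, len(unused) - 1)
        else:
            # The next easier unused item sits one index lower.
            position = max(position - 1, 0)

    bounds = [b for b in (hardest_correct, easiest_missed) if b is not None]
    estimate = sum(bounds) / len(bounds) if bounds else None
    return estimate, history


# Usage: a simulated high-ability examinee who passes every item up to 7.5.
estimate, history = run_adaptive_test(list(range(1, 10)), lambda d: d <= 7.5)
print(estimate, [d for d, _ in history])  # 7.5 [5, 6, 7, 8]
```

In this simulation the higher-ability examinee is never shown items 1 through 4, which mirrors the point made above: each examinee sees only items near his or her own level.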

Many aptitude and achievement tests now offer a computer-adapted administration 
format. Generally, examinees complete computer-adapted assessments in less 
time than paper-and-pencil administrations, and the results are available instanta- 
neously, rather than in the typical weeks-to-months wait time for mail-in scoring 
services. It is likely that computer-adaptive testing will eventually be used in other 
areas of assessment, particularly clinical, personality, and career assessment. For ex- 
ample, self-report during a computerized structured clinical interview protocol could 
allow clients to respond negatively to essential features of a major diagnostic cate- 
gory (e.g., "depressed mood or loss of interest or pleasure in normal activities") and 
subsequently skip all associated structured interview questions related to a disorder 
that is not applicable. The elimination of inappropriate items could yield a large time 
savings. 
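
The skip logic described above might be sketched as follows. This is only an illustration under stated assumptions: the `Module` structure, the function name, and the `ask` callback are hypothetical, and the questions themselves would come from whatever structured interview protocol is in use.

```python
from typing import Callable, Dict, List, TypedDict


class Module(TypedDict):
    gate: str               # screening question for the essential feature
    follow_ups: List[str]   # asked only if the gate question is endorsed


def administer_interview(
    modules: List[Module], ask: Callable[[str], bool]
) -> Dict[str, bool]:
    """Ask each module's gate question; skip its follow-ups on a 'no'."""
    responses: Dict[str, bool] = {}
    for module in modules:
        endorsed = ask(module["gate"])
        responses[module["gate"]] = endorsed
        if not endorsed:
            continue  # essential feature denied: skip the rest of this module
        for question in module["follow_ups"]:
            responses[question] = ask(question)
    return responses
```

A client who denies the gate question for a category answers one item for that category rather than the full set, which is the source of the time savings.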

The advantages of using computers for assessment are many. Computers are not 
prone to bias; they do not discriminate on the basis of sex, race, ethnicity, sexual ori- 
entation, and so forth, as some clinicians may. It is also far easier to revise adminis- 
tration, scoring, and interpretive procedures when an examination is online, because 
the changes are instantaneous. These features offer a real advantage over paper-and- 
pencil administration, in which some professionals may continue to use older ver- 
sions of a revised test simply because they have a stockpile of the older protocols they 
wish to use up, for economy's sake, before ordering newer materials. In addition, 
there is some evidence that clients may self-disclose sensitive information more hon- 
estly during computer administration, because of greater perceived anonymity, than 
during the face-to-face disclosures that occur during a typical interview (Davis, 
1999; Joinson & Buchanan, 2001). Perhaps the most overlooked advantage of on- 
line assessment is the potential for access to quality services by professional coun- 
selors, clients, and students who are geographically isolated or in some other way un- 
able to participate in more mainstream mental health services. 

Important disadvantages of using computers for assessments relate to observa- 
tion and comfort issues. Generally, when computerized assessments are used, the 
professional counselor is occupied elsewhere and not focused on observing the client 
or student engaged in the assessment process. Much helpful information can be lost 
when a client's assessment-related behaviors go unnoticed. To compound this issue, 
computer-generated interpretive reports are often accepted at face value by clinicians 
and imported wholeheartedly into reports and summaries. As is mentioned in the 
ethics discussion in Chapter 2, computerized interpretive reports are considered pro- 
fessional-to-professional consultations, and the burden of what to report and what 
not to report lies with the professional counselor charged with the care of the client. 
Computer-generated reports are meant to supplement a clinician's interpretation, 
not supplant it. Sampson, Purgar, and Shy (2003, p. 27) suggested that professional 
counselors should have, at a minimum, the following competencies to use computer- 
based test interpretation (CBTI) information effectively: 

1. An understanding of the construct or behavior 

2. An understanding of the test, including the theoretical basis (if any), item selec- 
tion and scale construction, standardization, reliability, validity, and utility 

3. An understanding of the test interpretation, including scale interpretations and 
recommended interventions based on scale scores 

4. An understanding of the CBTI, including the equivalence of test forms (if inter- 
pretations from an original form are used) and the evidence of CBTI validity 

5. Initial supervised experience in using the test and CBTI (with supervision pro- 
vided by an appropriately qualified practitioner) 

While computers are becoming more commonly used, tremendous diversity in 
use currently exists. People vary in their experience with and attitudes toward com- 
puters. While most people appear favorably disposed toward computers, group and 
individual differences have been noted. For example, Barak (2003) observed that un- 
easiness with technology led to lower performance tendencies in women in online as- 
sessments. While this empirical result has not been consistently verified, professional 
counselors are well advised to ensure computerized assessment technologies do not 
hold some groups at performance disadvantages. 

Of course, the proliferation of computer-based assessment services is not with- 
out a cost. Frequently, online tests are not developed with the same attention to 
technical rigor as the print versions of standardized tests, and information on the 
reliability and validity of online scores is sometimes impossible to obtain. In addi- 
tion, expert verification of rigor is more challenging because the testing experts 
may need to be familiar with sophisticated computer programming language in 
order to evaluate the interpretive procedures programmed into the software. The 
security and confidentiality of online assessments continue to be of major concern 
in the industry, although new encoding and encryption software shows promise in 
resolving these issues. With paper-and-pencil tests, the responsibility for the secu- 
rity of the tests and test results falls squarely on the professional counselor, who 
can frequently secure the information under lock and key. The issue of security 
and confidentiality becomes more complex when personal computers and Internet 
providers are involved, and professional counselors must take great care to ensure 
the security of the tests and the integrity of the assessment process. Finally, the 
Standards for Educational and Psychological Testing (AERA/APA/NCME, 
1999) specify that examinees offered a choice between computerized and paper- 
and-pencil tests should be educated about the features, characteristics, and pros 
and cons of each type of administration format. 



Think About It 1.3 What are the ways you anticipate using computers 
in your practice as a professional counselor? Think about and seek out the 
type of training you will need. 



SUMMARY/CONCLUSION 



This chapter has addressed purposes, standards, and terminology related to profes- 
sional use of assessment. Assessment has four purposes: screening, diagnosis, treat- 
ment planning and goal identification, and progress evaluation. Each contributes 
substantially to the overall counseling process. In addition, the counseling field has 
a number of sources intended to guide assessment education and practice. The 
Council for the Accreditation of Counseling and Related Educational Programs 
(CACREP) has established curricular requirements, and a number of professional 
organizations have developed standards to aid counselors in understanding good as- 
sessment practices. Finally, professional counselors need to be familiar with the terms 
and phrases essential to the field in order to communicate effectively with other professionals, 
to advocate for clients and students, and to make decisions in their best 
interests. 



KEY TERMS 



affective assessment 

assessment 

basal series 

behavior 

behavioral observations 

ceiling series 

cognitive ability test 



computer-adapted assessment 

computer-managed assessment 

criterion-referenced test 

diagnosis 

environmental assessment 

formative evaluation 

group test 



individual test 

maximum performance measurement 

nonstandardized test 

nonverbal test 

norm-referenced test 

objective 

objective test 

performance assessment 

portfolio assessment 

power test 

projective technique 

psychological test 

reliability 



sampling 

screening 

speeded test 

standardization 

standardized test 

starting point 

subjective 

subjective test 

summative evaluation 

test 

typical performance measurement 

validity 

verbal test 




CHAPTER 2 



Foundations of Assessment: 
Historical, Legal, Ethical, 
and Diversity Perspectives 

by Bradley T. Erford, Cheryl Moore-Thomas, and Lynn Linde 



This chapter highlights the historical, legal, ethical, and diversity issues impor- 
tant to a professional counselor's understanding of assessment. From ancient 
times through modern day, assessment has been important to humankind's 
self-understanding, and a tool both for fairness and oppression, however intended. 
While many historical events were important to the evolution of assessment in gen- 
eral, this chapter explores events relevant to assessment in the more specialized areas 
of intelligence, achievement, career, and clinical and personality. Professional coun- 
selors are engaged in a variety of ways in ensuring that clients and students receive 
appropriate assessment in these areas. Therefore, a review of legal, ethical, and pro- 
fessional standards regarding assessment, diversity factors affecting assessment, and 
test bias is also provided. The chapter concludes with a discussion of strategies, coun- 
seling interventions, and recommendations to ensure fair testing. 



THE HISTORY OF ASSESSMENT 



Throughout recorded history, people have attempted to measure and assess human 
characteristics and traits. What follows is a brief exploration of these attempts over 
more than the past three millennia, segmented into three historical periods: ancient 
times, measurement in the laboratory, and modern clinical applications. A summary 
timeline of historic events in the field of assessment is included in Table 2.1. 


Table 2.1 Assessment timeline 

500 BCE Greeks may have used assessments for educational purposes. 
220 BCE Chinese set up civil service exams to select mandarins. 
AD 1219 English university administers first oral examination. 
ca. 1510 Fitzherbert proposes first measure of mental ability (identification of one's age, counting 20 pence). 
1540 Jesuit universities administer first written exams. 
1575 Spanish physician Huarte defines intelligence in Examen de Ingenios (independent judgment, meek compliance when learning). 
1599 Jesuits agree to rules for administering written exams. 
1636 Oxford University requires oral exams for degree candidates. 
1692 German philosopher Thomasius advocates for obtaining knowledge of the mind through objective, quantitative methods. 
1799 In working with the "Wild Boy of Aveyron," Itard differentiates between normal and abnormal cognitive abilities. 
1803 Oxford University introduces written exams. 
1809 Gauss develops theory of observational error. 
1834 Weber, pioneer in the study of individual differences, studies awareness thresholds. 
1835 Quetelet develops and studies normal probability curves. 
1837 Seguin develops the Seguin Form Board Test and opens school for mentally retarded children. 
1838 Esquirol advocates differences between mental retardation and mental illness, proposes that mental retardation has several levels of severity. 
1869 Galton, founder of individual psychology, authors Hereditary Genius, sparking study of individual differences and cognitive heritability. 
1879 Wundt establishes world's first psychological laboratory at the University of Leipzig in Germany. 
1888 J. M. Cattell establishes assessment laboratory at the University of Pennsylvania, stimulating the study of mental measurements. 
1890 Cattell coins the term mental test. 
1897 Ebbinghaus develops and experiments with tests of sentence completion, short-term memory, and arithmetic. 
1904 Spearman espouses two-dimensional theory of intelligence (g = general factor, s = specific factors). 
    Pearson develops theory of correlation. 
ca. 1905 E. L. Thorndike writes about test development principles and laws of learning and develops tests of handwriting, spelling, arithmetic, and language. He later introduces one of first textbooks on the use of measurement in education. 
    First standardized group tests of achievement published. 
    Jung's Word Association Test published. 
1905 Binet and Simon introduce first "intelligence test," to screen French public schoolchildren for mental retardation. 
1909 Goddard translates Binet-Simon Scale into English. 
1912 Stern introduces term mental quotient. 
1916 Terman publishes the Stanford Revision and Extension of the Binet-Simon Intelligence Scale. 
1917 Yerkes and colleagues from the APA publish the Army Alpha and Army Beta tests, designed for the intellectual assessment and screening of U.S. military recruits. 
1918 Otis publishes the Absolute Point Scale, a group intelligence test. 
1919 Monroe and Buckingham publish the Illinois Examination, a group achievement test. 
    Woodworth Personal Data Sheet published. 
1921 Rorschach publishes his inkblot technique. 
1923 Kelly, Ruch, and Terman publish the Stanford Achievement Test. 
    Kohs Block Design Test measures nonverbal reasoning. 
1924 Porteus publishes the Porteus Maze Test. 
    Seashore Measures of Musical Talents published. 
    Spearman publishes Factors in Intelligence. 
1926 Goodenough publishes the Draw-a-Man Test. 
1927 Spearman publishes The Abilities of Man: Their Nature and Measurement. 
1928 Arthur publishes the Point Scale of Performance Tests. 
1931 Stutsman publishes the Merrill-Palmer Scale of Mental Tests. 
1933 Thurstone advocates that human abilities be approached using multiple-factor analysis. 
    Tiegs and Clark publish the Progressive Achievement Tests, later called the California Achievement Test. 
    Johnson develops a test scoring machine. 
1935 Murray and Morgan develop the Thematic Apperception Test. 
1936 Piaget publishes Origins of Intelligence. 
    Lindquist publishes the Iowa Every-Pupil Tests of Basic Skills, later renamed the Iowa Tests of Basic Skills. 
    Doll publishes the Vineland Social Maturity Scale. 
1937 Terman and Merrill revise their earlier work (Terman, 1916) as the Stanford-Binet Intelligence Scale. 
1938 Buros publishes first volume of the Mental Measurements Yearbook. 
    Bender publishes the Bender Visual-Motor Gestalt Test. 
    Gesell publishes the Gesell Maturity Scale. 
1939 Wechsler introduces the Wechsler-Bellevue Intelligence Scale. 
    Original Kuder Preference Scale Record published. 
1940 Hathaway and McKinley publish the Minnesota Multiphasic Personality Inventory (MMPI). 
    Psyche Cattell publishes the Cattell Infant Intelligence Scale. 
1949 Wechsler publishes the Wechsler Intelligence Scale for Children (WISC). 
    Graduate Record Exam (GRE) published. 
1955 Wechsler revises the Wechsler-Bellevue Intelligence Scale as the Wechsler Adult Intelligence Scale (WAIS). 
1956 Bloom publishes Taxonomy of Educational Objectives. 
    Kuder Occupational Interest Survey published. 
1957 Osgood designs the semantic differential scaling technique. 
1959 Guilford proposes the structure of intellect model in his The Nature of Human Intelligence. 
    Dunns publish the Peabody Picture Vocabulary Test. 
    National Defense Education Act provides funding for career assessment screening and high school counselor positions. 
1960 Stanford-Binet Intelligence Scale revised. 
1961 Kirk and McCarthy publish the Illinois Test of Psycholinguistic Ability. 
1963 R. B. Cattell introduces theory of crystallized and fluid intelligence. 
1965 Strong Vocational Interest Blank published. 
1966 AERA, APA, and NCME publish the Standards for Educational and Psychological Testing. 
1967 Wechsler publishes the Wechsler Preschool and Primary Scale of Intelligence (WPPSI). 
1969 Bayley publishes the Bayley Scales of Infant Development. 
    National Assessment of Educational Progress program implemented. 
    Jensen publishes controversial How Much Can We Boost IQ and Scholastic Achievement? 
1972 Form L-M (3rd ed.) of Stanford-Binet Intelligence Scale released. 
    McCarthy publishes McCarthy Scales of Children's Abilities. 
1973 Marino publishes Sociometric Techniques. 
1974 Wechsler Intelligence Scale for Children-Revised (WISC-R) published. 
    Congress passes the Family Educational Rights and Privacy Act (FERPA). 
1975 Congress passes Public Law 94-142, the Education for All Handicapped Children Act. 
    Kuder's General Interest Survey, Form E published. 
1977 System of Multicultural Pluralistic Assessment (SOMPA) published. 
1979 Federal judge Robert P. Peckham rules in Larry P. v. Wilson Riles that intelligence tests are culturally biased when used to determine African American children's eligibility for mental retardation services. 
    Leiter International Performance Scale, a language-free test of nonverbal ability, published. 
1980 In Parents in Action on Special Education v. Joseph P. Hannon, Illinois judge Grady concludes that intelligence tests do not discriminate against African American children due to cultural or racial bias. 
    New York state legislators pass Truth in Testing Act. 
1980s Volumes 1-7 of Test Critiques published. 
    High-speed computers begin to be used in large-scale testing programs. 
    Computer-adaptive and computer-assisted testing developed. 
1981 Wechsler publishes the Wechsler Adult Intelligence Scale-Revised (WAIS-R). 
1983 Kaufman publishes the Kaufman Assessment Battery for Children (K-ABC). 
1984 U.S. Employment Service publishes the General Aptitude Test Battery. 
1985 Sparrow, Balla, and Cicchetti revise the Vineland Adaptive Behavior Scales, originally published by Doll (1936). 
    AERA, APA, and NCME revise the Standards for Educational and Psychological Testing. 
1986 Stanford-Binet Intelligence Scale-Fourth Edition (SBIS-4) published, as revised by Thorndike, Hagen, and Sattler. 
1989 Minnesota Multiphasic Personality Inventory-Second Edition (MMPI-2) published. 
    Wechsler Preschool and Primary Scales of Intelligence revised. 
1990s Authentic (performance) assessment and high-stakes testing rise to prominence. 
    Volumes 11-13 of Mental Measurements Yearbook published. 
    Volumes 8-10 of Test Critiques published. 
1991 Wechsler Intelligence Scale for Children-Third Edition (WISC-III) published. 
    Kuder's Occupational Interest Survey, Form DD published. 
1992 Wechsler Individual Achievement Test (WIAT) published. 
1997 Wechsler Adult Intelligence Scale-Third Edition (WAIS-III) published. 
1999 AERA, APA, and NCME publish Standards for Educational and Psychological Testing, Third Edition. 
    Volume 5 of Tests in Print published. 
2000 Nader and Nairn publish The Reign of ETS. 
2001 Mental Measurements Yearbook becomes available through an electronic retrieval system. 
2002 Educational Testing Service revises its Scholastic Assessment Test (SAT). 
    Wechsler Preschool and Primary Scales of Intelligence-Third Edition (WPPSI-III) published. 
2003 Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV) published. 
    Stanford-Binet Intelligence Scale-Fifth Edition (SB-5) published. 



Ancient Times 



Assessment has been used and documented in many civilizations throughout history. 
As far back as 220 BCE, and continuing for more than 2,000 years, the Chinese had 
an elaborate civil service examination system to select mandarins for public service 
(Dubois, 1966, 1970). Every third year, candidates would gather to undergo tests of 
skill in areas such as horsemanship, archery, and music. Essay tests were administered 
to assess a candidate's writing skills. 

Knowledge was assessed in such areas as military competence, civil law, geography, 
and public and social ceremonies and rites. The Chinese strove to develop a fair 
and objective system by eliminating systematic bias when observed. For example, 
they used multiple judges to rate performance, rather than a single judge, and even 
had scribes copy written work in a standard handwriting format to focus judges on 
the ideas and content of a composition rather than on the differences in penmanship 
between candidates (Thorndike, 1997). Even in the early years, they went to great 
lengths to prevent cheating by isolating candidates during written and performance 
exams (Bowman, 1989). Many of these practices endure today. Such an elaborate 
system was deemed necessary in order to select the best candidates on merit, not pa- 
tronage — and the failure rate often exceeded 90%. These grueling exams went on 
for 72 uninterrupted hours. 

It is frequently hypothesized that the ancient Greeks, perhaps around 500 BCE, 
used testing in the educational processes of that day. Indeed, both Socrates and Plato 
are believed to have emphasized that efficient career choices should rely heavily on a 
student's demonstrated abilities and aptitudes. Unfortunately, much of the historical 
record for the next 2,000 years was lost. In 1540, the Jesuits, a holy order of the 
Roman Catholic Church dedicated to education and scholarly pursuits, became 
early leaders in the establishment of assessment procedures at the university level by 
administering the first written examinations. As one can imagine, this was a some- 
what controversial endeavor, followed by much debate over bias and fairness. Nearly 
60 years later, the Jesuits issued agreed-upon rules for administration of written 
exams. This innovation was cautiously followed and implemented by other univer- 
sities over the next several centuries. 



Measurement in the Laboratory 



A second "movement" in the history of assessment involved the use of testing in the 
emerging field of experimental psychology. This field sought to harness the emerg- 
ing use of the scientific method to explore the psychological world of human beings. 
Prior to the use of the scientific method, mathematical models, such as those developed 
by Herbart, Weber, and Fechner, were used to describe the effects of such concepts 
as stimulus intensity and psychological thresholds. 

Charles Darwin is often credited with spurring the experimental interest in in- 
dividual differences through publication of his book On the Origin of Species by 
Means of Natural Selection in 1859 (Cohen & Swerdlik, 1999). Darwin proposed 
that individual differences in adaptation and characteristics accounted for the sur- 
vival of entire species and individuals within species. His theory of evolution was 
controversial and thought provoking. It was especially inspiring for Darwin's half-cousin, 
the English biologist Sir Francis Galton, who made tremendously influential 
contributions to the early attempts at measurement of individual differences and 
cognitive heritability (Forest, 1974). 

Galton developed numerous techniques and instruments for measuring individ- 
ual physical and psychological characteristics, and his methods inspired the precur- 
sors to modern-day rating scales and surveys. Overall, he inspired a whole generation 
of laboratory researchers to determine individuals' "deviation from average" (Galton, 
1869, p. 11) and to classify individuals "according to their natural gifts" (p. 1) 
through his studies of heritability on sweet peas. Galton's goal was to study human 
heredity by measuring the characteristics of related and unrelated individuals and 
showing that some characteristics made individuals more "fit for survival" than oth- 
ers. He was one of the first scholars to propose that intelligence could be measured 
through assessing sensory capabilities, for intelligence stems from information, and 
all information must pass through the senses. Thus the more acute and attuned one's 
senses, the greater the likelihood of information being passed through the senses and 
influencing intellectual judgments. In 1884 he opened an exhibit at the Inter- 
national Health Exhibition, which was later reestablished at University College, 
London, as the Anthropometric Laboratory. Here Galton measured human characteristics 
and abilities such as height, weight, arm span, muscular strength, reaction 
time, discrimination of color, and visual acuity. These initial attempts at measure- 
ment, while considered to be invalid measures of intelligence by today's standards, 
nonetheless created widespread excitement in the burgeoning field of psychological 
measurement. Galton also proposed the statistical concept of correlation, although 
it was the mathematician Karl Pearson — Galton's student, close friend, and biogra- 
pher — who later provided the statistical formula for linear correlation (i.e., the 
Pearson product-moment correlation coefficient) that has endured to present day. 

In 1879, Wilhelm Wundt opened the world's first experimental psychology laboratory, 
at the University of Leipzig in Germany. He is widely regarded as the 
founder of the science of psychology (Hearst, 1979), and many of the early experi- 
mental psychologists, including Louis Leon Thurstone and James McKeen Cattell, 
studied at his lab. The hallmark of this era was the drive to rigorously control exper- 
imental conditions in order to standardize observations and collection of data. 

Cattell, a U.S. psychologist, was inspired by Galton's writings to conduct his 
doctoral dissertation on individual differences in reaction time, a study that contin- 
ued the momentum toward measurement of human characteristics. In 1890, Cattell 
was the first to use the term mental test to describe his efforts to measure intelligence. 
Kraepelin (1895) and his student Oehrn (1889) developed more sophisticated mental 
ability tests, including arithmetic, memory, and perceptual tasks. In addition, 
Ebbinghaus (1897) developed sentence completion, arithmetic, and short-term 
memory tasks. All of these early efforts to develop psychological tests continued the 
movement to the modern era of assessment. 



Modern Clinical Applications of Assessment: Decision Making 
and Determination of Individual Differences 



In any field of study it is important for the stage to be set with precursor events until 
a critical mass of knowledge has developed; historical events or social needs arise; and 
motivated, creative thinkers move the emerging field forward. In the field of assess- 
ment, many pioneers took the developing field in numerous directions quite quickly, 
leading to an explosion of assessment applications during the 20th century. These 
applications were primarily directed at identifying differences between and among 
individuals so that identification and diagnostic practices, as well as intrapersonal 
strengths and weaknesses, could be translated into remedial and treatment strategies. 
At the core, these efforts were directed at helping clinicians and educators make bet- 
ter, more accurate decisions about human beings than could be made through other, 
less standardized methods of the day. Most notably, the field moved in four primary 
directions: intellectual assessment, achievement assessment, vocational and career as- 
sessment, and clinical and personality assessment. 

Intellectual Assessment 

Many individuals have contributed to the rise of testing with educational and clini- 
cal applications. In many ways, the work of Galton, Cattell, and Kraepelin laid the 
foundation for the proliferation of these tests during the 20th century. Early at- 
tempts at measuring intelligence stemmed from the need to develop procedures to 
identify students with mental and emotional deficiencies for remedial education. In 
the earliest recorded attempt, Seguin (1866/1907) in 1837 developed the Seguin 
Form Board Test, which in some ways resembles modern efforts to assess mental 
deficiencies. 

In France, the minister of public instruction appointed physiologist and psy- 
chologist Alfred Binet to a commission tasked with determining efficient ways to 
identify children with mental retardation. Working with a French physician, 
Theodore Simon, Binet constructed the first practical intelligence test in 1905, the 
Binet-Simon Scale. This scale presented 30 brief tasks in approximate order of diffi- 
culty accompanied by relatively precise administration instructions. The original 
scale was administered under these standard conditions to a standardization sample 
of 50 children. With this comparison group, Binet could now determine any new 
child's score and evaluate or interpret it within some context. This revolutionary 
process, while crude by today's standards, allowed for a rudimentary decision-mak- 
ing process about a child's intellectual ability. In addition, Binet and Simon departed 
from the traditional focus on assessing sensory processes and focused item develop- 
ment more on reasoning and judgment. Unfortunately, the original scale derived no 
index or standardized score other than a raw score. Thus interpretations were limited 
primarily to descriptions of whether the child had basically normal intelligence or 
how far above or below normal the child's score appeared to fall. A further limitation 
of the original scale was the poor representativeness of the standardization sample to 
the overall population. 

These limitations were addressed in the 1908 revision of the Binet-Simon Scale, 
which nearly doubled the number of items on the original scale. The standardiza- 
tion sample included more than 200 children and was more representative of the 
population the test was meant to assess. In addition, Binet introduced the concept 
of mental age, an important innovation at the time, which allowed the evaluator to 
determine performance in terms other than the raw score. Each scale task or item 
was evaluated to determine the average chronological age at which a child mastered 
the task. This helped to specify normal or average performance for each item according 
to an age equivalency, which became the item's "mental age." Thus a normal 7-year-old 
child would achieve a mental age of approximately 7 years, while a bright 
7-year-old might have a mental age of 9 or 10 years. Conversely, a 7-year-old 
child with mental retardation might have a mental age closer to 4 or 5 years. The 
child's mental age (MA) and chronological age (CA) could be used to calculate a 
ratio intelligence quotient (IQ) using the formula (MA ÷ CA) × 100, a rudimentary
form of the modern IQ score. 
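
The ratio IQ arithmetic can be illustrated with a short Python sketch. The function name `ratio_iq` and the sample ages are illustrative assumptions rather than anything taken from the Binet-Simon materials; the ages simply echo the 7-year-old examples given earlier.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ = (MA / CA) x 100; both ages must use the same unit (e.g., years)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age / chronological_age * 100


# Hypothetical 7-year-olds with mental ages of 7, 9, and 5 years.
for ma in (7, 9, 5):
    print(f"MA={ma}, CA=7 -> ratio IQ {ratio_iq(ma, 7):.0f}")
```

A child whose mental age equals his or her chronological age receives a ratio IQ of 100; a mental age above or below the chronological age moves the quotient above or below 100 in proportion.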

The Binet-Simon Scale received a minor revision again in 1911, but by this time 
the interest in assessing intelligence had caught on in a number of countries, includ- 
ing the United States. Lewis M. Terman of Stanford University translated the Binet- 
Simon Scale into English, adapting, revising, and adding many items and instruc- 
tions in the process. In 1916, Terman released the Stanford Revision and Extension of 
the Binet-Simon Intelligence Scale, featuring a standardization sample of more than 
1,000 people. In 1937, this test became the Stanford-Binet Intelligence Scale (SBIS), 
revised in 1960, 1972, and 1986. The SBIS is now in its fifth edition (Roid, 2003). 

Terman's contribution was noteworthy in several ways, perhaps most impor- 
tantly because it made the widespread assessment of intelligence possible. This was 
timely because around the same time that Terman released the Stanford-Binet, World 
War I broke out and the military had a tremendous need to screen soldiers in order 
to assign them to appropriate duties in an efficient manner. The army contacted 
Robert Yerkes, then president of the American Psychological Association, to seek the 
association's help in developing large-scale assessment instruments for selection and 
classification. Instruments of that time period were nearly all individual assessments, 
which were generally time-intensive and cost-prohibitive, requiring highly skilled 
evaluators — not the kind of efficient tools needed to screen thousands of military re- 
cruits each month. In 1917, Yerkes (1921) led a committee of many of that era's 
greatest measurement experts to produce two group-administered tests of ability: the 
Army Alpha, which required reading ability and comprehension, and the Army Beta, 
a nonverbal test used to assess the abilities of illiterate or non-English-speaking 
adults. These tests used a multiple-choice format, a recent innovation popularized 
by Arthur S. Otis. Although the tests were not completed in time to be of help in 
screening World War I recruits, these early efforts at developing individual and 
group-administered tests of intellectual ability fueled widespread optimism about the 
role assessment could play in society, especially in institutions such as education and 
the military. 

Interestingly, the first tests of intelligence were produced with little thought 
given to theoretical underpinnings — that is, they were atheoretical. It was not until 
the late 1920s that discussions about the definition, makeup, and characteristics of 
intelligence were held by scholars. Spearman (1927) proposed that intelligence is dis- 
played in two dimensions: one that helps an individual solve general tasks (g), and 
another that helps individuals solve specific tasks (s). Spearman's concept (g), per- 
haps the most famous, and infamous, in the field of intelligence testing, spurred a 
great deal of empirical study and philosophical and political discussion. For example, 
in contrast to Spearman, Thurstone argued that intelligence was not explained by 
one general (unidimensional) factor called intelligence, but was actually composed 
of seven primary mental abilities. Much more discussion on the topic of intellectual 
theories and models is presented in Chapter 10. Suffice it to say here that the early 
efforts by Binet, Spearman, Thurstone, and many others led to an explosion in 
modern-day intelligence and aptitude testing. 



To be sure, there have been several periods of criticism associated with testing in 
general, and intelligence testing in particular. The first came during the 1930s, the 
time of the Great Depression, and stemmed from unclear expectations over the roles 
tests could and should play in measuring human experiences and abilities. Many 
challenges related to how to measure human abilities were raised during this time. 
Fortunately, social sciences took on these challenges with gusto, developing new as- 
sessment methods, tests, and more powerful statistical techniques to aid in analyzing 
test items and results. 

Perhaps the most famous name in U.S. intelligence testing today is David 
Wechsler (Wechsler passed away in 1981). In 1939, Wechsler, at the time a clinical 
psychologist in New York City's Bellevue Hospital, published the Wechsler-Bellevue 
Intelligence Scale. This individually administered test of adult intelligence was de- 
signed to measure the "global capacity of the individual to act purposefully, to think 
rationally, and to deal effectively with his environment" (p. 3). In 1955, Wechsler 
revised the Wechsler-Bellevue and changed the name to the Wechsler Adult Intelligence 
Scale (WAIS). It was revised again in 1981 and 1997, and this most recent edition is 
known as the Wechsler Adult Intelligence Scale — Third Edition (WAIS-III) (Wechsler, 
1997). Wechsler's adult test offered several innovations or practical facets that be- 
came industry standards over the years. First, his test was actually a series of "sub- 
tests," each measuring a different facet of intelligence. Each facet contributed to the 
overall (full-scale) intelligence quotient. Also, Wechsler was one of the first to use a 
standard deviation IQ, rather than the ratio IQ popularized by the Stanford-Binet.
Finally, Wechsler took a very pragmatic view of intelligence, rather than a theoreti- 
cal view. Basically, Wechsler chose what he believed to be the most efficient and use- 
ful measures of intelligence from previously developed measures and developed orig- 
inal items to create a particularly engaging and user-friendly format. Sources for his 
subtests included the Army Alpha (Information, Comprehension, and Picture 
Arrangement) and Army Beta (Coding); the 1916 Stanford-Binet (Vocabulary, 
Similarities, Comprehension, Digit Span, and Arithmetic); the Healy Picture 
Completion Tests (Picture Completion); and the Kohs Block Design Test. Importantly, 
Wechsler combined scores from each subtest to arrive at an estimate of general men- 
tal ability (g), not numerous primary mental abilities or specific facets of intelligence 
that others (e.g., Louis Leon Thurstone, Robert Sternberg, Howard Gardner) have 
described. The subtest format was simply a method for measuring general intelli- 
gence through multiple measures. 
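
To make the contrast with the ratio IQ concrete, the following Python sketch computes a standard deviation IQ by locating a score relative to a same-age norm group. It is a simplification under stated assumptions: the mean of 100 and standard deviation of 15 are the conventional Wechsler scaling, which the passage above does not state, and the function name and norm-group figures are hypothetical. Actual Wechsler scoring works from subtest scaled scores and published norm tables rather than a single raw score.

```python
def deviation_iq(raw_score, norm_mean, norm_sd, iq_mean=100.0, iq_sd=15.0):
    """Rescale a raw score to an IQ metric using the norm group's mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm-group standard deviation must be positive")
    z = (raw_score - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return iq_mean + iq_sd * z


# Hypothetical norm group for one age band: mean raw score 50, SD 10.
print(deviation_iq(50, 50, 10))  # a score at the norm-group mean
print(deviation_iq(60, 50, 10))  # a score one SD above the norm-group mean
```

Because the score is expressed relative to age peers rather than as a quotient of two ages, the same scale applies to adults, for whom the notion of a steadily rising mental age breaks down.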

The success of the adult Wechsler scale led Wechsler to develop a version for use 
with school-aged children from 6 to 16 years. In 1949, Wechsler published the 
Wechsler Intelligence Scale for Children (WISC). The WISC was revised in 1974, 1991,
and 2003. It is currently known as the Wechsler Intelligence Scale for Children — 
Fourth Edition (WISC-IV) (Wechsler, 2001a) and follows a subtest format similar to
that of the adult version. It is the most commonly used individually administered 
intelligence test in the world. In order to address the recent increased need for 
assessing intelligence in the preschool population, Wechsler (1967) published the 
Wechsler Preschool and Primary Scale of Intelligence (WPPSI), again following a sub- 
test format similar to that of the child and adult Wechsler versions. The WPPSI was
revised in 1989 and again in 2002 and is currently known as the Wechsler Preschool
and Primary Scale of Intelligence — Third Edition (WPPSI-III) (Wechsler, 2002). The 
Wechsler series of intelligence tests has significantly influenced intelligence testing
and the profession's conceptualization of intelligence, and is reviewed in more detail 
in Chapter 13. 

A second period of intense social and political criticism developed during the 
1960s and 1970s due to several societal factors, including the civil rights movement 
and congressional hearings into rights to privacy. This period was termed the Era of 
Discontent by Maloney and Ward (1976). Several influential books and court cases 
occurred during this period. Whyte (1956), in Organization Man, accused users of 
employment and other selection tests of choosing workers who fit the organization's
structure, or status quo, rather than those who would do the best work or were most
qualified. Houts (1977), in The Myth of Measurability, insisted that tests were instruments
of oppression used by the privileged to control the poor. Houts maintained
that tests punished creative individuals, caused irreparable damage to children 
through educational labeling, and generally were being used to make decisions the 
tests either were not meant to make or lacked the technical adequacy (e.g., reliabil- 
ity, validity) to make. 

In 1967, in Hobson v. Hansen, a federal judge determined standardized group 
ability tests to be biased and discriminatory against minorities, rendering the tests 
unacceptable as placement tests for special education. In 1979, another federal 
judge made a similar ruling regarding individualized intelligence tests in Larry P. 
v. Wilson Riles. During the 1990s, New York State's Truth in Testing Act, ostensibly
passed over concern about the possible misuse of Scholastic Assessment Test (SAT)
test scores, requires the release of all questions used on the administration of the 
SAT after it has been conducted. While perhaps well intentioned, this law allows 
the public to view every question comprising recent versions of SAT administra- 
tion, in effect making the items unavailable for further use. Such a practice drives 
up the cost to consumers (i.e., the parents of college-bound youth), because the 
College Board must spend a great deal of extra money to constantly create new 
items that have a one-time-only use. 

During this period, many expressed concern over the widespread use of intelli- 
gence and personality tests in employment and school testing programs (Thorndike, 
1997). Indeed, in 1972, the National Education Association actually called for an 
end to routine standardized achievement, aptitude, and intelligence testing. It was 
feared that such tests could be used, intentionally or unintentionally, to discriminate 
against people, particularly women and minorities. It was demonstrated that the 
content of some tests did, in fact, lead to discrimination in decision making, al- 
though not to the degree critics insisted was the case. However, as reported by 
Anastasi (1976), tests were already routinely being used to make decisions about col- 
lege admissions, schoolchildren with learning difficulties, and adult populations with 
special needs. Often these test scores were used to make decisions that were beyond 
the test's technical specifications, leading to widespread criticism, disillusionment, 
and skepticism. 

Again, test developers viewed these criticisms as challenges to be overcome 
through scientific study and developed procedures and methods to identify and cor-
rect biased test content. This process led to a movement to develop culturally fair 
and unbiased tests that is firmly implanted to this day. Nevertheless, in spite of ef- 
forts by the test publishing industry to address these issues and to allay public con- 
cerns, periodic legislation and court decisions occur that restrict the use of tests, be- 
cause no test, no matter how well developed, is perfect. Furthermore, tests are 
interpreted by professionals with varying levels of training and expertise, and mistakes
can and do occur. Of course, it is these mistakes that end up in legislatures and
courtrooms, are reported by the press, and concern the public. A
moderate amount of public wariness regarding testing can be expected to continue 
well into the future, and is probably helpful in keeping test developers and test users 
focused on best practices of test use. We further explore public concerns about test- 
ing later in this chapter. 

Achievement Assessment 

On the achievement testing front, a major shift in educational assessment occurred 
in 1845 when the Boston public school system opted for written essay exams over 
the traditional oral exams (Anastasi & Urbina, 1997). Interestingly, the arguments 
in favor of moving to this radical form of testing included broader content coverage, 
standardized conditions, standardized item selection, and reduced possibility of 
favoritism. If these criticisms of oral testing sound familiar, they should. Several were 
the same arguments later used to replace essay exams with the multiple-choice 
format. 

Between 1897 and 1903 in the United States, Joseph Mayer Rice tested tens of 
thousands of students to create the first large-scale standardized tests of spelling, 
arithmetic, and language. Rice's work stimulated additional attempts at standardized 
test development by Edward L. Thorndike of Columbia University's Teachers 
College. During the early 20th century, the Teachers College became the hub of ef- 
forts to standardize educational tests, and Thorndike and the assessment specialists 
he trained were at the center of the revolution. It was at this time that issues of sub- 
jectivity of essay and extended-response items were explored. Test developers and 
users of this era were quick to notice that judges often did not agree on the "correct- 
ness" of a constructed answer. As a result, multiple-choice and other forced-choice 
item response formats were developed and came into prominence during the first 
several decades of the century. The advent of test scoring machines around the mid- 
dle of the century made multiple-choice formats even more popular, as thousands of 
test protocols could be scored with ever-increasing efficiency (i.e., less time, greater 
accuracy, fewer scorers required). 

In the first two decades of the 20th century, achievement was measured either 
by a single test combining several subject areas into a single score, or by a single test 
constructed to measure a single subject area score. In 1923, Truman L. Kelly, Giles 
M. Ruch, and Terman published the first edition of the Stanford Achievement Test 
(SAT, not to be confused with the Scholastic Assessment Test), which is currently in
its 10th edition. The Stanford Achievement Test was the first standardized achieve- 
ment battery and was designed to measure several subject areas simultaneously and 
to report each area score separately. In this way, a teacher could understand a stu- 
dent's separate performances in math, reading, and spelling through administration
of a battery of achievement tests. Also, the SAT provided a national standard of com- 
parison so that performance of students in one school could be compared to that of 
students in various other parts of the country. As normed, multiple-choice measures, 
standardized achievement tests had many advantages over teacher-administered and
teacher-scored essay-based tests, which had previously dominated public and private 
school education. Standardized achievement tests were relatively easy to administer 
and score, objective (i.e., minimized favoritism), and less expensive; covered broader 
ranges of content; and gave a measure of student performance against that of others 
in the same grade. By the 1930s, standardized achievement tests were widely viewed 
as more reliable, meaningful, and fair than essay tests (Anastasi & Urbina, 1997). 

Numerous group-administered achievement test batteries have been developed 
over the years. In 1936, Everett F. Lindquist published the Iowa Every-Pupil Tests of 
Basic Skills, an achievement battery known today as the Iowa Tests of Basic Skills (6th 
edition). Lindquist also later developed an electronic test scoring method that made 
mass scoring of multiple-choice questions quick and inexpensive. The Metropolitan 
Achievement Test, originally published in 1931, is now in its 8th edition. A recent
arrival, TerraNova 2 (CTB/McGraw-Hill, 2001), resulted from a merging of the most
recent revisions of the Comprehensive Test of Basic Skills and the California 
Achievement Test. In 1969, the United States launched the National Assessment of 
Educational Progress program to determine the effectiveness of the country's educa- 
tional system and track changes in student characteristics and performance over 
time. The program is still in operation today. 

Perhaps the single most influential occurrence in educational testing was the 
passage of Public Law 94-142, the Education for All Handicapped Children Act
(1975), which provided federal oversight and funding for special education programs 
across the country. Refunded in 1990 and now known as the Individuals with 
Disabilities Education Act (IDEA), this landmark legislation led to the widespread 
use of individualized intelligence and achievement tests in public schools. Public Law 
94-142 resulted in educational services being provided to millions of students who
have substantial learning problems, including learning disabilities, mental retarda- 
tion, emotional disturbances, and visual, hearing, or orthopedic impairments. 

Several important individual achievement batteries were developed in the late 
1970s and 1980s to address the need for assessment of learning problems and were 
immediately put to use to assess the achievement of children and adolescents. These
batteries included the Woodcock-Johnson Tests of Achievement, now in its third edition 
(WJ-III ACH) (Woodcock, Mather, & McGrew, 2001); the Peabody Individual
Achievement Test, now in its revised edition (PIAT-R) (Markwardt, 1998); and the
Wechsler Individual Achievement Test (1992), now in its second edition (WIAT-II)
(Wechsler, 2001b). 

At about the same time as Public Law 94-142, Congress passed the U.S. Rehabilitation
Act of 1973. While the act was well known at the time for requiring wheelchair
access ramps, curb cutting, and elevators in buildings and localities that accepted
federal funds, some provisions went unnoticed until years later. Section 504
of this act required that any individual with a mental or medical impairment that 
affects occupational, learning, or social functioning (among others) is entitled to
accommodations to facilitate success. Section 504 accommodations are commonly 
provided to students in schools today whose mental or medical conditions are not so 
severe as to qualify for services under IDEA. 

During the 1980s and 1990s, many educators criticized the reliance on mul- 
tiple-choice testing on the grounds that it does not allow assessment of students' 
understanding of depth of content or reasoning, their ability to integrate knowl- 
edge from various aspects of a discipline of knowledge, or their ability to explain 
complex thoughts and ideas, because they are only required to color in a bubble,
rather than to construct their own meaningful written response. This backlash led 
a large number of states and school systems to develop "authentic," or perform- 
ance-based, assessment programs. Generally, these assessment programs present 
students with real-life problems to be solved, usually resulting in some constructed 
essay response. However, while multiple-choice questions have strengths
and limitations, so do performance-based tests. As explained in Chapter 1, one of
the primary problems with performance-based assessment is its
lower test score reliability. Many performance-based assessments do not reach a 
minimally acceptable standard of reliability to report an individual student's score. 
The passage of the No Child Left Behind Act of 2001 (NCLB) will likely reduce 
the use of performance-based tests because it requires that individual scores be re- 
ported in reading and math for students in grades 3 through 8. Still, many educa- 
tors view portfolios and performance-based assessments as better indicators of stu- 
dent performance than multiple-choice tests (Muir & Tracy, 1999; Russo & 
Warren, 1999). 

Vocational and Career Assessment 

Although he never developed a standardized assessment of vocational development, 
Frank Parsons was a pioneer in the vocational guidance movement and has come to 
be known as a founder of the school guidance movement. He advocated for the un- 
derstanding of the person and the world of work so that an individual could be 
matched with an appropriate occupation. Thus the specialized field of career assess- 
ment was born. Numerous applications and venues for career assessment have de- 
veloped over the years, and career assessment often integrates knowledge of an indi- 
vidual's aptitudes, achievements, interests, competencies, values, and beliefs. 

Around World War I, aptitude testing became critically important in the mili- 
tary (e.g., Army Alpha and Army Beta), followed by more specific applications to vo- 
cational choices. The gains made in the field of intelligence testing coupled with the 
realization that multiple abilities could be assessed (not just g) led to widespread 
applications in aptitude assessment. For example, scholastic aptitude tests were 
developed as far back as the 1920s to help identify students with the capabilities to 
meet the academic challenges of higher education. 

During the 1920s and 1930s, the use of aptitude tests became common in 
industry for the selection and classification of employees. Specialized tests measuring 
mechanical and clerical aptitudes were particularly commonly used. Perhaps more 
important in the long run, several vocational interest inventories were developed, 
foreshadowing the importance vocational counseling would hold in the future. 



During this time, Edward K. Strong published the Strong Vocational Interest Blank 
(today known as the Strong Interest Inventory), and Frederick Kuder published the 
Kuder Preference Record — Vocational. 

During World War II, the armed services again had great need to identify re- 
cruits who could fulfill increasingly technical job responsibilities. This need, along 
with development and refinement of the statistical technique known as factor analy- 
sis, led to the further development of specialized aptitude tests and the general mul- 
tiaptitude batteries. These multiaptitude batteries could help identify an individual's 
strengths and limitations, as well as predict performance in certain academic and vo- 
cational tasks. They still enjoy widespread popularity in many high school career as- 
sessment programs today because they can provide insights into intrapersonal 
strengths and weaknesses, thus helping to determine which higher education or vocational
choices may make a good fit. These multiaptitude batteries, further described
in Chapter 11, include the General Aptitude Test Battery, the Differential
Aptitude Test (DAT), and the Armed Services Vocational Aptitude Battery (ASVAB). 

In 1958, in response to the successful Soviet launching of the first satellite,
Sputnik, Congress passed the National Defense Education Act, funding school guid- 
ance counselor positions in high schools across the country with the express purpose 
of identifying students showing promise in the mathematical and science fields. 
Professional school counselors quickly learned to rely on career aptitude and inter- 
est inventories to help with this task. Numerous vocational interest, career values, 
and belief inventories have been published over the past 50 years, aiding counseling 
professionals in effectively addressing the critically important role career counseling 
plays in society today. In particular, career counselors, college counselors, and pro- 
fessional school counselors frequently use and encounter vocational aptitude and as- 
sessment instruments in their work. 

Clinical and Personality Assessment 

Clinical assessment pertains to the identification of mental disorders and related 
syndromes. Personality assessment is the applied area of psychology and counseling 
concerned with the measurement of nonintellectual affective characteristics. 
Importantly, many use the term personality in the broadest holistic sense and actu- 
ally include the measure of intellect, aptitude, and achievement under a global cate- 
gory (Anastasi & Urbina, 1997). However, in the parlance of psychological and ed- 
ucational assessment, personality assessment is generally most concerned with 
attitudes, characteristics, motivations, and interpersonal and affective traits. 

During World War I, the U.S. armed forces became interested in identifying re- 
cruits who were psychotic or otherwise not emotionally capable of military service. 
Asked to develop a personality inventory that could be efficiently administered to 
large groups of recruits, in 1919 Robert S. Woodworth developed the Woodworth 
Personal Data Sheet, basically a structured paper-and-pencil psychiatric evaluation.
WWI ended without the original test ever being put into use. However, this proto- 
col was later released for civilian use, and its creation spurred development of an en- 
tire generation of self-report personality and clinical inventories during the 1920s 
and 1930s. Unfortunately, these self-report tests assumed that respondents would
answer truthfully and be of sound mind and judgment. Of course, those the test was
meant to assess might be of neither sound mind nor judgment; the tests were trans- 
parent and responses easily faked. For example, one of the more famous questions 
from the Woodworth was, "I drink a quart of whiskey each day." From a social desirability
perspective, it is easy even for persons with an addiction to alcohol to
see the consequences of such a question. No real procedures were in place for cross- 
validation of responses, so clinicians frequently made decisions based on untruthful 
responses, resulting in tremendous criticism of this burgeoning and promising area 
of assessment. 

Test developers again went to work devising "validity scales," subscales that 
attempt to measure a client's forthrightness when answering questions. A milestone 
in personality and clinical test construction occurred in 1940, when Starke R. 
Hathaway and J. Charnley McKinley published the Minnesota Multiphasic Personality
Inventory (MMPI). This test led a resurgence of self-report personality inventories
because it addressed the issue of respondent forthrightness and developed several
validity scales that helped examiners to identify potentially invalid test protocols.
The MMPI has become the most commonly used and widely researched structured 
clinical inventory in the history of assessment. Importantly, the MMPI was devel- 
oped and used to assess the clinical population for mental and emotional disorders, 
not the personality functioning of nonclinical individuals. However, the success of 
the MMPI in addressing the critics of self-report inventories spurred numerous other 
clinical inventories (e.g., Millon Clinical Multiaxial Inventory [MCMI], Beck
Depression Inventory [BDI], Achenbach System of Empirically Based Assessment
[ASEBA]) and behavioral inventories (e.g., Conners' Rating Scales [CRS-R], Behavior
Assessment System for Children [BASC]) for clinical purposes, as well as personality
inventories used with the general population (e.g., Myers-Briggs Type Indicator, 16 PF).
The MMPI is now in its second edition (MMPI-2) (Butcher et al., 1989) and also 
has an adolescent version, the Minnesota Multiphasic Personality Inventory — 
Adolescent (MMPI-A) (Butcher et al., 1992). The MMPI is also somewhat different
because it was not developed using factor analysis; instead it relies on items that are 
empirically derived and criterion based. Currently, trait perspectives and the five-factor
model (Costa & McCrae, 1992) dominate the field of structured personality
assessment. Traits are enduring characteristics, and the research on personality assess- 
ment appears to consistently identify a limited number of traits that underlie per- 
sonality functioning (e.g., optimism, extroversion, openness to experience). This 
model and numerous clinical and personality inventories are explored further in 
Chapter 8. 

Another method of personality assessment was conceived at about the same 
time as Woodworth's self-report measure. In 1921, Swiss psychiatrist Hermann 
Rorschach created a set of inkblots that aspired to provide examiners an x-ray view 
of a client's personality. The Rorschach Inkblot Test sought to explore individuals' 
unconscious thoughts and feelings by allowing them to "project" these thoughts,
feelings, needs, hopes, fears, and motivations onto ambiguous stimuli in an un- 
structured task — in this case a blot of ink on a piece of paper that was folded in 
half to form an otherwise meaningless, bilaterally symmetrical design. The inkblot
itself holds no meaning; clients attempt to structure the activity by projecting 
meaning from the perspective of their own worldviews and particular personali- 
ties. Response requirements are purposefully unclear, and the scoring criteria often 
are very subjective. The technique did not catch on immediately in Europe but be- 
came very popular in the United States during the 1930s and 1940s, when it was 
adopted by many psychoanalysts, who viewed it as consistent with Freud's goal of 
exploring the unconscious. The technique became even more popular in the 1950s 
and 1960s as the field of clinical psychology and personality assessment in general 
grew tremendously. 

Numerous other projective tests have been developed, including single-word as- 
sociations (e.g., "Say the first thing that comes into your mind when I say the word 
mother"); incomplete-sentence blanks (e.g., "Complete this sentence: Friends think
I ______"); and drawing and storytelling tasks. In 1935, Henry A. Murray and
Christiana D. Morgan published the Thematic Apperception Test (TAT), which aimed 
to give clinicians insight into client personality functioning by having the client look 
at ambiguous pictures and tell a story about each. Ostensibly, clients would project 
their needs and motivations into the story, yielding valuable clinical insight. Well- 
known drawing techniques include the House-Tree-Person (H-T-P) and Kinetic 
Family Drawing (KFD) techniques. For example, in the H-T-P clients draw pictures 
of a house, a tree, and a person, and the examiner generally asks a number of follow- 
up questions about each drawing. Each technique shares a common thread: There 
are no right or wrong answers, just what is on one's mind and projected into the sit- 
uation. Projective tests have the advantage of promoting forthrightness in clients be- 
cause they usually have no idea what is expected, and therefore find it difficult or 
unnecessary to be deceitful. 



Think About It 2.1 What events or issues appear to consistently spark 
the interest of the government and citizenry in testing? 



General Historical Events Affecting Assessment 

While the specialized disciplines of assessment (intellectual, achievement, career, per- 
sonality) each contributed milestones of import, many more general events con- 
tributed to the integration and advancement of the field. And while many of the 
landmark advances in testing stemmed from wartime needs, the successful use of 
tests in the military led to their widespread use in other avenues of society, includ- 
ing education and industry. Important societal needs during the middle decades of 
the 20th century drove this utilization, including free public education, substantial 
population increases, mandatory school attendance, large increases in the number of 
college-bound youth, civil rights movements for women and minorities, and the 
rights of handicapped children and adults. Many of these testing initiatives stemmed 
not only from general societal concerns, but also from specific test-related issues such 
as sexual bias, cultural bias, and unfairness to certain segments of the population, all 
leading to improvement in the development of tests. 



Foundations of Assessment 61 

The rapid advancement of testing in the 1920s and 1930s led to a tremendous 
need to identify, catalog, and provide critical evaluations of available instruments. To 
fill this need, in 1938 Oscar K. Buros published the first edition of the Mental 
Measurements Yearbook (MMY). A new edition of these test reviews is produced every 
couple of years and is now available in full text (online or CD-ROM) through most 
university library systems. 

With the proliferation of thousands of tests being published during the first half 
of the 20th century, test developers and examiners realized that there was a lack of 
standards governing the development and use of psychological and educational tests. 
The American Psychological Association published a guidebook of technical recom- 
mendations for test use in 1954 and was joined by the American Educational 
Research Association (AERA) and the National Council on Measurement in
Education (NCME) in 1974 to publish the first edition of Standards for Educational
and Psychological Tests. These standards were revised in 1999 and continue to serve 
as a resource for the use and evaluation of tests. Likewise, the Association for 
Assessment in Counseling and Education (AACE) published the Responsibilities of 
Users of Standardized Tests (RUST-3) statement, which is now in its third edition 
(AACE, 2003a). 

In one of the first cooperative mergers among test publishers, the American 
Council on Education (ACE), the Carnegie Corporation, and the College Entrance 
Examination Board (CEEB) combined forces during the 1950s to establish the 
Educational Testing Service (ETS). This merger centralized the publication and scor- 
ing of some important tests into a profitable and convenient joint endeavor. ETS 
continues to publish the Scholastic Assessment Test (SAT) and the Graduate Record 
Exam (GRE) to this day. 

In education, the pendulum continues to swing. Most notably, the humanistic 
orientation of the 1970s was replaced by a back-to-basics movement and the current 
standards-based and high-stakes approaches to assessment. The back-to-basics move- 
ment led many states to develop minimum competency examinations that were de- 
signed to ensure that students graduating from high school had the minimum essen- 
tial academic skills to function in a modern society (Lerner, 1981). High-stakes 
testing (a chapter on this subject is available on the companion website) may result 
in students not being promoted to the next grade or not graduating from high school 
unless achieving a certain minimum level of proficiency measured by the test. 
Similarly, some states have mandated examinations for teachers to demonstrate that 
they can read, write, and communicate effectively and that they have mastered the 
content of the subject they were hired to teach. 

Several significant pieces of legislation were passed during the 1970s, including 
the 1974 Family Educational Rights and Privacy Act (FERPA), which mandated the 
rights of parents and children over the age of 18 years to view school records and re- 
quired parental consent for assessment conducted around specific topics. 

Computers have changed the complexion of assessment and will continue to do 
so for the foreseeable future. Computers can now be used to administer, score, and 
interpret numerous psychological and educational tests, greatly aiding the efficiency 
of the process. Now examiners can receive scoring and interpretive services in the 



62 Chapter 2 



comfort of their own offices for assessment instruments as diverse as career, achieve- 
ment, and intelligence tests — even the MMPI-2 and Rorschach. 

Computer-assisted career guidance programs were devised in the 1960s and 
continue to grow in strength and purpose even today. High school students regularly 
cruise the Internet to take online career inventories, find information about career 
and educational opportunities, and even locate scholarship funds and complete on- 
line college and job applications. Accessible, low-cost, and quick, the immediate re- 
sults and feedback of such innovations are the primary reasons for their continued 
success (Zunker & Norris, 1998). 

Adaptive testing has made administration and scoring of large-scale testing pro- 
grams even more efficient. College students taking the GREs can now spend less 
time on the computer-administered version than they would sitting in a classroom 
with a paper-and-pencil version, and they can even find out their scores at the con- 
clusion of the tests rather than anguishing for weeks. Schools can now receive com- 
puter-generated interpretive reports that can be given to parents so that they may 
understand their children's performances. Clients can take tests online, via the 
Internet, making assessment incredibly convenient and efficient for everyone. 
However, with technological innovation come ethical and legal challenges, topics 
that are addressed later in this chapter. 

Issues of diversity in assessment have been addressed by several professional
organizations, and the AACE has compiled a list of these standards
(http://aace.ncat.edu). During the 1990s, education experienced a shift toward performance-based,
authentic assessment, which strives to assess students' depth of understanding by hav- 
ing them perform a task rather than take a pencil-and-paper examination. Likewise, 
an assessment initiative known as portfolio assessment became very popular during 
this time. Used for decades in modeling, art, and architecture, portfolios are a col- 
lection of performance products or samples that can be displayed and evaluated ac- 
cording to quality indicators. Breadth and depth of understanding displayed through 
real-life performance is key to this form of assessment. 

In summary, the past century has witnessed the many ups and downs of testing 
as well as professional and technological innovations. Many criticisms have been pro- 
posed, leading to changes in test development procedures and administration prac- 
tice. The next section explores some of these concerns in more detail. 



PUBLIC AND PROFESSIONAL CONCERNS 
ABOUT ASSESSMENT 



Millions of tests are given annually to help make decisions about people's lives. The
scope of test use in the United States alone is immense. The No Child Left Behind 
Act of 2001 requires standardized testing of all public school students in grades 3 
through 8. Nearly 2 million high school students take a college admissions test such 
as the SAT or ACT each year. Almost 75,000 take a special admissions test for business
school, and more than 100,000 take one for law school admission.




Tests are important and helpful sources of information that, when used appro- 
priately, help decision makers make better, more accurate decisions than can be made 
without the use of assessment instruments. However, sometimes the process does not 
work as planned. Decision makers may sometimes misunderstand the purpose of a 
test or use tests to make decisions for which the test scores were never validated. 
Sometimes the actual assessment process or the criteria for success are perceived as 
unfair by professionals or the public. Finally, the issue of testing has sometimes been 
viewed as a political tool, and has been used as one by some critics. Testing is big 
business, meaning big money. Also, allocation of resources for schools and individ- 
uals with disabilities or certain economic considerations is frequently tied to test per- 
formance. For example, in some states higher-performing schools meeting state goals 
have been rewarded with monetary compensation (e.g., program funding). In oth- 
ers, lower-performing schools have received increased levels of funding for new aca- 
demic initiatives to help close the achievement gap. In mental health clinics and 
practices around the country, third-party reimbursement is achieved through assess- 
ment and diagnosis of mental disorders. Eligibility for special education services 
under IDEA or accommodations under Section 504 of the U.S. Rehabilitation Act 
of 1973 involve assessment procedures preceded or followed by funding allocations. 
In many ways, funding and assessment go hand in hand, meaning that politics are 
inevitably involved. 

Ebel (1976) indicated that primary critics of testing include professional educa- 
tors concerned about the effect standardized testing has on accountability and cur- 
riculum in the schools, reformers who view standardized testing as outmoded and 
counterproductive to quality instruction, and media representatives looking to reveal 
scandalous proceedings in social institutions. In fairness, the majority of teachers and 
the vast majority of parents support the use of standardized testing, but a vocal, polit- 
ically motivated minority keeps the issue at the forefront of national attention. 

This is not to say that standardized testing has not been used in ways deserving 
of criticism. Table 2.2 lists numerous issues creating public concern, even com- 
plaints. Throughout this book, best practices meant to mitigate each of these com- 
plaints will be addressed in some manner. Here we give a brief treatment of these 
complaints. 

Table 2.2 Some public complaints about tests 

■ Decisions about children's lives should not be made on the basis of a single high-stakes test 
score. 

■ Tests are biased and unfair to minorities and women. 

■ Tests create anxiety and stress. 

■ Tests label and categorize. 

■ Test developers dictate what students must know or learn. 

■ Teaching to the test inflates scores. 

■ Multiple-choice questions punish intelligent, creative thinkers; trivialize the complexities of 
the learning process; and reward good guessers. 




Decisions About People's Lives Should Not Be Made
on the Basis of a Single High-Stakes Test Score



We couldn't agree more! Professional counselors who make decisions about the lives 
of others using a single test score are behaving unprofessionally, unethically, and, de- 
pending on the location of practice, perhaps illegally. All major national professional 
organizations agree on this point, as a quick perusal of major national organization 
position statements on high-stakes testing will support. The same is generally true in 
education. For the past 30 years and continuing through today, U.S. law has forbid- 
den placement of students in special education classes on the basis of a single test. 
Today, legal battles have ensued over a state's ability to withhold a diploma from a 
high school student who met all curricular requirements and passed all academic 
coursework but failed to obtain a minimum acceptable score on the state's high-stakes
test. Numerous universities "require" a certain SAT or ACT score for admittance
but state that the admissions process "takes other factors into consideration."
An axiom in assessment by counseling professionals should be that decisions about
people's lives should be made using multiple sources of information provided by multiple
respondents. Using a single piece of data or data provided by a single source to make 
an important decision about a person is just plain wrong. 



Tests Are Biased and Unfair to Minorities and Women 



This issue receives far greater treatment later in this chapter, but for now it is impor- 
tant to understand that tests are used to predict some performance criterion, and that 
the concepts of fairness and bias have to do with how effectively tests accomplish this 
goal for differing groups of individuals (e.g., by race or gender). Thus, if an intelligence
test systematically places some groups at an advantage and others at a disadvantage in
predicting the performance criterion, it could be biased. In modern practice, test authors
regularly go to great lengths to ensure fairness in test content, but because cultures
vary, bias of individual items may vary also.

Of course, it is essential that the performance criterion be equally free from bias. 
An example is the sometimes-reported observation that standardized achievement 
tests must be biased against girls because boys sometimes outperform girls on mul- 
tiple-choice tests, but girls get higher grades in school-based classes. It is easy to jump 
to this conclusion, except for one thing. Consider that the standardized test scores 
are objectively derived and subjected to bias analyses. Can the same claim be made 
for school grades? Nearly any school teacher will confirm that, on average, girls turn 
in homework more frequently, prepare for exams and study more, are better behaved 
in the classroom, and generally get higher test scores than boys. If this is the case, 
girls should get higher grades than boys, but higher grades do not necessarily mean 
that one knows more or has better mastery of the course content. Given this context, 
it is just as logical to conclude that the criterion (grades) is more biased against boys 
than standardized tests against girls. The point is, always consider the bias and fair- 
ness of both the predictor (i.e., the variable/test score used to predict the criterion) 
and the criterion. 






Tests Create Anxiety and Stress 



That tests create anxiety and stress is, of course, true; but not always in the way many 
fear. Large-scale group-administered testing certainly creates a degree of stress that, 
hopefully, reaches a moderate level. Remember the Yerkes-Dodson law: Moderate 
anxiety maximizes performance; low and high anxiety minimize performance. Of 
greatest concern is a student's phobic or panicked reaction due to a high degree of 
anxiety, usually with high-stakes tests. While there is certainly anecdotal evidence to 
support this claim of high degrees of pressure being placed upon students (includ- 
ing physical illness, vomiting, and crying), this claim is not true for the vast major- 
ity of students. Professional counselors understand that a small percentage of the 
population suffers from test phobia and take steps to treat it when appropriate. 
Professional counselors also understand that a significant proportion of the school-aged
population may be diagnosed with an anxiety disorder (see Chapter 7), usually
Generalized Anxiety Disorder, and take steps to treat these difficulties when appropriate.
Anxious people are likely to get upset about tests and myriad other life events.
Professionals need to predict who will be affected and to take preventive and inter- 
ventive measures. All told, the vast majority of individuals are not harmed or unduly 
upset by standardized testing. In fact, most educators are far more concerned about 
the other end of the spectrum — unmotivated students who care too little and do not 
get anxious enough about tests. 



Tests Label and Categorize 



While it is true that tests label and categorize, technically speaking, it is the decision 
makers (e.g., professional counselors, multidisciplinary team members, or mental 
health professionals) who label and categorize. Frequently, labeling is a necessary evil 
in society because labels are used to identify individuals in need of, and entitled to, 
services. For example, identifying a child with a learning disability is a step toward 
obtaining the educational services the child may need for academic success. Clinical 
tests are often used to identify individuals with mental disorders so that third-party 
(i.e., insurance company) reimbursement can be obtained for counseling services. In 
this way, tests can be a valuable aid in making more accurate decisions about the cat- 
egories that clients and students are determined to fit. 

While the public holds many concerns about labeling of clients and students, 
much of the concern about the use of labels lies in two areas: (1) that tests may be 
used to mislabel an individual, and (2) that labels may be used as an excuse for some 
remediable (or even nonexistent) condition. Professional counselors must always be 
aware of the potential for misidentification. Tests are not perfect predictors; nothing 
is. Tests are instruments that inform the decisions of professional counselors and 
must be used with other sources and types of information to arrive at accurate deci- 
sions. Inaccurate labels tend to have detrimental consequences for clients, sometimes 
lasting for many years. For example, a 7-year-old boy inaccurately identified as men- 
tally retarded may spend three or more years in an instructional program specially 
designed for students with mental retardation. A young man inaccurately diagnosed 






with schizophrenia may not only receive improper treatment, but be followed by an 
erroneous paper trail and even wrongful discrimination in the workplace. 

Others may use a label as an excuse for not trying in school or not pursuing 
effective treatment strategies. For example, children with Attention-Deficit/ 
Hyperactivity Disorder (AD/HD) may use the condition as an excuse for not try- 
ing hard in math. Worse, teachers and parents may use the diagnosis as an excuse 
for not encouraging such students to put more effort into their studies. Excuses 
such as, "He has a poor memory," "She can't write well so shouldn't be expected 
to," or "He'll always be disorganized" may be true to a certain degree but also may 
become self-fulfilling prophecies with no effort put forth to ever cope and com- 
pensate for difficulties. 



Test Developers Dictate What Students Must Know or Learn 



Developers of achievement tests select items that measure the domain of knowledge 
being assessed. They use several methods in this process, including curriculum and 
textbook reviews, reviews of previously available tests, and consultation and evalua- 
tion of experts in the given content area. The goal is to develop a test that faithfully 
and accurately samples the domain of knowledge. In today's standards-based and 
large-scale (group) assessment atmosphere, it is common for state departments of ed- 
ucation to develop their own learning standards and instructional objectives and to 
contract with publishers to measure those standards and objectives. Good curricu- 
lum evaluation starts with well-defined standards, which are then implemented 
through an effective curriculum (including benchmarks, instructional objectives, 
and instructional activities) and appropriately assessed. 

The key is for the test or assessment program to align perfectly with the curricu- 
lum, and for the curriculum to align perfectly with the standards. In the past, many 
large-scale achievement tests were "off the shelf" and thus may or may not have 
aligned with a given school's curriculum. Misalignment can result in lowered test 
scores. For example, if a curriculum teaches only half of what an achievement test 
measures (i.e., 50% overlap between test and curriculum), then low scores will result. 
Unfortunately, it was difficult for educators to determine whether low student scores 
were due to misalignment (i.e., students were not taught half of what they needed to 
know to do well on the test) or poor skills (i.e., students did not master the half of 
the items that they were taught). 
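The ambiguity described above can be illustrated with a deliberately simplified sketch. The function name and the model behind it are assumptions made for illustration only: students are assumed to miss every item they were never taught (no guessing, no transfer) and to answer taught items correctly at a rate equal to their mastery. Under those assumptions, two very different situations produce the same observed score.

```python
def expected_proportion_correct(overlap: float, mastery: float) -> float:
    """Expected proportion of test items answered correctly.

    overlap: share of test items covered by the curriculum (0 to 1).
    mastery: share of taught items the students actually mastered (0 to 1).
    Simplifying assumption: untaught items are always answered incorrectly.
    """
    if not (0.0 <= overlap <= 1.0 and 0.0 <= mastery <= 1.0):
        raise ValueError("overlap and mastery must be between 0 and 1")
    return overlap * mastery


# Case 1: misalignment. Only half the test was taught, but it was fully mastered.
misaligned = expected_proportion_correct(overlap=0.5, mastery=1.0)

# Case 2: poor skills. The whole test was taught, but only half was mastered.
poor_skills = expected_proportion_correct(overlap=1.0, mastery=0.5)

print(misaligned, poor_skills)  # the two cases yield the same score
```

Because both cases yield the same proportion correct, the score alone cannot tell an educator which problem is present; that requires separate information about what was actually taught.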

Recently, educators and test publishers have worked collaboratively to develop 
large-scale tests that are tailored to state needs and aligned with state learning stan- 
dards. Frequently, these tests are composed of "off the shelf" items that do apply 
to the state standards and are augmented with item pools that measure additional 
specific state standards. In this way, test items align more precisely with state stan- 
dards, and the burden is on school systems and individual teachers to develop and 
implement an effective curriculum. The mechanics of this issue are addressed in the
chapter on high-stakes testing, which is available on the companion website for 
this text. 






"Teaching to the Test" Inflates Scores



As a continuation of the previous criticism, teachers are supposed to implement a 
curriculum that provides the bridge between standards and assessment. "Teaching to 
the test," a phrase loathed by most educators, means that the focus of instruction
becomes so prescribed that only content that is sure to appear on an exam is addressed
in instruction. Obviously, if this occurs, test scores should rise. 

Whether test scores are inflated in this instance is a matter of content mastery. 
Consider an example from the classroom. Teachers "teach to the test" all the time in 
the regular curriculum. They have a learning objective, say single-by-single-digit
multiplication (e.g., 3 × 6 = 18, 7 × 8 = 56); instruct students in the process for arriving
at correct solutions; assign activities in class and for homework to enhance student
mastery; and then, finally, test student knowledge with some kind of teacher-made
or textbook examination. If the students are prepared and motivated, and the
teacher implements the instruction efficiently, students should receive high scores. 
Whether the scores are inflated depends on whether the student scores reflect mas- 
tery of the domain of behavior — that is, can the students effectively solve nearly all 
single-by-single-digit multiplication problems. If the answer is yes, great — that was 
the goal. In contrast, assume the teacher decides ahead of time that the test will be 
comprised of 10 items and the students are instructed and drilled only on those 10 
items. It is quite likely that the students will do very well on the examination but not 
be very proficient at calculating items from the broader domain. In this example, the 
test scores do not accurately reflect the level of mastery of the total domain. As a re- 
sult, it can be said that the scores are inflated. 
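The multiplication example can be sketched in a few lines of Python. This is an illustrative toy, not a measurement model: the domain is assumed to be the 100 facts from 0 × 0 through 9 × 9, the 10 test items are drawn at random with an arbitrary seed, and the hypothetical students are assumed to know exactly the drilled items and nothing else.

```python
import random

# Assumed domain: every single-by-single-digit multiplication fact (100 items).
domain = [(a, b) for a in range(10) for b in range(10)]

# The teacher fixes a 10-item test in advance and drills only those items.
rng = random.Random(0)  # arbitrary seed, used only for repeatability
test_items = rng.sample(domain, 10)
drilled_knowledge = set(test_items)


def proportion_correct(items, known):
    """Share of the given items that fall within what the student knows."""
    items = list(items)
    return sum(item in known for item in items) / len(items)


score_on_drilled_test = proportion_correct(test_items, drilled_knowledge)
mastery_of_domain = proportion_correct(domain, drilled_knowledge)

print(f"Score on the drilled test: {score_on_drilled_test:.0%}")
print(f"Mastery of the full domain: {mastery_of_domain:.0%}")
```

The gap between the two figures is what "inflated" means here: the test score overstates mastery of the domain the test was meant to sample.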

To solve this dilemma, test publishers, state education departments, and local 
educators must work collaboratively to develop test items that adequately sample the 
broad content domain and standards. Equally important, these entities must protect 
and secure the test content so that teachers do not know which items will appear. 
This ensures that student test performance reflects content mastery, not the teaching 
of how to solve specific items. In the end, if teachers understand the standards, are 
provided with an effective curriculum and material resources, and effectively imple- 
ment the instructional strategies, then motivated, prepared students will master the 
domain of knowledge being assessed. (Note that there are a lot of "ifs"!) 



Multiple-Choice Questions Punish Intelligent, Creative Thinkers; 
Trivialize the Complexities of the Learning Process; 
and Reward Good Guessers 



While multiple-choice questions can effectively measure knowledge and skills in di- 
verse areas, it would be absurd to propose that they can effectively measure every- 
thing. Sometimes extended-response items (e.g., essays) or performance evaluations 
are necessary because they allow for the assessment of applied skills and more thor- 
ough explanations. For example, in the training of professional counselors, it is a 
necessary and common occurrence for the trainee to be observed actually counseling 






clients, either live or on video. No multiple-choice or essay test can substitute for this 
performance assessment. That is not to say that certain knowledge components of 
the counseling process cannot be tested — only that the act of counseling is a fluid, 
applied process that happens with real people. In some instances, indirect measures 
cannot be substituted for direct measures. 

Whether multiple-choice items measure trivial or meaningful information is re- 
ally in the hands of the test developer. Remember from the discussion above that test 
items are created to measure some standard or objective so that an inference can be 
made about the mastery of a domain of behavior. Thus if the standard or objective 
is trivial, so will be the question. Well-crafted multiple-choice questions can meas- 
ure advanced, high-level thinking every bit as well as other response formats. It all 
comes down to the skill of the item writer. 

The criticism is often made that students who are "good guessers" or "lucky 
guessers" can get significantly higher scores on a multiple-choice test. However, the 
facts simply do not support this assertion. On a typical four-choice, multiple-choice 
question, the likelihood of getting a question correct just by guessing is 25% (0.25). 
Now if the test has very few items on it, getting one additional question correct 
might make a difference, but most large-scale assessments have hundreds of ques- 
tions, and subtests usually have dozens. Thus to get an appreciably higher score, one 
would have to guess correctly on several to perhaps dozens of questions. Anyone can 
beat the odds, of course; but what are the odds of beating the odds? Let's use as an 
example that students would need to guess correctly on four questions in order to ap- 
preciably increase their score. When you know that a student has a 25% (0.25) 
chance of guessing correctly on each item, the odds are easy to compute: 0.25 × 0.25
× 0.25 × 0.25 ≈ 0.004, or about a 0.4% chance of guessing correctly on all four items. This
means that 4 out of 1,000 students taking that subtest might get a substantially 
higher score. Now if one is a die-hard gambler, these odds are about four times 
higher than hitting the "Pick 3" Lotto — something to get excited about, perhaps. 
But in the assessment arena, few would bet their college admission prospects, or their 
grade in an assessment course, on them. 
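The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and both functions assume that each guess is independent and purely random (no partial knowledge, no elimination of distractors). The second function is a standard binomial calculation that generalizes the example to "at least k correct out of n guessed items."

```python
from math import comb


def prob_all_correct(n_items: int, p: float = 0.25) -> float:
    """Probability of guessing every one of n_items correctly."""
    return p ** n_items


def prob_at_least(k: int, n: int, p: float = 0.25) -> float:
    """Probability of at least k correct among n independent random guesses."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


four_of_four = prob_all_correct(4)
print(four_of_four)                  # 0.00390625, i.e., about 0.4%
print(round(four_of_four * 1000))    # about 4 students per 1,000
```

Changing the number of answer choices only requires changing p; for example, p = 0.2 would represent a five-choice item under the same assumptions.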



Learning From Past Mistakes and Criticisms 



Periodically throughout history the use of tests has come under attack, and such at- 
tacks sometimes limit the widespread application of test use in society. These move- 
ments are often double edged; they highlight fair criticisms of the power that tests 
sometimes wield in decision making but fail to replace the current system with one 
that is more objective, accurate, and fair. This is the dilemma: Tests have risen to current
prominence because they provide more objective, accurate, and fair information
on which decisions can be made . . . but . . . because no test is perfect, errors can and
do occur in the decisions made. What critics often fail to mention is that a systematic
decision-making process using standardized tests most often results in fewer poor decisions
than a nonsystematic decision-making process based on "judgment," in which
the decision maker becomes the instrument (more on this in Chapter 7). Individuals 
exercising judgment are just as susceptible to threats to reliability and validity as tests. 




To prevent biased judgments, professional counselors receive substantial train- 
ing in assessment. Professional counselors must understand the important concepts 
that guide the development of assessment instruments in order to become informed 
consumers. The future of assessment in counseling depends on professional coun- 
selors being able to use assessments effectively to benefit students and clients, to base 
their decisions on objective facts, and to replicate and justify those decisions on the 
basis of scientific evidence, not subjective "feel." Professional counselors have a pro- 
fessional duty and responsibility to know as much as they can about all facets of 
counseling in order to best serve and advocate for students and clients. 



ETHICS AND ASSESSMENT 



Counseling, like many other professions, is guided both by laws and by ethical stan- 
dards. Laws regulate who can perform what type of counseling, in which settings, 
and with which clients. Additionally, in the area of assessment, myriad policies and 
procedures regulate who can be or is assessed, under what circumstances, for what 
reasons, and who is qualified to administer and interpret the assessments. However, 
despite the controls that exist within the area of assessment, there is still tremendous 
room for judgment on the part of the professional regarding these issues. 
Responsibility for final decisions regarding conduct rests with counselors themselves 
(Wickwire, 2002). In the absence of laws, policies, and procedures, ethical standards 
are the basis for appropriate and professional behavior. Codes of ethics propose 
guidelines for standards of professional behavior, and it is essential for professional 
counselors to be familiar with and follow these standards in order to provide high- 
quality, professional counseling services. 

Both laws and ethical standards are based on generally accepted societal norms, 
beliefs, customs, and values (Fischer & Sorenson, 1997) and exist for the good of so- 
ciety. However, laws are more prescriptive, have been codified, and generally carry 
penalties for failure to comply. Ethical standards are generally developed by profes- 
sional associations to guide the behavior of a particular group of professionals. 
According to Herlihy and Corey (1996), ethical standards serve three purposes: to 
educate members about sound ethical behavior, to provide a mechanism for account- 
ability, and to serve as a means for improving professional practice. They also serve 
a fourth purpose — to educate, and therefore protect, the public about the standards 
of behavior they can expect from a particular group of professionals. Associations pe- 
riodically update their ethical codes to ensure continuing relevance and applicability 
and involve stakeholders in the process. The enforcement of ethical standards is the 
responsibility of the association, which is usually limited in what it can do to mem- 
bers who fail to comply. It is the responsibility of each member to voluntarily com- 
ply and behave ethically because it is the right thing to do, although sanctions for 
noncompliance may occur. 

Forester-Miller and Davis (1996) suggested that Kitchener's five moral princi- 
ples are the cornerstone of the American Counseling Association's ethical standards. 
The first is autonomy, which refers to clients' independence and right to make sound 
and rational decisions on their own. Nonmaleficence is often referred to as "do no 






harm"; professional counselors must avoid behaviors that place clients at risk or 
could potentially cause harm. Beneficence involves contributing to the positive wel- 
fare of clients and their growth. Justice means treating each client according to what 
is best for that client: fair treatment and consideration of each client. The last principle
is fidelity, which refers to honoring commitments and establishing an accepting
relationship in which the client can trust the professional counselor. These moral 
principles are critically important in the field of assessment to ensure that clients re- 
ceive professional and appropriate services that are in their best interest. 

There are a number of codes of ethical standards, since different associations 
and divisions within the counseling profession promulgate their own codes. 
However, since all of the ethical standards are based on either the moral principles 
previously discussed or similar common values, the similarities among the codes 
are greater than the differences. These differences usually pertain to workplace set- 
ting. The American Counseling Association's Code of Ethics (2005a) will be used 
as the basis for the discussion that follows here. The Code of Ethics delineates the 
responsibilities of professional counselors toward their clients, their colleagues, the 
workplace, and themselves. It is divided into eight sections: The Counseling Relationship;
Confidentiality, Privileged Communication, and Privacy; Professional
Responsibility; Relationships With Other Professionals; Evaluation, Assessment, 
and Interpretation; Supervision, Training, and Teaching; Research and Publication; 
and Resolving Ethical Issues. Section E: Evaluation, Assessment, and Interpretation 
is reviewed below. 

Section E: Evaluation, Assessment, and Interpretation covers standards related to 
the assessment of clients, the counselor's skills, and appropriateness of assessment, 
including: general appraisal issues, competence to use and interpret tests, informed 
consent for appraisal, releasing information, proper diagnosis of mental disorders, 
test selection, conditions of test administration, diversity in testing, test scoring 
and interpretation, test security, obsolete tests and outdated test results, and test 
construction. 

Each subsection delineated below in italics is quoted from the ACA Code of 
Ethics (2005a) and accompanied by commentary. 

Section E: Evaluation, Assessment, and Interpretation 

Introduction. Counselors use assessment instruments as one component of the coun- 
seling process, taking into account the client personal and cultural context. 
Counselors promote the well-being of individual clients or groups of clients by devel- 
oping and using appropriate educational, psychological, and career assessment in- 
struments. 

E.1. General 

E.1.a. Assessment. The primary purpose of educational, psychological, and career 
assessment is to provide measurements that are valid and reliable in either 
comparative or absolute terms. These include, but are not limited to, measurements of 
ability, personality, interest, intelligence, achievement, and performance. Counselors 
recognize the need to interpret the statements in this section as applying to both 
quantitative and qualitative assessments. 



E.1.b. Client Welfare. Counselors do not misuse assessment results and 
interpretations, and they take reasonable steps to prevent others from misusing the information 
these techniques provide. They respect the client's right to know the results, the inter- 
pretations made, and the bases for counselors' conclusions and recommendations. 

It is the responsibility of the professional counselor to use assessment techniques 
and results appropriately and to ensure that others do as well. As mentioned in the 
discussion of the moral principles underlying the ethical standards, professional 
counselors must operate in the best interest of the client. Salvia and Ysseldyke (2004, 
p. 58) go further and state that "those who assess . . . must accept responsibility for 
the consequences of their work, and they must make every effort to make certain 
their services are used appropriately." In so doing, professional counselors use instru- 
ments that will yield reliable and valid scores so that decisions made using these in- 
struments will benefit clients. 

E.2. Competence to Use and Interpret Assessment Instruments 

E.2.a. Limits of Competence. Counselors utilize only those testing and assessment 
services for which they have been trained and are competent. Counselors using tech- 
nology-assisted test interpretations are trained in the construct being measured and 
the specific instrument being used prior to using its technology-based application. 
Counselors take reasonable measures to ensure the proper use of psychological and ca- 
reer assessment techniques by persons under their supervision. . . . 

E.2.b. Appropriate Use. Counselors are responsible for the appropriate application, 
scoring, interpretation, and use of assessment instruments relevant to the needs of the 
client, whether they score and interpret such assessments themselves or use technology 
or other services. 

E.2.c. Decisions Based on Results. Counselors responsible for decisions involving in- 
dividuals or policies that are based on assessment results have a thorough understand- 
ing of educational, psychological, and career measurement, including validation cri- 
teria, assessment research, and guidelines for assessment development and use. 

Professional associations, employers, test publishers, and test users have put safe- 
guards in place to ensure the qualifications of professionals using assessments. A 
number of guidelines and resources have been developed to assist professional 
counselors in this area, including the RUST-3 statement (AACE, 2003a) and the 
Standards for Educational and Psychological Testing (AERA et al., 1999). These guide- 
lines and resources are discussed in other chapters of this book. However, the respon- 
sibility for appropriate use and interpretation of assessments lies with the profes- 
sional counselor. Professional counselors should conduct a thorough search to ensure 
that the instrument or assessments selected are appropriate for the client, the in- 
tended purpose, and the information needed (Wickwire, 2002). Additionally, the 
professional counselor must be trained in the assessment procedure and qualified to 
conduct the assessment. Often students take assessment classes in graduate school 
but gain little additional training during their careers. Thorndike (1997) suggested 
that the assessor withdraw from the process if insufficiently trained to provide the 
quality of services and expertise required. Professional counselors have an obligation 
to maintain or increase their expertise in the area of assessment if they are going to 
conduct assessment activities. 

Standard E.2 mandates that professional counselors receive periodic training 
and retraining on assessments used. Just as important, simply knowing how to ad- 
minister and score a test does not satisfy this requirement. Professional counselors 
endeavor to know as much as possible about the construct or content under study, 
including the test psychometrics, purposes for which the test has been validated, and 
other research related to the test's use. 

Professional counselors are highly trained and ensure that those under their su- 
pervision are trained to use assessments for intended purposes. When supervisees or 
employees under a counselor's supervision behave unethically, it is the supervising 
professional counselor who bears responsibility for their misactions. 

E.3. Informed Consent in Assessment 

E.3.a. Explanation to Clients. Prior to assessment, counselors explain the nature 
and purposes of assessment and the specific use of results by potential recipients. The 
explanation will be given in the language of the client (or other legally authorized 
person on behalf of the client), unless an explicit exception has been agreed upon in 
advance. Counselors consider the client's personal or cultural context, the level of the 
client's understanding of the results, and the impact of the results on the client. . . . 

E.3.b. Recipients of Results. Counselors consider the examinee's welfare, explicit 
understandings, and prior agreements in determining who receives the assessment 
results. Counselors include accurate and appropriate interpretations with any release of 
individual or group assessment results. . . . 

E.4. Release of Data to Qualified Professionals 

Counselors release assessment data in which the client is identified only with the con- 
sent of the client or the client's legal representative. Such data are released only to per- 
sons recognized by counselors as qualified to interpret the data. . . . 

Informed consent implies that the person granting permission understands ex- 
actly what assessments will be conducted, why the assessments are being conducted, 
what will happen to the results, and who will be given the results. Confidentiality is 
the cornerstone of counseling and is critical to the area of assessment, particularly 
when the assessment concerns very personal questions or asks for sensitive informa- 
tion. Frequently, permission to conduct assessments requires signed, informed con- 
sent from either the client or, in the case of a minor child, the parent or legal 
guardian. The legitimacy of informed consent rests upon three essential facets: 
capacity, comprehension, and voluntariness. Capacity refers to the right one holds to 
consent. For example, precious few circumstances exist that would allow a 9-year-old 
boy the right to consent to anything. This is because in the United States, the par- 
ent or legal guardian almost always holds this right. Likewise, someone who has 
mental retardation or is mentally disabled may not have the ability to consent. 
Comprehension means the consenter understands the implications of consent. If the 
evaluator cannot communicate the purpose of the assessment in a language or terms 
the client can understand, consent cannot be obtained. Voluntariness means the as- 
sessment involves no coercion or duress. As with any ethically conducted research 
study, a client has the right to withdraw from an assessment at any time. 

The Family Educational Rights and Privacy Act of 1974 (FERPA) and subse- 
quent amendments govern student records in schools and universities. FERPA man- 
dates that only those persons with a legitimate educational interest have the right to 
access a student's records, including assessment information, and that psychological 
evaluations and some other assessments and surveys require signed, informed con- 
sent. In school settings, it may be clearer who has a legitimate need to access a stu- 
dent's assessment results, but there may also be more professionals involved due to 
the number of support staff and teams operating within schools. Professional coun- 
selors should ensure that the persons with whom assessment results are shared, 
whether in the clinic or at school team meetings, have a legitimate need to know the 
results and are fully capable of understanding the results. Professional counselors 
must also safeguard the maintenance of assessment protocols and results. Under nor- 
mal circumstances, protocols and raw interview data are released only with client 
permission and only to professionals who can understand and use the information 
to make decisions in the best interest of the client. 

The same limits to confidentiality that exist within the counseling relationship 
also exist within the assessment area unless informed consent is provided. The client 
(or parent or guardian of a minor) always has the right to request in writing that in- 
formation be shared. Professional counselors must be aware that assessment infor- 
mation is subject to court orders and subpoenas and duty-to-warn situations. In ad- 
dition, sharing information with third parties (e.g., insurance companies); allowing 
clerks, secretaries, and other personnel to handle assessment information; and con- 
sultation are all legitimate limitations to confidentiality. 



Think About It 2.2 What makes confidentiality and informed consent 
such important aspects of assessment? 



E.5. Diagnosis of Mental Disorders 

E.5.a. Proper Diagnosis. Counselors take special care to provide proper diagnosis of 
mental disorders. Assessment techniques (including personal interview) used to deter- 
mine client care (e.g., locus of treatment, type of treatment, or recommended follow- 
up) are carefully selected and appropriately used. 

E.5.b. Cultural Sensitivity. Counselors recognize that culture affects the manner in 
which clients' problems are defined. Clients' socioeconomic and cultural experiences 
are considered when diagnosing mental disorders. . . . 

E.5.c. Historical and Social Prejudices in the Diagnosis of Pathology. Counselors 
recognize historical and social prejudices in the misdiagnosis and pathologizing of 
certain individuals and groups and the role of mental health professionals in perpet- 
uating these prejudices through diagnosis and treatment. 

E.5.d. Refraining from Diagnosis. Counselors may refrain from making and/or re- 
porting a diagnosis if they believe it would cause harm to the client or others. 



Standard 8.8 of the Standards for Educational and Psychological Testing (AERA et 
al., 1999) advises that the least stigmatizing label should always be assigned when re- 
porting test results. This does not mean that a less serious code is used, but rather the 
diagnosis should be an appropriate one and described precisely. Contextual factors 
(e.g., the client's cultural or socioeconomic experiences) must be considered when 
diagnosing clients because of the significant impact diagnostic labels can have on a 
client's life (Whiston, 2005). In some cases, the diagnostic code drives treatment pro- 
tocols and/or payment for treatment. This factor presents a serious dilemma for 
many practitioners, as the specified number of sessions for one diagnostic code may 
be insufficient to adequately assist the client, while a different code would allow a 
sufficient number of sessions. Still, the Code of Ethics requires that professional coun- 
selors use the proper diagnosis. A great deal of research is currently under way ex- 
ploring the congruence of diagnoses across diverse populations. For example, the 
context of living in a low-socioeconomic inner-city neighborhood may elevate the 
number of criteria for Conduct Disorder the average adolescent male may meet. But 
if these behaviors have become "normative" due to context, is it equitable that the 
diagnosis of Conduct Disorder is made at a substantially increased rate for these 
inner-city youth? Or should a more culture-normative, context-sensitive process be 
pursued? This question is becoming critically important and will likely receive 
tremendous attention in the coming years. 

E.6. Instrument Selection 

E.6.a. Appropriateness of Instruments. Counselors carefully consider the validity, 
reliability, psychometric limitations, and appropriateness of instruments when select- 
ing assessments. 

E.6.b. Referral Information. If a client is referred to a third party for assessment, 
the counselor provides specific referral questions and sufficient objective data about 
the client to ensure that appropriate assessment instruments are utilized. . . . 

E.6.c. Culturally Diverse Populations. Counselors are cautious when selecting as- 
sessments for culturally diverse populations to avoid the use of instruments that lack 
appropriate psychometric properties for the client population. . . . 

Professional counselors should choose assessments that are the most appropriate 
for the targeted purpose of the assessment and for the clients they are assessing 
(Anastasi & Urbina, 1997). Doing so may involve a thorough search and evaluation 
of potential assessment instruments. According to Wickwire (2002), this step is es- 
sential, as the "professional is seeking an appropriate and workable fit, with the high- 
est quality and greatest benefit" (p. 8). The implication of "fit" for clients from di- 
verse populations is particularly important. Professional counselors must explore 
each instrument's psychometric properties and ensure its appropriateness and use- 
fulness for clients from diverse cultures. 

E.7. Conditions of Assessment Administration . . . 

E.7.a. Administration Conditions. Counselors administer assessments under the 
same conditions that were established in their standardization. When assessments are 
not administered under standard conditions, as may be necessary to accommodate 
clients with disabilities, or when unusual behavior or irregularities occur during the 
administration, those conditions are noted in interpretation, and the results may be 
designated as invalid or of questionable validity. 

E.7.b. Technological Administration. Counselors ensure that administration 
programs function properly and provide clients with accurate results when technological 
or other electronic methods are used for assessment administration. 

E.7.c. Unsupervised Assessments. Unless the assessment instrument is designed, in- 
tended, and validated for self-administration and/or scoring, counselors do not per- 
mit inadequately supervised use. 

E.7.d. Disclosure of Favorable Conditions. Prior to test administration of assess- 
ments, conditions that produce most favorable assessment results are made known to 
the examinee. 

The previous discussion has concerned the need for care in the selection of as- 
sessment tools. Equal care must be taken with the use of these tools and the ad- 
ministration of all assessments in order to achieve the optimal result. Changing the 
way in which assessments are given or the conditions under which they are given 
may negate the usefulness and validity of the results. Professional counselors must 
be sensitive to conditions that may affect assessment performance (Anastasi & 
Urbina, 1997). This awareness is particularly important when some clients are ad- 
vantaged by having access to experiences or information about how to perform bet- 
ter on a test — sometimes referred to as test sophistication. Certainly an individual 
who takes a standardized test and has had multiple exposures to sample test ques- 
tions and the "bubble" response format (i.e., penciling in answers on a machine- 
scored form) will have advantages over someone who doesn't know what to expect 
or how to respond appropriately ahead of time. Professional counselors seek to 
"level the playing field" by ensuring that all students have requisite information 
and skills. 

E.8. Multicultural Issues/Diversity in Assessment 

Counselors use with caution assessment techniques that were normed on populations 
other than that of the client. Counselors recognize the effects of age, color, culture, 
disability, ethnic group, gender, race, language preference, religion, spirituality, sex- 
ual orientation, and socioeconomic status on test administration and interpretation, 
and place test results in proper perspective with other relevant factors. . . . 

According to recent projections, the United States population will 
approach 50% non-White by the year 2050. Communities and schools are becoming 
increasingly diverse. In some schools, the number of different languages spoken 
exceeds 150. This increasing diversity poses serious concerns for assessment if 
professional counselors are to behave ethically. Diversity concerns are discussed in depth 
later in this chapter. For now, it is important to understand that it is the burden of 
test authors to demonstrate that the test scores are not affected by diverse examinee 
characteristics. In the absence of a declarative statement by test authors in this re- 
gard, the examiner should assume that cultural differences may exist and approach 
use of the test with culturally diverse clients with caution. 



E.9. Scoring and Interpretation of Assessments 

E.9.a. Reporting. In reporting assessment results, counselors indicate reservations 
that exist regarding validity or reliability due to the circumstances of the assessment 
or the inappropriateness of the norms for the person tested. 

E.9.b. Research Instruments. Counselors exercise caution when interpreting the re- 
sults of research instruments not having sufficient technical data to support respon- 
dent results. The specific purposes for the use of such instruments are stated explicitly 
to the examinee. 

E.9.c. Assessment Services. Counselors who provide assessment scoring and inter- 
pretation services to support the assessment process confirm the validity of such in- 
terpretations. They accurately describe the purpose, norms, validity, reliability, and 
applications of the procedures and any special qualifications applicable to their use. 
The public offering of an automated test interpretations service is considered a 
professional-to-professional consultation. The formal responsibility of the consul- 
tant is to the consultee, but the ultimate and overriding responsibility is to the 
client. . . . 

Professional counselors are ultimately responsible for the accuracy of the as- 
sessment results and must make every effort to ensure that their services are used 
appropriately (Salvia & Ysseldyke, 2004) and that the best interest of the client is 
served. This is equally true when using computerized interpretive programs. While 
information derived from an interpretive report is often accurate and helpful, pro- 
fessional counselors realize that these interpretations are based on statistical mod- 
els and that the software author has never met the client. Thus, as is always the 
case, professional counselors validate and supplement all scores and interpretation 
with additional information from multiple sources before making decisions that 
affect clients' lives. 

Also, while professional counselors strive to administer tests exactly as specified, 
mistakes and outside interference do occur. Professional counselors document these 
circumstances and consider them when interpreting test scores. If the circumstances 
are serious enough to invalidate the test scores, professional counselors state such and 
then do not use the invalid scores to describe client performance or make decisions affect- 
ing a client's life. If the professional counselor has any reservations about the assess- 
ment results, it is the responsibility of the counselor to communicate those reserva- 
tions to the client and/or other appropriate parties, such as parents. The professional 
counselor must ensure that accurate and appropriate interpretations accompany the 
dissemination of any assessment results so that the recipients of the information are 
clear as to what the results actually are. 

E.10. Assessment Security 

Counselors maintain the integrity and security of tests and other assessment tech- 
niques consistent with legal and contractual obligations. Counselors do not appro- 
priate, reproduce, or modify published assessments or parts thereof without acknowl- 
edgment and permission from the publisher. 



E.11. Obsolete Assessments and Outdated Results 

Counselors do not use data or results from assessments that are obsolete or outdated 
for the current purpose. Counselors make every effort to prevent the misuse of obso- 
lete measures and assessment data by others. 

E.12. Assessment Construction 

Counselors use established scientific procedures, relevant standards, and current pro- 
fessional knowledge for assessment design in the development, publication, and uti- 
lization of educational and psychological assessment techniques. 

Professional counselors must preserve the integrity of the assessments and the 
accompanying protocols. Testing materials should be stored in a locked facility to 
prevent theft or misuse by unauthorized individuals. All published tests are copy- 
right protected and cannot be photocopied for use with clients. Tests are very expen- 
sive to develop, norm, and print. Development of these products is done through fi- 
nancial risks by authors and publishers. For those professional counselors who are 
involved with the development of assessments, it is important to adhere to current 
scientific standards and methodology. Among numerous sources, the RUST-3 state- 
ment (AACE, 2003a) and the Standards for Educational and Psychological Testing 
(AERA et al., 1999) are important to consult when developing tests. 

If the assessment information is outdated, professional counselors must take care 
with its use, as the validity and usefulness of the information may be questionable. 
In brief, professional counselors should discontinue use of older versions of tests, and 
cease using them to make client decisions. However, it is not always easy to make 
this call. Previous versions of tests often have a rich research base and numerous stud- 
ies exploring psychometric integrity. Also, it is, unfortunately, not unusual for new 
norms and new test manuals to have errors. Thus it is often prudent to phase in use 
of new instruments and to use the new instrument exclusively once its quality has 
been established. 

E.13. Forensic Evaluation: Evaluation for Legal Proceedings 

E.13.a. Primary Obligations. When providing forensic evaluations, the primary 
obligation of counselors is to produce objective findings that can be substantiated 
based on information and techniques appropriate to the evaluation, which may 
include examination of the individual and/or review of records. Counselors are entitled 
to form professional opinions based on their professional knowledge and expertise that 
can be supported by the data gathered in evaluations. Counselors will define the lim- 
its of their reports or testimony, especially when an examination of the individual 
has not been conducted. 

E.13.b. Consent for Evaluation. Individuals being evaluated are informed in 
writing that the relationship is for purposes of an evaluation and is not counseling in 
nature, and entities or individuals who will receive the evaluation report are 
identified. Written consent to be evaluated is obtained from those being evaluated unless a 
court orders evaluations to be conducted without the written consent of individuals 
being evaluated. When children or vulnerable adults are being evaluated, informed 
written consent is obtained from a parent or guardian. 

E.13.c. Client Evaluation Prohibited. Counselors do not evaluate individuals for 
forensic purposes they currently counsel or individuals they have counseled in the past. 
Counselors do not accept as counseling clients individuals they are evaluating or in- 
dividuals they have evaluated in the past for forensic purposes. 

E.13.d. Avoid Potentially Harmful Relationships. Counselors who provide foren- 
sic evaluations avoid potentially harmful professional or personal relationships with 
family members, romantic partners, and close friends of individuals they are evalu- 
ating or have evaluated in the past. 

Forensic evaluation and court testimony is a burgeoning specialty within coun- 
seling, psychology, and psychiatry. The standard regarding avoidance of potentially 
harmful relationships is a new addition to the 2005 Code of Ethics and seeks to make 
sure that professional counselors understand the importance of making inferences 
based on firsthand knowledge of the client, rather than speculation or generalities. 
Professional counselors can expect much more attention to this area of study in the 
future because of the increasing need of courts, lawyers, and those accused of crimes 
to have mental health experts provide testimony regarding psychological status. Also, 
this is another issue that psychological boards across the country are pursuing in 
order to attempt to limit the scope of professional counselors' practice. 

Source: Section E of the ACA Code of Ethics and Standards of Practice has been 
reprinted with permission. No further reproduction is authorized without 
written permission from the American Counseling Association. 



Think About It 2.3 When assessments are conducted with clients and 
students, it is essential that results be used correctly. What are some conse- 
quences of inappropriate use? How could these problems be resolved? 



While the ACA Code of Ethics is helpful in describing ethical test use, the reader 
is again referred to the RUST-3 statement for a comprehensive and explanatory trea- 
tise of responsible, professional test use. Assessment information, used in conjunc- 
tion with other sources of information about the client, can be extremely useful in 
working with clients. As can be seen from this discussion, it is critically important for 
professional counselors to practice ethically in order to do no harm. But what should 
a professional counselor do if unsure of the correct ethical course of action? For an- 
swers, we now turn to a brief discussion of ethical decision making as applied to as- 
sessment issues. 



ETHICAL DECISION MAKING 



One of the greatest professional challenges facing most counselors is ethical behav- 
ior — that is, determining the ethically appropriate course of action in any situation. 
Professional counselors must also be acutely aware of the behavior of their colleagues 
and have a responsibility to act if a colleague is behaving in an unethical manner. To 
assist professional counselors with these issues, the ACA's Ethics Committee devel- 
oped the Practitioner's Guide to Ethical Decision Making (Forester-Miller & Davis, 
1996), which delineates a seven-step model for working through ethical dilemmas: 

1. Identify the problem. One should gather all relevant information and determine 
whether the problem is an ethical issue or a legal, practice, or other issue. If it is 
an ethical issue, continue with the process. 

2. Apply the ACA Code of Ethics (2005a). Determine which section of the ACA 
Code of Ethics addresses the issue most directly. The relevant section may outline 
the course of action to follow. If the answer is not indicated, then one should 
proceed to the next step of the model. 

3. Determine the nature and dimensions of the dilemma. Forester-Miller and Davis 
suggested that professional counselors should consider the moral principles that 
underlie the Code of Ethics for direction, current research, and consultation to 
determine an appropriate course of action. 

4. Generate potential courses of action. Professional counselors should consult at least 
one colleague to ensure that all potential courses of action are identified. 

5. Consider the potential consequences of all options and determine a course of action. 
The impact of potential consequences on the client, professional counselor, and 
others should be considered in determining which option is optimal for address- 
ing the dilemma. 

6. Evaluate the selected course of action. Evaluate the selected course of action to en- 
sure that implementing that choice will not create new or additional ethical 
dilemmas. 

7. Implement the course of action. The professional counselor should implement the 
selected course and follow up to ensure that the selected action had the desired 
outcome. 

The following scenario highlights the use of the ethical decision-making model 
in practice for an assessment-related issue. The Student Services Team (SST) at 
Happy Days Middle School meets once a month to discuss students who are expe- 
riencing problems that are interfering with their ability to be successful academically 
or socially in school. Ms. Jones is a licensed professional counselor who works in the 
school-based mental health center and routinely attends the SST meetings as a team 
member. A student new to the school who was experiencing both academic and so- 
cial difficulty was referred for assessment. At the meeting the next month, the results 
of the student's assessment were presented and discussed. Ms. Jones reviewed the as- 
sessment results and had a number of concerns. In particular, she questioned 
whether the assessments used were appropriate for the student, wanted to know why 
an older version of the WISC had been used, and also questioned whether the per- 
son administering the assessments (the learning disabilities teacher) was qualified to 
do so. When she tried to raise these issues, the SST members ignored her concerns 
and agreed to change the student's program based on the assessment results and an- 
ecdotal information. 

Ms. Jones believed that this situation was an ethical dilemma and therefore used 
the ethical decision-making model. She first identified the problem and then applied 
the ACA Code of Ethics. In this case, she identified three problems and the applica- 
ble sections of the Code of Ethics: the use of obsolete and inappropriate assessment in- 
struments (E.6.a), the competence of the person administering the assessments 
(E.2.a), and the use of the assessment results in placement (E.2.b). To determine the 
nature and dimensions of the issue, she went to her supervisor to discuss her con- 
cerns. Since she is not employed by the school system, she wanted to make sure that 
she was considering all facets of the situation and recognized that perhaps there were 
processes in the schools she did not understand. 

Ms. Jones concluded that the problems she had identified were ethical dilemmas
in this case and suspected that they might exist in other cases as well. She then
determined possible courses of action. Ms. Jones's supervisor identified a supervisor
in the school system with whom Ms. Jones could discuss her concerns. Ms. Jones
also thought about going back to the team to discuss her concerns again, and
about talking with the person who performed the assessment to determine why these
particular assessments were used and what credentials the assessor held. After
considering all options and their potential consequences, Ms. Jones chose to speak to the
assessor. She felt this was particularly important since the Code of Ethics also indi- 
cates that if one is concerned about the ethical behavior of a colleague, the first step 
is to discuss the concern directly with the colleague, even one who is not a counselor 
bound to uphold the ACA Code of Ethics. 

Through Ms. Jones's discussion with the assessor, it became clear to her that the 
assessor lacked the experience and training to conduct an assessment using current 
tools and that the school system had not purchased current versions of assessments 
and had not provided appropriate professional development for the staff. Ms. Jones 
then went to the school system supervisor to discuss her concerns. As a result of this 
discussion, the school system recognized the need to change some of its practices, 
and the assessments for the student in question were redone by a qualified assessor 
using current tests. 

As Ms. Jones discovered, professional counselors must continually review their 
behavior and that of their colleagues to ensure that the best interests of the client al- 
ways come first, that their practice reflects current best practices, that they use and/or 
interpret only those assessments for which they are trained, and that the assessments 
chosen are appropriate for the client and the intended purpose. 



LEGAL ISSUES IN ASSESSMENT 



While ethical issues in assessment are important, professional counselors must be 
even more aware of important legal rulings. Ethical codes represent high standards 
of professional practice; however, laws must be followed, even if they conflict with 
ethical standards. Both federal and state legislatures enact legislation that impacts the 
way professional counselors must practice. Local boards of education, state and local 
agencies, and other organizations also implement regulations and policies that im- 
pact counseling practice. While not the same as laws, regulations and policies gov- 
ern the practices of the professionals to whom they pertain. For example, a licensed 
professional counselor (LPC) who violates a state regulation can be cited or even
sanctioned. A professional school counselor who violates a school board policy can 
be reprimanded or even terminated for cause. These steps can be taken because pro- 
fessionals who are licensed, certified, or employed are frequently required to abide by 
such regulations as a condition of licensure, certification, or employment. 

While the purpose of laws is not specifically to direct or limit assessment, they 
have been enacted to protect the rights of clients, students, parents, and employees, 
and therefore influence how assessment may or must be conducted. Case law is the 
result of litigation or court cases and often does direct how professional counselors 
must practice. Professional counselors need to keep current with legislation and 
court cases, as this is an ever-changing area. Some of the major legal issues affecting 
assessment are reviewed in the rest of this section. 



The Family Educational Rights and Privacy Act of 1974 (FERPA) 
and Related Legislation 



Prior to the 1970s, educators and researchers frequently conducted assessments with- 
out parental consent and often stored these assessments in student files. In addition, 
access to student files was virtually unlimited; a simple request to the principal was 
often enough to get access to a student's files by entities, professionals, and employ- 
ers outside of a school system. The Family Educational Rights and Privacy Act of 
1974 (FERPA) is the federal law that protects the privacy of all student records in 
schools and institutions of higher learning. Often referred to as the Buckley 
Amendment, this law has several provisions and applies to all pre-K-12 and postsec- 
ondary institutions that receive federal funding from the U.S. Department of 
Education for any program. Nonpublic schools that do not accept federal funding 
are exempt from these regulations. 

FERPA defines education records as all information a school collects for atten- 
dance, achievement, group and individual testing and assessment, behavior, and 
school activities. FERPA gives parents specific rights regarding this information. The 
first provision is that parents have the right to inspect and review their children's 
records. Each school system must annually send a notice to parents detailing this re- 
view process and the procedure for filing a complaint if they disagree with anything 
in the record. The school system has 45 days in which to comply with the parents' 
request to review the record and faces penalties, including the loss of all applicable 
federal funding, for failure to comply. Second, the law limits who may access records. 
Under FERPA, only those persons with a "legitimate educational interest" can ac- 
cess a student's record. Some personally identifiable information may be released 
without parental consent. This information is usually referred to as directory informa- 
tion, or public information, and generally includes such material as the student's 
name, address, telephone number, date and place of birth, honors and awards, and 
attendance records. The major exemption to the confidentiality of student records 
relates to law enforcement issues. The school must comply with a judicial order or 
lawfully executed subpoena. In cases of emergency, information about the student 
relevant to the emergency can be released without parental consent (see
www.ed.gov/print/policy/gen/guid/fpco/ferpa/index.html for details). All states and local
jurisdictions have incorporated FERPA's requirements into state statutes and local 
policies with some degree of variance among specifics, such as directory information. 
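
The 45-day review window described above comes down to simple date arithmetic.
The following is a minimal sketch only: the function name is hypothetical, and it
assumes the window is counted in calendar days from the date the request is
received, which a school system should confirm against its own policy.

```python
from datetime import date, timedelta

# Window stated in the text; counting in calendar days is an assumption.
FERPA_RESPONSE_DAYS = 45


def review_deadline(request_received: date) -> date:
    """Return the latest date by which the record review must be granted."""
    return request_received + timedelta(days=FERPA_RESPONSE_DAYS)


if __name__ == "__main__":
    # Hypothetical request date, used only for illustration.
    print(review_deadline(date(2007, 3, 1)))
```

A records office could use a helper like this to flag requests that are
approaching the limit; it says nothing about how local policy defines the
start of the window.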

The rights of consent transfer to students upon their 18th birthday. The law 
does not specifically limit the rights of parents whose children are over the age of 18 
and continue to attend a secondary school (i.e., high school). The law also does not 
specifically limit parental rights for a student who attends a postsecondary institu- 
tion but is older than 18 years, although most institutions of higher learning adhere 
to a policy of informed consent for a student who is 18 years or older. Noncustodial 
parents have the same rights as custodial parents, unless a court order has limited or 
terminated the rights of one or both parents. Stepparents and other family members 
who do not have legal custody of the child have no rights under FERPA without 
court-appointed authority. 

The Protection of Pupil Rights Amendment of 1978 (PPRA), often referred to 
as the Hatch Amendment or the Grassley Amendment, for the members of Congress 
who introduced it, gives parents additional rights with regard to surveying minor 
students. PPRA does not apply to postsecondary schools. If the survey is funded with 
federal money, informed consent must be obtained for all participating students if 
students are required to take the survey and if questions about particular personal 
areas are asked. PPRA also requires informed parent consent for any psychological, 
psychiatric, or medical examination, testing, or treatment of students or any school 
program designed to affect the personal values or behavior of students. PPRA also 
gives parents the right to review instructional materials in experimental programs. 

The No Child Left Behind Act of 2001 includes several changes to FERPA and 
PPRA (see www.ed.gov/about/offices/list/index.html for specific details). The 
changes apply to surveys funded in whole or part by any program administered by 
the U.S. Department of Education (USDE). PPRA (20 U.S.C. 1232h) requires that 
schools and contractors make instructional materials available for review by parents 
of participating students if those materials will be used in any USDE-funded survey, 
analysis, or evaluation and that schools and contractors obtain written parent con- 
sent prior to the participation of minor children in any USDE-funded survey, analy- 
sis, or evaluation if information in any of the following areas would be revealed: 

■ Political affiliations or beliefs of the student or parent 

■ Mental and psychological problems of the student or family 

■ Sex behavior or attitudes 

■ Illegal, antisocial, self-incriminating, or demeaning behavior 

■ Critical appraisals of other individuals with whom respondents have close fam- 
ily relationships 

■ Legally recognized privileged or analogous relationships, such as those of lawyers, 
physicians, and ministers 

■ Religious practices, affiliations, or beliefs of the student or the student's parent

■ Income other than such information required to determine eligibility/participa- 
tion in a program 
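
The written-consent rule summarized above can be sketched as a simple check.
This is an illustration under stated assumptions, not a statement of the law:
the area labels, the Survey class, and the function name are all hypothetical,
and the logic covers only the USDE-funded case described in the preceding
paragraph.

```python
from dataclasses import dataclass

# Shorthand labels (hypothetical) for the eight protected areas listed above.
PROTECTED_AREAS = frozenset({
    "political_beliefs",
    "mental_or_psychological_problems",
    "sex_behavior_or_attitudes",
    "illegal_or_demeaning_behavior",
    "critical_appraisals_of_family",
    "privileged_relationships",
    "religious_beliefs",
    "income",
})


@dataclass
class Survey:
    usde_funded: bool                       # funded in whole or part by USDE
    areas_touched: frozenset = frozenset()  # labels of areas the survey asks about


def needs_written_parent_consent(survey: Survey, student_is_minor: bool) -> bool:
    """True when the rule as summarized in the text calls for prior written consent."""
    touches_protected = bool(survey.areas_touched & PROTECTED_AREAS)
    return survey.usde_funded and student_is_minor and touches_protected


if __name__ == "__main__":
    climate_survey = Survey(usde_funded=True, areas_touched=frozenset({"income"}))
    print(needs_written_parent_consent(climate_survey, student_is_minor=True))
```

A check like this is only a first screen. District policy and the opt-out
provisions may add further requirements.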

These new provisions of PPRA also apply to any survey that is not funded in any 
way with USDE money. Under these provisions, parents have the right to inspect,
upon request, any survey or instructional materials used as part of the curriculum
created by a third party if one or more of the eight above-outlined areas are involved.
Parents also have the right to inspect any instrument used to collect personal informa- 
tion from students for marketing or selling. Parents may opt their child out of this 
data collection process or any survey involving one or more of the eight above- 
delineated areas. PPRA does not apply to any survey that is administered as part of the 
Individuals with Disabilities Education Improvement Act of 2004 (IDEIA). 

As can be ascertained from the explanation of FERPA and PPRA, there are 
many constraints on assessment, testing, and surveys in public schools. As each
school district may further define policies involving this legislation, it is critical for 
professional counselors to become familiar with what types of assessments fall under 
these regulations, how the assessment results may be used or disseminated, and to 
whom. It has become increasingly difficult for professional counselors to give any 
type of formal or informal assessment to students without informed parent consent. 
And other school mental health professionals, such as school psychologists, may have 
even more restrictions placed on their ability to conduct any form of assessment 
without signed, informed parent consent. One assessment issue that is becoming 
more problematic in schools concerns the desire of parents to review the actual pro- 
tocol used after their child has completed the assessment. The problem revolves 
around the issue of whether the actual assessment forms become part of the educa- 
tional record or just the results. Most professional associations believe that the actual 
protocol is not part of the record and that parents usually lack the training to com- 
pletely understand the assessment tools. 

FERPA, PPRA, NCLB, and related legislation all have provisions aimed at
protecting school-aged children and their parents from the collection of
information that violates students' privacy. Additional provisions have also
been put in place to protect the rights of handicapped citizens; these provisions are 
discussed in the section on IDEIA later in this chapter. 



Minimal Competency Assessment and 
the No Child Left Behind Act of 2001 



"High-stakes" testing has been used in education for years, starting with the initial 
premise that all students should master the basics of a curriculum before being 
granted a diploma. Such a premise has tremendous support among adults in the 
United States, but establishing minimal competency for graduation has a controver- 
sial sociopolitical dimension. 

In the 1970s, many states began to develop minimal competency tests as a
requirement for graduation. Debra P. v. Turlington (1979) challenged the Florida
State Assessment Test in federal court. Lawyers for 10 African American
students who had been denied diplomas on the basis of their failure to pass the state 
assessment examination argued that the test was discriminatory because the students 
had been educated in a segregated system and had not acquired the skills that would 
have allowed them to pass the test. The judge ruled that the test was not discrimina- 
tory but did suspend its use for four years and directed that the school must show the
assessment covered only information taught. While the intent behind minimal com- 
petency assessment was noble, educators and legislators soon realized that such a sys- 
tem revolved around low expectations rather than a striving for higher standards. 

The discussions of higher-standards-based education led to implementation of 
the No Child Left Behind Act of 2001 and its requirements for high-stakes testing 
and accountability. A high-stakes test is any test that results in a decision about a stu- 
dent or school that can change a student's or school's status (e.g., graduation from 
high school; admittance into a college; and a school that comes under State 
Department of Education oversight for poor performance). Almost all states now re- 
quire students to pass tests as part of high school graduation requirements. In addi- 
tion, students are assessed at identified grade levels from 3rd grade through high 
school to meet the requirements of the No Child Left Behind Act. Both students 
and schools are feeling the increased pressure to perform well on the assessments, lest 
the school fail to meet adequate yearly progress for five years in a row and risk being
reconstituted (i.e., being put under external control, leading to the possible replace- 
ment of administration, staff, curriculum, etc.). Many laud the intent of ensuring 
that all children learn and achieve to high academic levels. However, many educators 
are also concerned that the focus on assessment competes with the focus on learning. 

Numerous professional organizations have weighed in on the high-stakes testing 
issue. The American Counseling Association (ACA) appointed a Task Force on 
High-Stakes Testing in 2003 and some of the areas considered by this task force are 
particularly noteworthy. In a position statement adopted by the ACA Governing 
Council (ACA, 2005b), the task force recognized the importance of assessment and 
accountability and its relationship to high achievement. (This position statement 
may be found on the companion website for this text, in the chapter on high-stakes 
testing.) High-stakes testing (HST) is one objective means of assessing student per- 
formance, and HST assessments are generally well developed. However, the task 
force specified some important cautions. Using a single test score resulting from a 
group administration of the test to make decisions about individual students has in- 
herent problems; many students are at a disadvantage on HST, and the results may 
not accurately reflect their abilities. The task force points out that special education 
law does not allow decisions to be made about children based on a single test, but the 
accountability provisions of HST do allow this type of decision making. While ac- 
countability remains a major requirement for schools and school systems, it must be 
balanced with providing assessment tools for students that truly assess what they 
should know in a way that maximizes student performance and reflects best prac- 
tices in assessment. 



Individuals With Disabilities Education Improvement Act 
of 2004 (IDEIA) and Related Legislation 



The Education for All Handicapped Children Act, also known as PL 94-142, was
initially enacted in 1975 after a long struggle to equalize the opportunities for
disabled students and to provide opportunities similar to those of their
nonhandicapped peers through a free, appropriate education in the least restrictive
environment. This special education law has been reauthorized several times since its
enactment, renamed the Individuals With Disabilities Education Act (IDEA) in 1990,
and most recently signed by President Bush on December 3, 2004 as the Individuals 
With Disabilities Education Improvement Act (IDEIA). The bill outlines the 
process for referring, assessing, identifying, placing, and instructing students with 
handicapping conditions who warrant additional services under the law. The law re- 
quires that all decisions are made by a multidisciplinary team that includes the par- 
ents, special educator, regular educator, school system representative, and frequently 
the professional school counselor and school psychologist. Parental consent is re- 
quired for assessment and placement activities. The multidisciplinary team makes all 
placement and educational decisions; each eligible child is required to have an 
Individual Education Plan (IEP), which outlines the goals for the child and the serv- 
ices that will be provided. 

Part B, Section 614 (2) (3) of IDEIA outlines the requirements for conducting 
the evaluation to determine if a child has a handicap. It states that the local educa- 
tion agency (i.e., school system) shall 

■ use a variety of assessment tools and strategies to gather relevant functional, de- 
velopmental, and academic information, including information provided by the 
parent, that may assist in determining if the child is a child with a disability and 
the content of the IEP; 

■ not use any single measure or assessment as the sole criterion for determining 
whether a child is a child with a disability or determining an appropriate educa- 
tional program for the child; 

■ use technically sound instruments that may assess the relative contribution
of cognitive and behavioral factors, in addition to physical or developmental fac- 
tors; and 

■ ensure that assessments and other evaluation materials used to assess the child 

■ are selected and administered so as not to be discriminatory on a racial or 
cultural basis; 

■ are provided and administered in the language and form most likely to yield 
accurate information; 

■ are used for purposes for which the assessments or measures are valid and 
reliable; 

■ are administered by trained and knowledgeable personnel; and 

■ are administered in accordance with any instructions provided by the pro- 
ducer of such assessments. 

The above language clearly delineates requirements that are actually best prac- 
tices in assessment and which are discussed earlier in this chapter and in other chap- 
ters of this book. This reauthorization of the law strengthened the development of 
new approaches to determine whether students are learning disabled that are not 
based solely on the IQ discrepancy model (see Chapter 12, Table 12.1). Additionally,
the law focuses on addressing the problem of the over- and misidentification of
linguistic and cultural minority students and directs districts with significant
overrepresentation of minorities to create and operate programs to reduce this problem (see
www.cec.sped.org/law_res/doc/law/index.php for further details).



The Health Insurance Portability and Accountability Act
of 1996 (HIPAA)



Privacy issues of the general citizenry regarding medical and mental health fields are 
of critical importance. The rise of managed care, frequent switching of health insur- 
ance plans by employers, and the sensitive nature of questions frequently asked by 
these entities often lead to privacy concerns. The Health Insurance Portability and 
Accountability Act of 1996 (HIPAA) required that the U.S. Department of Health 
and Human Services (HHS) adopt national standards for the privacy of individually 
identifiable health information, outlined patients' rights, and established criteria for 
access to health records. Included in this law was a provision that HHS must adopt 
national standards for electronic healthcare transactions. In response to this mandate,
regulations named the Privacy Rule were adopted in 2000 and became effective in 2001.
This rule set national standards for the protection of health information as it applied
to health plans, healthcare clearinghouses, and healthcare providers
who conduct transactions electronically. All covered entities had until April 14, 
2003, to comply with the Privacy Rule (see http://www.hhs.gov/ocr/hipaa for fur- 
ther details). 

The HIPAA Privacy Rule has a number of provisions, including giving patients 
the right to obtain and examine a copy of their health records and request correc- 
tions, allowing patients some ability to control the uses and disclosures of their 
health information, allowing patients to know how their information might be used 
and if disclosures have been made, setting limits on the use and release of health 
records, and providing a complaint process. The Privacy Rule also requires that
providers give clients a privacy notice; providers should also obtain a signed
acknowledgement of this notice.

States and health entities continue to work on the details of the implementation 
of HIPAA. Clearly, it has implications for professional counselors, particularly those 
who work in health settings, clinics, agencies, and private practice. Professional 
counselors must be aware of this law and its requirements and ensure that their prac- 
tices are in accordance with its provisions. Importantly, the laws apply whether the 
client is a self-payer or the professional counselor receives payment through insur- 
ance companies or health organizations. Professional counselors should also be sure 
to adhere to HIPAA provisions when client information is shared. 

HIPAA protects health information much the same way FERPA protects stu- 
dent records and information. While the USDE has indicated that FERPA will con- 
tinue to regulate student information in schools, the schools are finding that HIPAA 
has complicated the process. Schools frequently depend on assessments conducted
by nonschool providers who are regulated by HIPAA, particularly for handicapped
students. In past years, the assessments and health information would routinely
become part of the child's educational record. Schools are now finding that
documents are often stamped "do not redisclose" or carry other indications that the
information should not be made a permanent part of the child's educational record and
must be returned to the assessor if the child leaves the school. As healthcare providers
and patients become more aware of the requirements of HIPAA, these issues will 
likely be resolved.

It should be noted that the mandates of HIPAA are consistent with ethical stan-
dards and therefore should not be a barrier to sound professional practice. Signed, 
informed consent; limits to disclosure; and the confidentiality of patient informa- 
tion are all part of the ethical standards and should drive the practice of professional 
counselors. 

Guidelines of the Equal Employment Opportunity Commission (EEOC) 

According to Kaplan and Saccuzzo (2001), the government exercises its power to 
regulate testing largely through interpretations of the 14th Amendment to the 
Constitution, which guarantees all citizens due process and equal protection under 
the law. This is evidenced by the government's actions concerning personnel prac- 
tices, particularly employee testing. Title VII of the Civil Rights Act of 1964 and its 
subsequent amendments created the Equal Employment Opportunity Commission 
(EEOC), whose guidelines outlaw discrimination in employment based on race, 
color, gender, national origin, religion, pregnancy, age 40 and above, or status
as a Vietnam veteran.

The EEOC developed guidelines for the use of tests and assessments in employ- 
ment practices. The commission was particularly interested in any procedures that 
might have an adverse impact on selection and worked to ensure that tests and as- 
sessments were not used to discriminate based on race. It ruled that any assessment 
used as a basis for employment decisions that adversely affected hiring, promotion, 
transfer, or any other activity protected by the law constituted discrimination unless 
the test was validated for the reason it was being used and the person handling the 
personnel matter could not use other procedures (Drummond, 2000). 

Following the Civil Rights Act, a number of U.S. Supreme Court cases chal- 
lenged the concept of adverse impact and refined employment practices. The first 
landmark case was Griggs v. Duke Power Company (1971). The case involved several 
African American employees of the power company who sued because they felt the 
criteria used for promotion (a high school diploma and two tests) were discrimina- 
tory. In this case, and in the subsequent cases of Albemarle Paper Company v. Moody 
(1975) and Washington v. Davis (1976), the U.S. Supreme Court's decisions placed 
the burden of proof on the employer. The decisions indicated that employment tests 
must be valid and reliable, and forced the employers to define how job performance 
relates to test scores (Kaplan & Saccuzzo, 2001). 

The U.S. Supreme Court's 1988 decision in Watson v. Fort Worth Bank and
Trust involved an African American woman who was passed over for promotion
to a supervisory position at the bank. She argued that racial minorities were
underrepresented in selections for higher-level jobs. The court ruled that by adding 
one subjective item to objective tests, employers could protect themselves from dis- 
crimination suits as adverse impact does not apply to subjective criteria. This ruling 
was followed by Wards Cove Packing Company v. Atonio in 1989. This case was filed
by cannery workers at an Alaskan packing company who claimed that the company 
was keeping them out of higher-paying and more skilled jobs. The U.S. Supreme
Court reversed the appellate ruling and remanded the case to the lower court. In so doing,
the Court noted that the burden of proof should be shifted to the plaintiff to demonstrate
that there are problems with selection procedures. This ruling obviously favored em- 
ployers as few employees have the resources and knowledge necessary to prove bias 
in personnel practices. 

As a result of these cases, Congress passed the Civil Rights Act of 1991, which 
incorporated many of the principles of the Griggs v. Duke Power Company case. The 
act placed the burden of proof back on the employer and outlawed differential cut- 
off scores or score adjustments. 



The Americans With Disabilities Act of 1990 (ADA)



Just prior to the enactment of PL 94-142 in 1975 to address the needs of school- 
aged youth with educational handicaps, Congress passed the U.S. Rehabilitation Act 
of 1973. Section 504 of this act contains important provisions for individuals with 
medical or mental disorders (see Chapter 12 for a fuller discussion of this act). Some 
of the implications of the U.S. Rehabilitation Act of 1973 and related legislation in- 
volved the requirements for access by handicapped citizens to ramps and elevators 
in public buildings, as well as handicapped parking spaces and curbs cut to allow 
wheelchair access. These landmark laws were supplemented by important new laws in
the 1990s. The Americans With Disabilities Act (ADA) of 1990 was enacted to remove
barriers to employment, education, and public services for persons with physical and
mental disabilities. The law requires that reasonable accommodations must
be made for persons who are determined to be impaired, including accommodations 
in testing and assessments. The law does not delineate what accommodations are re- 
quired, so they must be determined on a case-by-case basis. There are tremendous 
concerns in the assessment community regarding how to provide accommodations 
and fairly assess individuals with disabilities without compromising the reliability 
and validity of the assessment instruments. Murphy and Davidshofer (2001) sug- 
gested that this issue will occupy test developers for years to come. Of course, the 
implications of the ADA go far beyond assessment. But for now, realize that 
Americans with disabilities are given full protection under the law, and that reason- 
able accommodations must be offered. 



Court Decisions Related to Diversity in Assessment 



Tests have long been used to "sort" and "select." When certain groups are over- or 
underselected for participation in programs, the specters of bias and fairness will cer- 
tainly arise. There have been a number of court cases involving the use of testing in 
education, decisions that have shaped amendments to special education law and 
practice. The first major case that examined the validity of psychological test scores 
was Hobson v. Hansen (1967). Students in the District of Columbia public schools,
which were integrated, were placed in classes based on the results of group ability
tests, which resulted in establishment of a de facto segregation or tracking system. 
Hobson, the parent of two children, sued the school system, arguing that African 
American students were tracked into the basic track while White students were 
placed in the honors and other tracks. A federal district court found that ability
tests that had been developed on White students could not be used to place African 
American students (Kaplan & Saccuzzo, 2001). Current test development standards 
specify that students about whom decisions will be made must be well represented in 
a test's standardization sample. The Hobson v. Hansen case brought this point home 
very clearly. In fact, present-day test developers owe much of their "commonsense" 
procedures to the issues resolved by early pioneers in test development and civil 
rights cases. 

The case of Diana v. State Board of Education concerned the use of intelligence 
tests for bilingual Mexican American students. The plaintiffs argued that bilingual 
Mexican American students were inappropriately placed in classes for the educable 
mentally retarded (EMR) based on tests that failed to take into account their bilin- 
gual status. The students retested in Spanish all scored too high to meet the EMR 
criteria. An out-of-court agreement established that bilingual students would be 
tested in both English and their native language; that placement in EMR classes 
would be based on both test scores and a comprehensive developmental assessment 
of the child; and that tests that emphasize areas that might be unfair to minority chil- 
dren could not be used for placement. Again, today we see this issue as "common 
sense," but in the 1960s and 1970s, almost all tests were published in English. Today,
a much greater diversity of languages is available for testing clients.

Chapter 10 explores the interaction of race and socioeconomic status on intel- 
lectual development, an issue that, however, was not widely studied in the 1960s and 
1970s. The placement of African American students in EMR classes based on IQ 
tests was at the heart of the California case Larry P. v. Riles (1979). The plaintiffs con- 
tended that use of these intelligence tests was invalid for African American students 
and that IQ tests should therefore not be employed for placement purposes. Many 
testing experts testified at the trial, some in support of the validity of IQ tests for 
African American and other children, others in opposition to the use of such tests. 
The judge in the case ruled that the "tests are racially and culturally biased, have a 
discriminatory impact on African American children, and have not been validated 
for the purpose of (consigning) African American children into educationally dead- 
end, isolated, and stigmatizing classes" (Kaplan & Saccuzzo, 2001, p. 580). This de- 
cision was appealed but upheld in 1984. As a result, intelligence tests could not be 
used to place African American students in special education classes. This ban was 
expanded in 1986 to include testing all African American children for special edu- 
cation in California but does not apply to other minority children. Enter the law of 
unintended consequences. Because qualification for special education services re- 
quired assessment of ability (i.e., intelligence), these laws virtually eliminated minor- 
ity students from qualifying for services intended to help them. Some civil rights ad- 
vocates viewed special education as a way to "segregate" minorities within the 
educational system, but the alternative of failing children with disabilities was seen 
as even more egregious. As a result of a subsequent case, Crawford v. Honig, the ban 
on testing African American children was lifted in 1992. Legislation related to test 
bias and discrimination against clients of diverse backgrounds has become a source 
of hot debate within the assessment field, so the final pages of this chapter are dedi- 
cated to an exploration of this essential foundational issue. 






DIVERSITY ISSUES IN ASSESSMENT 



For years, U.S. Census data have indicated that the United States is becoming in- 
creasingly diverse. The U.S. population is multiracial, multiethnic, and multilingual. 
Approximately 7% of this population reported in the 2000 Census having a disabil- 
ity, and 18% reported living below the poverty level (U.S. Census Bureau, 2003). 

These demographics demand that professional counselors be able to work effec- 
tively with clients and students from a multitude of cultures (Constantine, 2001; 
Lee, 2001). Professional counselors are involved in numerous ways in administering 
and ensuring that clients receive appropriate assessment. This section discusses some 
of the basic aspects of diversity in assessment and provides professional counselors 
with practical steps to approaching fairness in assessment in the clinical and school 
settings. 



Understanding Diversity 



Conversations of diversity often focus on race and ethnicity. Race is an anthropolog- 
ical construct based on the classification of physiological characteristics (Gladding, 
2001) and includes a political and socioeconomic dimension related to differences in 
physical appearance (Brace, 1995; Yee, Fairchild, Weizmann, & Wyatt, 1993). 
Ethnicity is the "group classification in which members believe they share a common 
origin and a unique social and cultural heritage such as language or religious belief" 
(Gladding, 2001, p. 45). While important, these two factors alone do not describe 
the extent of diversity that professional counselors face. 

Culture is another important diversity issue. Culture is both complex and mul- 
tidimensional. Professional counselors may recognize several cultures and subcul- 
tures within a population. Although this adds to the complexity of the construct, 
understanding and appreciating culture and its multidimensionality gives profes- 
sional counselors valuable insight into their clients' sense of self, language and com- 
munication patterns, dress, values, beliefs, use of time and space, relationships with 
family and significant others, food, play, work, and use of knowledge (Whitefield, 
McGrath, & Coleman, 1992). Succinctly, culture can be described as the set of "val- 
ues, beliefs, expectations, worldviews, symbols, and appropriate behaviors of a group 
that provide its members with norms, plans, and rules for social living" (Gladding, 
2001, p. 34). 

Diversity also encompasses gender, sexual orientation, language, socioeconomic 
status, ability, and disability. Diversity simply means difference: difference in the 
many aspects and dimensions used to help understand student development and be- 
havior. The professional counselor must therefore appreciate and understand diver- 
sity in all of its manifestations and its implications for assessment. 

For well over 40 years, the counseling profession has been deeply concerned 
about appropriate assessment for clients and students of diverse populations 
(Anastasi & Urbina, 1997; Sattler, 2001; Whiston, 2005). Some of this discussion 
has resulted from legislation and legal proceedings regarding the specific areas of 
multidisciplinary assessment, assessment in a client's native language, assessment 
used for selection purposes, assessment procedures, informed consent, and due rights 
notification (Rogers, 1998; Sattler, 2001). Ethical guidelines also address appropri- 
ate assessment. Beyond global charges to respect diversity and work in the best inter- 
est of students and clients, Section E of the ACA Code of Ethics (2005a) specifically 
addresses diversity in testing. Further direction regarding diversity in assessment, 
however, is delineated in the Association for Assessment in Counseling and 
Education's Standards for Multicultural Assessment (AACE, 2003b). 



Standards for Multicultural Assessment 



Recognizing the importance of multicultural assessment, the Association for 
Assessment in Counseling and Education (AACE) studied and compiled standards 
of many professional organizations. The result was a document outlining 68 compe- 
tencies specific to the assessment and counseling of diverse populations (see 
http://aace.ncat.edu). The competencies cover assessment content and purpose; 
norming, reliability, and validity beyond general standards issues; administration and 
scoring; and interpretation and application of assessment results. Many of the com- 
petencies have significant consequences for professional counselors, psychologists, 
and other diagnosticians involved in psychological assessment and placement 
processes. In addition, professional counselors should be aware of the competencies 
because of their relationship to culturally appropriate counseling and assessment 
services (AACE, 2003b). 



Diversity Factors Involved in Assessment 



Thus far we have outlined the mandate for professional counselors to be aware of 
legal and ethical responsibilities regarding multicultural assessment. Following is a 
more specific discussion of the ways in which diversity factors affect assessment. 

Difference 

Inherent in the concept of diversity is an understanding of difference. Difference 
does not imply better than or worse than. Difference may, however, become ad- 
vantage or disadvantage in the realm of assessment. Imagine that all mental health 
professionals are asked to take an assessment on providing services to clients. 
Professional counselors, psychologists, psychiatrists, family therapists, and social 
workers (among others) gather to take the test. Clearly, each of these groups of pro- 
fessionals differs in training, credentials, experience, and perhaps views of clients. 
No group is better than the other. The groups are different. If the assessment is 
based largely on the Council for Accreditation of Counseling and Related 
Educational Programs (CACREP) (2001) curricular standards using the language 
and orientations of professional counselors, that group may 
have an advantage on the test. Their scores may be somewhat higher than those of 
psychologists, psychiatrists, family therapists, or social workers. This simplistic ex- 
ample demonstrates how cultural difference and test content can interplay. In more 
subtle ways (e.g., test words, pictures, format), test content must be examined to 
ensure that information specific to certain cultures is controlled in assessment 
(Rogers, 1998). 

Worldview 

Worldview is a second factor involved in assessment. Every aspect of counseling oc- 
curs in a cultural context. This includes assessment. As a result of cultural context, 
assessment can be undergirded by cultural worldviews that are unique to a specific 
culture and unfamiliar or offensive to another. Worldview includes beliefs, values, 
perspectives, and perceptions (Whiston, 2005). The rather common practice of 
timed assessment deserves consideration in respect to worldview. In America, speed 
is often valued. Think of Americans' fascination with fast food, microwaveable prod- 
ucts, instant messaging, and turbo-charged cars. In many other cultures, however, 
speed is not valued. Reflection is considered sacred. Given this difference in world- 
view, it is not hard to see why a 4th-grade student new to this country may not score 
well on a timed multiplication test, even though the student may have mastered 
multiplication facts. 

Acculturation and Language 

Acculturation and language are additional diversity factors involved in assessment. 
Acculturation is a change process that occurs when an individual of one culture 
comes in contact with an individual or individuals of another culture. As a result of 
this process, individuals may take on different values, beliefs, and behaviors 
(Drummond, 2004; Fouad & Chan, 1999). The degree and rate of change depend 
on a number of factors, including power dynamics, issues of immersion, and indi- 
vidual personality characteristics (e.g., cultural identity development status, genera- 
tional status). Professional counselors often have the opportunity to work with 
clients, students, and families dealing with various stages of acculturation. It is not 
unusual for a professional school counselor to work with a student who, due to the 
school setting and peer interactions, is bicultural yet lives in a family setting that 
largely maintains the traditions and practices of the student's native culture. A cul- 
turally competent counselor is prepared to recognize and effectively handle the coun- 
seling implications these issues of acculturation may have for students and clients. 

A growing number of Americans have the ability to communicate in more than 
one language, but proficiency in the languages they speak may vary considerably 
(Rogers, 1998). This growing phenomenon affects assessment in many interesting 
and diverse ways. Language is more than words and pronunciation. Language in- 
cludes structure, nuance, denotation, and connotation. These components preclude 
the simple use of translations or unstandardized forms of a test (Fouad & Chan, 
1999; Whiston, 2005). For example, it is inappropriate to have a staff member sim- 
ply translate a test for Spanish-speaking clients. The process of ensuring that the 
"translation" is equivalent to the original test involves sophisticated statistical and 
content analyses that extend beyond the scope of this chapter. It is important to note 
that it is equally inappropriate to assume that a Caribbean student who has just 
moved to this country must take a test with no accommodation simply because the 
child has always been educated in "English." Differences in sentence structure, word 
meaning, idioms, and nuance may affect the student's ability to perform on the test. 
Of course, there are times when mastery of the English language is the objective of 
the test. In these cases, students and clients should be given the opportunity to 
demonstrate their understanding of the language by being tested in the given lan- 
guage. When English competence is not the issue, however, the language considera- 
tions discussed must be examined. Although professional counselors are not often 
involved in developing assessments, in their role as advocates, they must ensure that 
issues of language are fully explored when assessing students and clients with limited 
English proficiency, bilingual abilities, or multilingual capabilities. 

Socioeconomic Status 

Research suggests that socioeconomic status is a significant factor in assessment 
(Flanagan, 1993). Herring (1997) suggested that social class is the most important 
factor affecting the counseling process. Furthermore, there is a line of research de- 
scribing the confounding issues of race, ethnicity, and social class (Fouad & Chan, 
1999). Although social class cuts across all races and ethnicities, poverty dispropor- 
tionately affects clients and families of color. When these findings are merged with 
census data regarding poverty rates, it becomes evident that professional counselors 
must be aware of issues of socioeconomic status and assessment. Socioeconomic sta- 
tus is about much more than money. Social class may affect students' values, world- 
views, emotional resources, and support systems (Payne, 2003). 

Student and Client Factors 

A host of student and client factors, including test-taking attitudes, experience and 
capabilities, motivation, and social desirability, can affect assessment (Cohen & 
Swerdlik, 1999; Drummond, 2000; Fouad & Chan, 1999). These factors, discussed 
in much greater detail in Chapter 8, are unique to the individual and may change 
from test situation to test situation. For example, it is conceivable that a teenager 
may perform better on a multiple-choice test on social studies vocabulary than a 
true-false test on the same vocabulary. The difference in performance may be due 
only to the student's familiarity with the test format. Or clients with a visual impair- 
ment may find that their test performance reflects their Braille and keyboarding 
skills rather than their knowledge of the material. Additionally, it is not dif- 
ficult to imagine a situation in which clients give the answer they feel the professional 
counselor wants, or the answer that is the most socially desirable. There is a strong 
literature base that suggests social desirability is a significant issue in assessment for 
many groups of clients (Marin & Marin, 1991). This factor may differentially affect 
cultural groups. 

Traditionally, professional counselors work with clients and students individu- 
ally and in small groups on decreasing test anxiety and strengthening test-taking 
strategies. Professional counselors should also provide direct student and client serv- 
ice and work as advocates to address these and other factors affecting assessment. 






Counselor and Examiner Factors 

Counselor and examiner factors comprise a final category of diversity issues that 
must be considered in assessment. Counselor and examiner factors include profes- 
sional competence; comfort with the assessment process; perceptions and worldview; 
race, ethnicity, and culture; and social influence. These important issues are ad- 
dressed in the Standards for Multicultural Assessment (AACE, 2003b): 

Culturally competent counselors have training and expertise in the use of traditional 
assessment and testing instruments. They not only understand the technical aspects of 
the instruments but also are aware of the cultural limitations. This allows them to 
use test instruments for the welfare of clients from diverse cultural, racial, and eth- 
nic groups. 

Selection of Assessment Instruments: Content and Purpose 

Culturally competent counselors have knowledge about their social impact on others. 

Interpretation and Application of Assessment Results 



BIAS IN ASSESSMENT 



Some or all of the factors discussed in the preceding section can result in assess- 
ment bias. Standardization samples may also affect bias (Reynolds & Brown, 
1984). According to Whiston (2005, p. 211), bias "refers to the degree that con- 
struct-irrelevant factors systematically affect a group's performance." Construct- 
irrelevant factors are those facets not related to the idea being assessed. An assess- 
ment or test item is said to be biased when "empirical evidence shows that it is 
more difficult for one group member than another, the general ability level of the 
two groups is held constant, and no reasonable rationale exists to explain the group 
difference on the same items" (Drummond, 2000, p. 356). Three types of bias — 
content bias, internal structure bias, and predictive bias — have particular implica- 
tions for diverse populations. 
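
The idea of holding ability constant while comparing item difficulty across groups can be illustrated with a rough sketch. The Python below is only an illustration under stated assumptions: the function name `item_pass_rates_by_band`, the record layout, the group labels, and the 10-point score bands are all hypothetical, and the data are invented. Formal bias studies rely on differential item functioning statistics rather than this simple tabulation.

```python
from collections import defaultdict


def item_pass_rates_by_band(records, band_width=10):
    """Tabulate the proportion answering one item correctly, by score band and group.

    records: iterable of (group, total_score, item_correct) tuples, where
    total_score stands in for general ability and item_correct is 1 or 0.
    Returns {band: {group: proportion_correct}}.
    """
    # band -> group -> [number correct, number of examinees]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, total_score, item_correct in records:
        band = total_score // band_width  # examinees in one band have similar totals
        cell = counts[band][group]
        cell[0] += int(item_correct)
        cell[1] += 1
    return {
        band: {group: correct / n for group, (correct, n) in groups.items()}
        for band, groups in counts.items()
    }


# Invented data: (group label, total test score, 1 if the studied item was correct)
records = [
    ("A", 52, 1), ("A", 55, 1), ("B", 53, 0), ("B", 58, 1),
    ("A", 71, 1), ("B", 74, 1),
]
rates = item_pass_rates_by_band(records)
```

If one group passes the item far less often than another within the same score band, and no reasonable rationale explains the gap, the item merits closer review.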



Content Bias 



Content bias refers to test material being more familiar to one group than another. 
Our earlier example involving professional counselors, psychologists, psychiatrists, 
family therapists, and social workers provides a simplistic illustration of content bias. 
Content bias is often less obvious when affecting multicultural populations, how- 
ever. Content bias may involve hidden messages or values of a culture that are not 
readily visible due to cultural encapsulation. Consider two well-documented items 
from the Wechsler Intelligence Scale for Children — Revised (WISC-R) (Kaplan & 
Saccuzzo, 2001; Sattler, 2001). One question asks, "What would you do if you were 
sent to buy a loaf of bread, and the grocer said he did not have any more?" Another 
question, which has been subject to much controversial investigation, asks, "What 
should you do if a child smaller than you begins to fight with you?" Although re- 
search findings differ, these questions appear to contain embedded cultural values, 
behaviors, and norms that may not hold consistent over all multicultural groups 
(Hardy, Welcher, Mellitis, & Kagan, 1976; Koh, Abbatiello, & McLoughlin, 1984; 
Sandoval, Zimmerman, & Woo-Sam, 1983). Student responses to these questions 
may not be a measure of intelligence, but rather a measure of cultural values, behav- 
iors, and norms. 



Internal Structure Bias 



Scores on an assessment may be reliable for one group, but not reliable for another. 
Or scores on an instrument may be more reliable for one group than another. This 
phenomenon is called internal structure bias. Internal structure bias can be due to 
norming factors or the underlying factor structure of an instrument. In light of this, 
some assessment instruments report differences between groups of test takers. For 
example, an assessment instrument may report differential reliability data based on 
gender, age, or ethnicity. 



Predictive Bias 



A test can also be biased if it systematically over- or underpredicts a group's perform- 
ance. This type of bias is called predictive bias. Many professional counselors are fa- 
miliar with debate about the ability of standardized assessments like the Scholastic 
Assessment Test-I (SAT-I) to predict students' performance in college (McCornack 
& McLeod, 1988). "Gifted and talented" testing and success in special accelerated 
educational programming embody another common area of concern regarding pre- 
dictive bias. Generally, predictive bias is investigated along the lines of gender, race, 
and ethnicity. 
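
One common way to look for systematic over- or underprediction is to fit a single prediction line to everyone and then examine the average prediction error within each group. The sketch below illustrates that idea only. The function names `fit_line` and `mean_residual_by_group`, the group labels, and the data are hypothetical, and a real study would use far larger samples and formal significance tests.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_residual_by_group(scores, outcomes, groups):
    """Average (actual - predicted) outcome per group under one common line.

    A positive mean means the common line underpredicts that group;
    a negative mean means it overpredicts that group.
    """
    slope, intercept = fit_line(scores, outcomes)
    residuals = {}
    for score, outcome, group in zip(scores, outcomes, groups):
        predicted = intercept + slope * score
        residuals.setdefault(group, []).append(outcome - predicted)
    return {group: mean(values) for group, values in residuals.items()}


# Invented data: test scores, later criterion outcomes, and group labels
scores = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
outcomes = [2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
bias = mean_residual_by_group(scores, outcomes, groups)
```

In these invented data the common line underpredicts group A and overpredicts group B by the same amount, which is the pattern the term predictive bias describes.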



Interpreting Test Scores With Caution 



Some test manuals and texts, including this one, use the phrase "interpret with cau- 
tion" to warn readers about possible problems with the interpretations of scores. So 
what does the warning actually mean? In the context of this discussion on diversity, 
it usually means that we don't know the consequences of interpreting the score for a 
given individual with diverse characteristics. For example, some tests have norms 
that undersampled participants from various cultural backgrounds. If test norms un- 
dersampled African Americans, for instance, interpretations of an African American 
client's score may result in some inaccuracies. Unfortunately, without extensive em- 
pirical study, it is often extremely difficult to determine what the possible effects of 
undersampling may be. Empirical studies often explore differences between partici- 
pants with diverse characteristics and provide helpful conclusions about whether 
scores generated by the test yield appropriate inferences about the examinee. While 
it is best practice to use tests that will yield reliable and valid scores for the individ- 
ual being tested, often such tests either do not exist or are suspect for individuals 
with certain characteristics. So when you encounter the phrase "interpret with cau- 
tion," it may have several different potential meanings, but the phrase always should 
be taken into account when making decisions about the client's life. 






Ensuring Fairness in Assessment 



Test bias is a critical and alarming issue. Nonetheless, tests and other forms of assess- 
ment do have an important role in educational and clinical settings. Sattler (2001) 
suggested that good assessment offers an objective standard, reveals disparity, ap- 
praises functioning, obtains appropriate programming, and evaluates programs. All 
of these functions of assessment are significant to the professional counselor's work 
with clients and students. How, then, can the professional counselor work to ensure 
fairness in testing? The question is complex and multifaceted. The following sugges- 
tions offer some initial strategies, interventions, and recommendations: 

■ Remember that the professional counselor's primary responsibility is the welfare 
of all clients. Ensure that the focus of any and all assessment is to benefit the 
client. 

■ Engage in professional development opportunities (e.g., continuing education 
and training) to continue to learn about self, multicultural counseling, and diver- 
sity in educational and clinical issues and settings. 

■ Continually monitor and challenge personal belief systems and attitudes regard- 
ing all aspects of diversity. 

■ Demonstrate competence in multicultural counseling knowledge, skills, and be- 
liefs. Employ culturally sensitive approaches when working with clients and 
families. 

■ Abide by the ACA Code of Ethics (2005a) and other pertinent standards, includ- 
ing the Standards for Multicultural Assessment (AACE, 2003b) and the 
Multicultural Counseling Competencies and Standards (Sue, Arredondo, & 
McDavis, 1992). 

■ Become familiar with assessment instruments and procedures for the given pop- 
ulation. As appropriate, become fully competent in all aspects of administration, 
interpretation, and application of assessment results. 

■ Do not attempt to use assessment procedures outside of your scope of ex- 
perience. 

■ Refer students and clients for assessment as warranted. 

■ Consult with other mental health professionals, including clinical psychologists, 
school psychologists, and social workers, to become familiar with the ways they 
use assessment to serve clients. 

■ Test clients and students in the appropriate language. Use only translations with 
established validity. 

■ Use only valid and appropriate test adaptations and modifications. Do not as- 
sume that counselor- or teacher-made changes are appropriate without first con- 
sulting the test manual. 

■ Consult with special educators, school psychologists, and other specialists to en- 
sure that students receive appropriate test accommodations. Accommodations 
may include changes in setting, scheduling, timing, presentation, or response 
format (Spinelli, 2002). 

■ Use multiple assessment methods to gain a more complete picture of a client or 
student. 



■ Clarify test purpose, procedure, and expectations to clients and students. 

■ Provide individual and group counseling support for stress and anxiety related to 
assessment as needed. 

■ Provide individual and group counseling support for motivation and test prepa- 
ration as needed. 

■ Actively advocate for continued research on culturally appropriate assessment 
and counseling intervention for all clients and students. 



SUMMARY/CONCLUSION 



This chapter has discussed various historical, ethical, legal, and diversity issues in as- 
sessment, and provided resources for understanding how best to use assessment re- 
sults in clinical practice. However, because legislation and litigation are an ongoing 
process, professional counselors must stay updated on current issues in assessment 
and must also continuously assess their behavior to ensure that it meets the highest 
ethical standards. Best practices in assessment are really ethical and legal practices. 

This chapter has also highlighted and summarized key events in the evolution 
of assessment, from its historic roots to its current ethical concerns. Knowledge of 
such events and issues helps present-day professional counselors to understand the 
context for today's concerns, both within the profession and in society at large. 
Today, professional counselors are involved in a variety of ways in ensuring that 
clients and students receive quality assessment. Legal and ethical standards mandate 
that all clients receive assessment that is appropriate, unbiased, and meaningful. This 
mandate challenges professional counselors to understand the implications of diver- 
sity and assessment, and all that is involved in administering culturally competent 
assessment and in interpreting results. With this charge in mind, assessment can 
offer useful and important information for diverse client populations. 



KEY TERMS 



acculturation 
achievement 
aptitude 
bias 
career assessment 
case law 
clinical assessment 
code of ethics 
confidentiality 
content bias 
culture 
diversity 
Family Educational Rights and Privacy Act (FERPA) 
Health Insurance Portability and Accountability Act (HIPAA) 
high-stakes testing (HST) 
Individual Education Plan (IEP) 
Individuals With Disabilities Education Improvement Act (IDEIA) 
informed consent 
intelligence 
internal structure bias 
laws 
multicultural assessment 
No Child Left Behind Act (NCLB) 
personality assessment 
policy 
predictive bias 
Protection of Pupil Rights Amendment (PPRA) 
regulation 
socioeconomic status 
vocational development 
worldview 







Chapter 3 



Reliability 

by Dimiter Dimitrov 



Reliability of scores is a critical issue in measurement. This chapter reviews 
basic principles in reliability, such as classical test theory and standard error 
of measurement in classical test theory. It also discusses the types of reliabil- 
ity commonly used by test developers, including internal consistency, test-retest, al- 
ternate form, criterion-referenced, and interscorer reliability. Finally, the concepts of 
attenuation and reliability of composite scores are discussed. Advanced concepts of 
dependability and generalizability of scores are included on the companion website 
for this text. 



WHAT IS RELIABILITY? 



Reliability means consistency. Measurements in the physical sciences can often be 
conducted with great precision (e.g., millimeters, grams). However, measurements in 
counseling, education, and related fields are not completely accurate and consis- 
tent — and are sometimes far from it. There is always some error involved, usually 
due to a person's conditions (e.g., mood, fatigue, momentary distraction) and/or ex- 
ternal conditions (e.g., noise, temperature, light), that may randomly occur during 
the measurement process. The way instruments of measurement (e.g., tests, inven- 
tories, or raters) are designed or the way questions or items are phrased may also af- 
fect the accuracy of the scores (observations). 

For example, it is unlikely that the scores of a person on two different forms of 
an anxiety test would be equal, because differently worded items often yield varying 
results. Also, different scores are likely to be assigned to a person when different pro- 
fessional counselors evaluate a specific attribute of the person (e.g., introversion, 
sociability, self-esteem). In another scenario, if a group of people takes the same test 
twice within a short period of time, one can expect the rank order of their scores on 
the two test administrations to be somewhat similar, but not exactly the same. In 
other words, one can expect a relatively high, yet not perfect, positive correlation of 
test-retest scores for this group of examinees. As still another example, when it comes 
to making placement decisions about clients, inconsistency may occur in different 
criterion-referenced classifications (e.g., pass-fail group labels or mastery-nonmas- 
tery group labels) based on measurements obtained through testing or subjective 
judgments of raters (e.g., teachers, parents). 
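
The test-retest scenario above can be made concrete with a small computation. The sketch below calculates a Pearson correlation between two administrations of the same test. The function name `pearson_r` and the five pairs of scores are invented for illustration and do not come from any real instrument.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for five examinees tested twice within a short period.
# The first two examinees trade places in the rank order on the retest.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 11, 10, 19, 18]

test_retest_r = pearson_r(first_administration, second_administration)
```

With these invented scores the rank order is similar but not identical across the two administrations, so the coefficient is high and positive yet falls short of a perfect 1.0.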

In measurement parlance, the higher the accuracy and consistency of measure- 
ment scores, the higher the reliability. The reliability of scores indicates the degree 
to which they are accurate, consistent, and repeatable when (a) different people con- 
duct the measurement, (b) different instruments are used that purport to measure 
the same trait (e.g., proficiency, ability, attitude, anxiety), and (c) there is incidental 
variation in measurement conditions (e.g., lighting, seating, temperature). In other 
words, reliable scores are produced by tests that are free from errors of measurement. 
Reliability is a key indicator of quality measurements with tests, surveys, inventories, 
or individuals (e.g., raters, judges, observers). Most important, reliability is a neces- 
sary (albeit not sufficient) condition for the validity of measurements. Validity refers 
to the meaningfulness, accuracy, and appropriateness of interpretations and decisions 
based on measurement data. Thus if professional counselors cannot measure a client 
characteristic consistently (reliability), they cannot make accurate interpretations 
(validity). 

It is important to note that reliability refers to the scores obtained with a test 
and not to the instrument itself. Previous studies and recent editorial policies of pro- 
fessional journals (e.g., Dimitrov, 2002; Sax, 1980; Thompson & Vacha-Haase, 
2000) emphasize that it is more accurate to refer to "reliability of measurement data" 
than to "reliability of tests" (e.g., items, questions, tasks). Tests cannot be accurate, 
stable, or unstable, but observations (scores) can be (i.e., tests are neither reliable nor 
valid, but scores on tests can be). Therefore, any reference to reliability of a test 
should be interpreted to mean the reliability of scores derived from the test. 

As is discussed in Chapter 4, the most important characteristic of any measure- 
ment is its validity — that is, the degree to which scores lead to meaningful and ap- 
propriate interpretations. To allow for such interpretations, however, the scores 
should be accurate and consistent (i.e., reliable). The criterion-related validity of an 
entrance examination, for example, is assessed by the correlation between the exam- 
inees' scores on this test and their scores on a criterion (e.g., grade point average at 
the end of the first academic year). However, under the classical model of reliability, 
a criterion-related validity coefficient of test scores cannot exceed the square root of their 
reliability. More simply put, the reliability of scores predetermines a "ceiling" for the 
validity of a test's scores. How closely this ceiling will be approached depends on 
other factors as well. But at this point it is essential to understand that reliability is a 
necessary, but not sufficient, condition for validity. That is, high validity can occur 
if test scores are highly reliable but cannot occur if test scores have low levels of
reliability. On the other hand, just because test scores are highly reliable does not mean
they will have high validity. For example, just because you can measure your height 
consistently (high reliability) does not mean that height indicates intelligence (low 
validity). 



THE CLASSICAL MODEL OF RELIABILITY 
True Score 



Scores on performance tests, personality inventories, expert evaluations, and even 
physical measurements are not completely accurate, consistent, and repeatable. For 
example, although the height of a person (i.e., one's "true height") remains constant 
throughout repeated measurements within a short period of time (say, 15 minutes) 
using the same scale, the observed values would be scattered around this "true 
height" due to the equipment being used or imperfection in the visual acuity of the 
measurer (whether the same examiner or somebody else). Thus, if T denotes the
person's constant true height, then the observed height (X) in any of the repeated
measurements will deviate from T with an error of measurement (E). That is,

X=T+E (3.1) 

In classical test theory, one often refers to a client's observed score (X, the score
the client received on a test) and the client's true score (T, the score the client would
have received if the test and testing conditions were free of error [E]). Thus, if E = 0
(i.e., there is no error), the observed score is the true score (i.e., if E = 0, then X = T).

To grasp what is meant by true score in classical test theory, imagine that a per- 
son takes a standardized intelligence test each day for 100 days in a row. The person 
would likely obtain a number of different observed scores over these occasions. The 
mean of all observed scores would represent an approximation of the person's true 
score (T) on the standardized intelligence test. In general, the true score is the
average of the (theoretical) distribution of scores that would be observed in repeated
independent measurements of a person with the same test. Importantly, the true score
(T) is a hypothetical concept, for it is not practically possible to test the same person
an infinite number of times in independent repeated measurements because each testing
could influence the subsequent testing (e.g., through practice effects or memory effects).

It is important to note that the error in Equation 3.1 is assumed to be random 
in nature. Possible sources of random error are (1) fluctuations in the mood or 
alertness of persons taking the test due to fatigue, illness, or other recent experi- 
ences; (2) incidental variation in the measurement conditions due, for example, to 
outside noise or inconsistency in the administration of the instrument; (3) differ- 
ences in scoring due to factors such as scoring errors, subjectivity, or clerical errors; 
and (4) random guessing on response alternatives in tests or questionnaire items. 
Conversely, systematic errors that remain constant from one measurement to an- 
other do not lead to inconsistency and therefore do not affect the reliability of the 
scores. Systematic errors will occur, for example, when one professional counselor 
assigns 2 points lower than another professional counselor to each person in a
group of examinees. So, again, the reliability of any measurement is the extent to 
which the measurement results are free of random errors. Random error affects relia- 
bility; systematic error does not. 
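
The distinction between random and systematic error can be illustrated with a small numerical sketch. The ratings and error values below are hypothetical (they do not come from the text), and the helper function `pearson` is written out only for illustration. A constant 2-point difference between two raters leaves the correlation between their scores at 1.0, whereas errors that vary from person to person lower it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores assigned by one counselor to six examinees.
rater_a = [12, 15, 9, 20, 17, 11]

# Systematic error: a second counselor scores everyone exactly 2 points lower.
rater_b = [score - 2 for score in rater_a]

# Random error: hypothetical errors that differ from person to person.
errors = [1, -2, 0, 2, -1, 1]
rater_c = [score + e for score, e in zip(rater_a, errors)]

r_systematic = pearson(rater_a, rater_b)  # consistency is unaffected
r_random = pearson(rater_a, rater_c)      # consistency is reduced
print(round(r_systematic, 3), round(r_random, 3))
```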



Classical Definition of Reliability 



Equation 3.1 represents the classical assumption that any observed score (X) consists
of two parts: true score (T) and error of measurement (E). Because errors are random,
it is assumed that they do not correlate with the true scores (i.e., $r_{TE} = 0$). Indeed,
there is no reason to expect that persons with higher true scores would have
systematically larger (or smaller) measurement errors than persons with lower true scores.
Under this assumption, Equation 3.2 is true for the variances ($\sigma^2$) of observed scores,
true scores, and errors for a population of test takers:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 \qquad (3.2)$$

that is, the observed score variance ($\sigma_X^2$) is the sum of true score variance ($\sigma_T^2$) and
error variance ($\sigma_E^2$). Given this, the reliability of measurements, $r_{XX}$, indicates what
proportion of the observed score variance is true score variance. The analytic translation of
this definition is



$$r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} \qquad (3.3)$$



The definition of reliability implies that the reliability takes values from 0.00 to
1.00. The closer $r_{XX}$ is to 1.00, the higher the reliability, and, conversely, the closer
$r_{XX}$ is to zero, the lower the reliability. Perfect reliability ($r_{XX} = 1.00$) can theoretically
occur when the total observed score variance is true score variance ($\sigma_X^2 = \sigma_T^2$) or,
equivalently, when the error variance is zero ($\sigma_E^2 = 0$).
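
As a brief sketch of Equations 3.2 and 3.3, the function below computes reliability as the share of observed score variance that is true score variance. The function name and the variance values are hypothetical and chosen only for illustration; in practice true score variance cannot be observed directly.

```python
def reliability(true_variance, error_variance):
    """Proportion of observed score variance that is true score variance."""
    observed_variance = true_variance + error_variance  # Equation 3.2
    return true_variance / observed_variance            # Equation 3.3


# Hypothetical variances chosen only for illustration.
r_xx = reliability(true_variance=80.0, error_variance=20.0)
print(r_xx)

# With no error variance, all observed variance is true score variance.
r_perfect = reliability(true_variance=50.0, error_variance=0.0)
print(r_perfect)
```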

In general, reliability coefficients in the 0.80s are desirable for screening tests, 
0.90s for diagnostic decisions (Salvia & Ysseldyke, 2004). Reliabilities of less than 
0.80 indicate substantial error variance and subsequent inconsistent conclusions. 
This is not to say that scores based on $r_{XX} < 0.80$ cannot be helpful for hypothesis
generation (exploring problems or strengths in areas of client functioning); for hy- 
pothesis validation (confirming suspected problems or strengths in areas of client 
functioning); or for instruments used in research studies for the purpose of defining 
a construct (e.g., self-efficacy, anxiety). However, important decisions about a client's 
life should be based on more consistently derived information. 



Standard Error of Measurement (SEM) 



Classical test theory also proposes two additional assumptions: (a) that the distribu- 
tion of observed scores that a person may obtain under repeated independent test- 
ings with the same test is normal, and (b) that the standard deviation of this normal 
distribution, referred to as the standard error of measurement (SEM), is the same 
for all persons taking the test. Figure 3.1 represents a hypothetical normal
distribution of observed scores for a person with a true score of 20 for a specific test. The
mean of the distribution is the person's true score (T = 20), and the standard
deviation is the standard error of measurement (SEM = 2).

Figure 3.1 Theoretical distribution of observed scores for repeated independent
testings of one person with the same test

Based on the statistical properties for normal distributions, about 95% of the 
scores fall in the interval from 2 standard deviations below the mean to 2 standard 
deviations above the mean. In Figure 3.1, this is the interval from T - 2(SEM) to
T + 2(SEM), which in this case is from 16 to 24 [i.e., 20 - 2(2) to 20 + 2(2)]. This
property can be used to construct (approximately) a 95% confidence interval of a
person's true score (T) falling within the given observed score (X) range based on the
person's performance in a single testing:



$$X - 2(\mathit{SEM}) < T < X + 2(\mathit{SEM}) \qquad (3.4)$$



For example, if 23 is the person's observed score in a single real testing (X = 23),
then the true score of this person is expected (with about 95% confidence) to fall in
the interval from 23 - 2(SEM) to 23 + 2(SEM). This range of scores within which
the true score probably lies is called a confidence interval because it gives the degree
of confidence an examiner can expect regarding whether the client's true score lies
within the given interval. In this example, with SEM = 2, the 95% confidence
interval for the person's true score is from 23 - 2(2) to 23 + 2(2), or from 19 to 27.
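
The interval in Equation 3.4 can be sketched in a few lines of code. The function name and the `n_sem` parameter are illustrative assumptions; the default of 2 corresponds to the approximate 95% level used in the example above.

```python
def confidence_interval(observed, sem, n_sem=2):
    """Return (lower, upper) bounds: observed score plus or minus n_sem SEMs."""
    return observed - n_sem * sem, observed + n_sem * sem


# The worked example from the text: X = 23, SEM = 2, about 95% confidence.
lower, upper = confidence_interval(observed=23, sem=2)
print(lower, upper)

# The same score at about the 68% level (plus or minus 1 SEM).
lower_68, upper_68 = confidence_interval(observed=23, sem=2, n_sem=1)
print(lower_68, upper_68)
```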

When it comes to understanding and using confidence intervals, it is useful to
know that (a) about 68% of all possible observed scores in Figure 3.1 fall in the
interval from T - 1(SEM) to T + 1(SEM), i.e., from 18 to 22 in this case; (b) about
95% of all possible observed scores in Figure 3.1 fall in the interval from T - 2(SEM)
to T + 2(SEM), i.e., from 16 to 24 in this case; and (c) almost all (99.7%) of the
observed scores in Figure 3.1 are in the interval from T - 3(SEM) to T + 3(SEM),
which in this case is from 14 to 26. You may have noticed that these percentages (i.e., 
68%, 95%, 99.7%) are the same percentages under the normal curve used in the 
discussion of standard deviation. This is because the SEM is, in effect, the standard 
deviation for the individual, with the individual's true test score standing at the cen- 
ter and the SEM serving as the "personal standard deviation," based on the test score 
reliability coefficient. 

A smaller SEM will produce smaller confidence intervals for the person's true 
score, thus improving the accuracy of measurement. Also, because the SEM is in- 
versely related to reliability, high reliability indicates high accuracy of measurements 
(lower SEM). SEMs are much more helpful than reliability coefficients when
reporting client test scores. The reliability coefficient is a unitless number between 0 and 1
conveniently used to report reliability in empirical studies. But the SEM relates
directly to the meaning of the test's scale of measurement (e.g., raw number-right
score, deviation IQ score, T score, z-score) and is therefore more useful for score
interpretations (e.g., Feldt & Brennan, 1989; Thissen, 1990). The SEM is related to
the reliability, $r_{XX}$, and the standard deviation of the observed scores, $\sigma_X$, as follows:

$$\mathit{SEM} = \sigma_X \sqrt{1 - r_{XX}} \qquad (3.5)$$

To compute the SEM, one needs to know the reliability and standard deviation 
of the client's test score. For example, if the reliability is 0.90 and the standard devi- 
ation of the client's observed scores is 15 (such as is the case for the deviation IQ, a 
standard score scale with an M = 100 and SD = 15 — a scale commonly used in in- 
telligence and achievement tests), then the standard error of measurement is 

$$\mathit{SEM} = 15\sqrt{1 - 0.9} = 15(0.3162) = 4.743.$$
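
The same computation can be sketched in code. The function name is an illustrative assumption; the inputs are the standard deviation and reliability from the deviation IQ example above.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), as in Equation 3.5."""
    return sd * sqrt(1.0 - reliability)


# Deviation IQ example from the text: SD = 15, reliability = 0.90.
sem = standard_error_of_measurement(sd=15, reliability=0.90)
print(round(sem, 3))
```

Because reliability enters through a square root, even scores with a reliability of 0.90 carry an SEM of nearly a third of a standard deviation.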

Some test manuals leave it to the test user to compute the client's confidence in- 
terval, sometimes providing only reliability coefficients; others provide confidence 
intervals in norm conversion tables. Professional counselors understand that even 
though it is often necessary to make decisions about clients based on an observed or 
obtained score, it is not appropriate to interpret a single observed score to a client. 
Instead, it is appropriate to report and interpret the range of scores within which the 
true score probably lies. 

Furthermore, it is most appropriate to interpret these scores at the 95% level of 
confidence (± 2 SEM). Some test manuals and computer scoring programs recom- 
mend interpretation at the 68% level of confidence, which means that the client's 
true score will fall outside the suggested range in 1 out of every 3 reports (i.e., the 
68% level results in an average "mistake rate" of 32%!). Most clinicians (and clients) 
find it unacceptable to be wrong in 1 of every 3 decisions, especially decisions
related to diagnosis and treatment. Using the 95% level of confidence (± 2 SEM)
means that the true score falls in the reported range 95 out of 100 administrations.
A 5% error rate is much more acceptable in clinical practice, especially when
making decisions about people's lives that may influence treatment for months or years
into the future.

Consider the following examples of how to apply SEM to score interpretation. 
If a client's full-scale IQ (FSIQ) score on the WAIS-III is 110, and the SEM is equal
to 4 standard score points, the client's IQ could be interpreted at the 95% level of
confidence (± 2 SEM) as 110 ± 8 (i.e., 2 × 4). Thus, on 100 alternate-form
administrations of the WAIS-III, the client's FSIQ would probably fall within the FSIQ
range of 102-118 about 95 times. This means that the professional counselor may
have 95% confidence that the client's true IQ score falls between 102 and 118 (also
referred to as the Average to High Average range). Likewise, the client's Conners'
Adult ADHD Rating Scales (CAARS) (Conners, Erhardt, & Sparrow, 1999) DSM-IV
inattention scale T score of 71, with an SEM = 3 points (T score units), would be
interpreted at the 95% level of confidence (± 2 SEM) as 71 ± (2 × 3) = 71 ± 6. Thus,
on 100 alternate-form administrations of the CAARS DSM-IV inattention scale, the
client's T score would probably fall within the T score range of 65-77 about 95
times.



Think About It 3.1 If a client's observed score on the MMPI-2 
Depression scale is a T score (M = 50, SD = 10) of 67, and the scale's reliability
is 0.82, what is the client's likely range of scores at the 95% level of confidence?
Given this information, would you be inclined to support a diagnosis
of depression for this client? Explain. 



TYPES OF RELIABILITY 



The reliability of test scores for a population of examinees is defined as the ratio of
their true score (T) variance to observed score variance (see Equation 3.3).
Equivalently, the reliability can also be represented as the squared correlation
between true and observed scores (i.e., $r_{XX} = r_{XT}^2$). Unfortunately, in empirical research,
true scores cannot be directly determined. Thus the reliability is typically estimated 
by coefficients of internal consistency, test-retest, alternate forms, and other types of 
reliability estimates adopted in the measurement literature. It is important to em- 
phasize that different types of reliability relate to different sources of measurement 
error and, contrary to common misconceptions, are generally not interchangeable. 



Internal Consistency 



Internal consistency estimates of reliability are based on the average correlation 
among items within a test or scale. A huge advantage of internal consistency is that 
participants need to receive only one administration of a single test on a single occa- 
sion. A widely known method for determining internal consistency of test scores is 
split-half reliability. Using the split-half method, the researcher literally divides the 
questions into two halves, either by an odd-even method or by some other strategy. 
Each half of the items is treated as a separate test, and the total scores of these two 
half-tests for each participant are correlated together. With this method, the two 
halves are assumed to be parallel (i.e., the two halves have equal true scores and equal 
error variances). 



However, because halving the number of items on a test substantially lowers 
the correlation (i.e., all other things being equal, the greater the number of items, 
the higher the correlation — thus halving the number of items lowers the correla- 
tion), an estimation formula is required to predict what the internal consistency 
of the items would be if returned to the size of the original complement of items. 
The score reliability of the whole test is estimated using the Spearman-Brown 
Prophecy formula: 

$$r_{XX} = \frac{2 r_{12}}{1 + r_{12}} \qquad (3.6)$$

where r 12 is the Pearson correlation between the scores on the two halves of the test. 
For example, if the correlation between the two test halves is 0.6, then the split-half 
reliability estimate is $r_{XX} = 2(0.6)/(1 + 0.6) = 0.75$.

The Spearman-Brown Prophecy formula can also be used to determine the 
likely result of adding more items to a given scale. Following on the example above, 
if the number of test items yielding the internal consistency coefficient of 0.75 were 
doubled yet again (this is what the value 2 in the numerator designates), the
resulting reliability coefficient would be $r_{XX} = 2(0.75)/(1 + 0.75) = 1.50/1.75 = 0.86$.
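
The two computations above can be checked with a short sketch. The function name and the `length_factor` parameter are illustrative assumptions: the text covers only doubling (a factor of 2), and the general form used here, k·r / (1 + (k - 1)·r), is the standard extension of the Spearman-Brown formula to a test lengthened k times.

```python
def spearman_brown(r, length_factor=2):
    """Predicted reliability when a test is lengthened by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)


# Split-half correlation of 0.6 stepped up to the full-length test.
whole_test = spearman_brown(0.6)

# Doubling the number of items once more.
doubled_again = spearman_brown(whole_test)

print(round(whole_test, 2), round(doubled_again, 3))
```

Each doubling adds less than the one before it, so lengthening a test gives diminishing returns in reliability.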

How one splits the items into two equivalent halves when computing internal 
consistency is very important. One commonly used approach to forming test halves, 
called the odd-even method, is to assign the odd-numbered test items to one half and 
the even-numbered test items to the other half of the test. This method is particu- 
larly appropriate when the items are presented in order of increasing difficulty, such 
as on an achievement or intelligence test. Perhaps an even more appropriate method 
would be to stagger the assignments to even out the item difficulty levels (i.e., sum 
items 1, 4, 5, 8, 9 versus items 2, 3, 6, 7, 10).
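
A minimal sketch of the odd-even method follows. The response matrix is hypothetical (five persons, six items scored 1 or 0), and the `pearson` helper is written out only so that the example is self-contained. Half-test totals are correlated and the result is stepped up with the Spearman-Brown formula (Equation 3.6).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical right/wrong (1/0) responses: one row per person, six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Items 1, 3, 5 sit at indices 0, 2, 4; items 2, 4, 6 at indices 1, 3, 5.
odd_totals = [sum(row[0::2]) for row in responses]
even_totals = [sum(row[1::2]) for row in responses]

r_halves = pearson(odd_totals, even_totals)
r_whole = 2 * r_halves / (1 + r_halves)  # Spearman-Brown step-up
print(odd_totals, even_totals, round(r_halves, 3), round(r_whole, 3))
```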

A more recommended approach, called matched random subsets, involves three 
steps. First, two statistics are calculated for each item: the proportion of individuals 
who answered the item correctly (i.e., the item difficulty) and the point-biserial cor- 
relation between the item and the total test score. Second, each item is plotted on a 
graph using these two statistics as coordinates of a dot representing the item. Third, 
items that are close together on the graph are paired, and one item from each pair is 
randomly assigned to each half of the test. 

Computer programs, such as SPSS, are frequently used to compute internal 
consistency estimates. Researchers and test users should use caution to ensure that 
proper item matching procedures were used, lest the computer default to a
procedure that will overestimate a scale's internal consistency, leading to undue confidence
in score reliability. Importantly, if the instrument consists of different scales yielding 
interpreted scores, internal consistency should be estimated for each scale. For ex- 
ample, the Disruptive Behavior Rating Scale (DBRS) (Erford, 1993) is composed of 
four subscales: Distractible, Oppositional, Impulsive-Hyperactive, and Antisocial 
Conduct. There is no interpretable total score, and each subscale score is interpreted 
as a separate subscale. Thus internal consistency coefficients for the observed scores 
on each scale are of interest.



The Spearman-Brown Prophecy formula is not appropriate when there are in- 
dications that the test halves are not parallel (e.g., when the two test halves do not 
have equal variances). In such cases, the internal consistency of the scores for the 
whole test can be estimated with Cronbach's coefficient $\alpha$ (Greek letter alpha)
using the formula (Cronbach, 1951): 

$$\alpha = \frac{2[\mathrm{VAR}(X) - \mathrm{VAR}(X_1) - \mathrm{VAR}(X_2)]}{\mathrm{VAR}(X)}, \qquad (3.7)$$

where $\mathrm{VAR}(X)$, $\mathrm{VAR}(X_1)$, and $\mathrm{VAR}(X_2)$ represent the sample variance of the whole
test, its first half, and its second half, respectively. For example, if the observed score
variance for the whole test is 40 and the observed variances for the two test halves are
12 and 11, respectively, then coefficient alpha ($\alpha$) = 2(40 - 12 - 11)/40 = 0.85.
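
The worked example for Equation 3.7 can be reproduced with a short sketch; the function and argument names are illustrative assumptions.

```python
def alpha_two_halves(var_total, var_half1, var_half2):
    """Coefficient alpha from whole-test and half-test variances (Equation 3.7)."""
    return 2 * (var_total - var_half1 - var_half2) / var_total


# Values from the text: whole-test variance 40, half-test variances 12 and 11.
alpha = alpha_two_halves(var_total=40, var_half1=12, var_half2=11)
print(round(alpha, 2))
```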

The coefficient alpha is usually calculated for more than two components of the 
test, and when item response formats are multiscaled (e.g., Very Dissatisfied, 
Dissatisfied, Satisfied, Very Satisfied; or Almost Never, Sometimes, Frequently, 
Almost Always). Each test component is an item or a set of items. Sometimes it is 
helpful to see the mathematical formulas to understand what comprises $\alpha$. But if you
find this confusing, don't worry. Computers do all of these computations nowadays 
in a split second, using programs such as SPSS. 

The general formula for alpha (see Equation 3.8) is simply an extension of 
Equation 3.7 for more than two test components: 

$$\alpha = \frac{n}{n-1}\left[1 - \frac{\sum \mathrm{VAR}(X_i)}{\mathrm{VAR}(X)}\right], \qquad (3.8)$$

where n is the number of test components (usually the number of items), $X_i$ is the
observed score on the ith test component, $\mathrm{VAR}(X_i)$ is the variance of $X_i$, X is the
observed score for the whole test (i.e., $X = X_1 + X_2 + \dots + X_n$), $\mathrm{VAR}(X)$ is the variance
of X, and $\Sigma$ (Greek capital letter sigma) is the summation symbol.
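
For readers who want to see Equation 3.8 carried out, the sketch below computes coefficient alpha from component scores. The ratings are hypothetical, the function name is an assumption, and sample variances (`statistics.variance`) are used throughout; any consistent choice of variance estimator gives the same ratio.

```python
from statistics import variance


def cronbach_alpha(components):
    """Coefficient alpha (Equation 3.8).

    components: one list of scores per test component, with persons
    listed in the same order in every list.
    """
    n = len(components)
    totals = [sum(scores) for scores in zip(*components)]  # X for each person
    sum_component_var = sum(variance(c) for c in components)
    return n / (n - 1) * (1 - sum_component_var / variance(totals))


# Hypothetical 4-point ratings from five respondents on three items.
item_1 = [4, 3, 2, 3, 1]
item_2 = [4, 2, 1, 3, 1]
item_3 = [3, 3, 2, 4, 2]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(round(alpha, 3))

# With only two components, the result reduces to Equation 3.7.
alpha_two = cronbach_alpha([item_1, item_2])
print(round(alpha_two, 3))
```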

When each test component is a dichotomously scored item (1 = correct [or true],
0 = incorrect [or false]), the coefficient $\alpha$ can be calculated by an equivalent formula,
called Kuder-Richardson formula 20 (see Equation 3.9), with the notation KR-20 (or
$\alpha$-20) for the coefficient of internal consistency:

$$\text{KR-20} = \frac{n}{n-1}\left[1 - \frac{\sum p_i(1 - p_i)}{\mathrm{VAR}(X)}\right], \qquad (3.9)$$

where n is the number of test items, X is the observed score for the whole test,
$\mathrm{VAR}(X)$ is the variance of X, $p_i$ is the proportion of persons who answered item i
correctly, and $p_i(1 - p_i)$ is the variance of the observed binary scores on item i ($X_i = 1$
or 0); that is, $\mathrm{VAR}(X_i) = p_i(1 - p_i)$.
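
A minimal sketch of Equation 3.9 follows. The response matrix is hypothetical and the function name is an assumption. Because $p_i(1 - p_i)$ is the population form of the item variance, the sketch also uses the population variance of total scores (`statistics.pvariance`) so that numerator and denominator are on the same footing; this pairing is an assumption of the sketch, not something the text specifies.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for 1/0 item scores; one row per person, one column per item."""
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_persons  # proportion correct
        sum_pq += p * (1 - p)                             # item variance

    return n_items / (n_items - 1) * (1 - sum_pq / pvariance(totals))


# Hypothetical right/wrong (1/0) responses from five persons on six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

kr20_value = kr20(responses)
print(round(kr20_value, 3))
```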

Again, high-speed computer programs, such as SPSS, make the computation of 
coefficient $\alpha$, or KR-20, rather simple.

Recall from Chapter 1 that speeded tests are those on which few clients miss any 
items, but the score is determined by how many items a client finishes in a given
period of time. With a speeded test, the split-half correlation coefficient ordinarily would
be close to zero if the test were split into the first half of items versus the second half 
of items, since most examinees would correctly answer almost all items in the first 
half and (running out of time) would miss most items in the second half of the test. 
Likewise, if the odd-even splitting method is used for a speeded test, the resulting 
correlation would be artificially high because clients usually would get all items cor- 
rect up until the point at which time ran out, and all subsequent items would be 
marked incorrect. Thus the score for odd items would almost always be within 1 
point of the even-item total. When determining the internal consistency of speeded 
tests, it is generally appropriate to split the test by time intervals, rather than items, 
and to combine the raw scores for these intervals into the two test halves. For exam- 
ple, on the WISC-IV's Coding subtest, one could observe how many items were re- 
sponded to correctly during each of the eight 15-second intervals that comprise the 
2-minute subtest. Then the number of items correctly responded to during the odd 
(1st, 3rd, 5th, and 7th) 15-second intervals could be summed and correlated with 
the sums of the even (2nd, 4th, 6th, and 8th) 15-second intervals for each partici- 
pant in the study. 



Test-Retest Reliability 



The extent to which the same persons consistently respond to the same test, inven- 
tory, or questionnaire administered on different occasions is known as the test-retest 
reliability of test scores. Sometimes test-retest reliability is also called temporal stabil- 
ity, meaning stability over time. Test-retest reliability is estimated by the correlation 
between the observed scores of the same people taking the same test twice; that is, 
the same participants take the same test on two separate occasions. The resulting cor- 
relation coefficient is also referred to as the coefficient of stability, because the primary 
source of measurement error is instability over time. Because tests are frequently used
to track therapeutic progress or the effects of medication, test-retest reliability can 
provide helpful insights into how client scores are likely to vary simply due to a read- 
ministration of the same test on a second occasion. 

The major problem with test-retest reliability estimates is the potential for car- 
ryover effects between the two test administrations. Readministration of the test 
within a short period of time (e.g., a few days or weeks) may produce carryover ef- 
fects due to memory and/or practice. For example, students who take a math or vo- 
cabulary test may look up some answers they were unsure of after the first adminis- 
tration of the test, thereby changing their true knowledge on the content measured 
by the test. Likewise, the process of completing an anxiety inventory could trigger an 
increase in the anxiety level of some people, thus causing their true anxiety scores to 
change from one administration of the inventory to the next. This happens if the 
client is more or less anxious on a second administration of the anxiety inventory. 

If the construct (attribute) being measured varies over time (e.g., cognitive 
skills, depression), a long period of time between the two administrations of the 
instrument may produce carryover effects due to biological maturation, cognitive 
development, or changes in information, experience, and/or moods. For example,
if a student learns a lot about math between the first and second administration of 
a math achievement test, the student's score may increase substantially. Likewise, 
a client with depression who is administered the Beck Depression Inventory — Second 
Edition (BDI-II) (Beck, Steer, & Brown, 1996) may receive a lower score on the 
second administration of the BDI-II six months later, regardless of whether treat- 
ment was successful. 

Thus, test-retest reliability estimates are most appropriate for measurements of 
traits that are stable across the time period between the two test administrations (e.g., 
visual or auditory acuity, personality, work values). In addition to problems with car- 
ryover effect, there is also a practical limitation to retesting, because it is usually time 
consuming and/or expensive. For many tests, retesting solely for the purpose of es- 
timating score stability may be impractical, although it is frequently of interest to 
clinicians using tests as an outcome measure to know what degree of consistency to 
expect on test readministration. 

On a final note, researchers should always report the time interval between the 
first and second administrations of the test. This is because, normally, the longer the 
period of time between the two administrations, the lower the reliability (e.g., the 
greater the chances that some external factor or developmental change will occur). 



Alternate Forms Reliability (Equivalent Forms Reliability) 



One way of counteracting the practice effects that occur in test-retest reliability is to 
design two equivalent versions of a test. If two versions of an instrument (test, inven- 
tory, or questionnaire) have very similar observed score means, variances, and corre- 
lations with other measures, they are called alternate forms or equivalent forms of 
the instrument. In fact, any decent attempt to construct parallel tests is expected to 
result in alternate test forms, as it is practically impossible to obtain perfectly paral- 
lel tests (i.e., equal true scores and equal error variances). Alternate forms usually are 
easier to develop for instruments that measure, for example, abilities and aptitudes 
or specific academic abilities because of the larger potential item pools (i.e., domains 
of knowledge) than those that measure constructs that are more difficult to repre- 
sent with measurable variables (e.g., personality, motivation, temperament, anxiety). 
Thus professional counselors will frequently see alternate forms of achievement tests 
(e.g., Forms A and B of the WJ-III ACH [Woodcock, Mather, & McGrew, 2001] and
the Blue and Tan forms of the WRAT-III [Wilkinson, 1993]), but they only rarely 
see alternate forms purposefully designed by a test author in the intellectual, behav- 
ioral, or personality domains. 

Alternate form reliability is a measure of the consistency of scores on alternate 
test forms administered to the same group of individuals — that is, two equivalent 
tests administered to the same participants on two separate occasions. The correla- 
tion between observed scores on two alternate test forms, referred to as the coefficient 
of equivalence, provides an estimate of the reliability of each of the alternate forms 
based on item content, scorer, and temporal stability. Just as with the test-retest
reliability coefficients, the estimates of alternate form reliability are subject to carryover
(practice) effects, but to a lesser degree, as the persons are not tested twice with the 
same items. To minimize carryover effects, a recommended rule of thumb is to have 
a 2-week time period between administrations of alternate test forms. 

Whenever possible, it is important to obtain both internal consistency coeffi- 
cients and alternate forms correlations for a test. If the correlation between alternate 
forms is much lower than the internal consistency coefficient (e.g., a difference of 
0.20 or more), this might be due to (a) differences in content, (b) subjectivity of 
scoring, and (c) changes in the trait being measured over time between the adminis- 
trations of alternate forms. To determine the relative contribution of these sources of 
error, it is usually recommended to administer the two alternate forms on the same 
day for half a sample of respondents, and then after a 2-week time interval for the 
other half of the sample (so long as the number of participants in each group is at 
least 10 for empirical purposes). If the correlation between the scores on the
alternate forms for the same-day administration is much higher than the correlation 
for the 2-week time interval, then variation in the trait being measured is a major 
source of error (i.e., temporal instability). For example, it is likely that measures of 
mood will change over a 2-week time interval, and thus the 2-week correlation will 
be lower than the same-day correlation between the alternate forms of the instru- 
ment. However, if the two correlations are both low, the persons' scores may be sta- 
ble over the 2-week time interval, but the alternate forms probably differ in content. 

Likewise, when scores on alternate forms of an instrument are assigned by raters 
(e.g., counselors, parents, teachers), one may check for scoring subjectivity by using 
a three-step procedure: (1) randomly split a large sample of persons; (2) administer 
the alternate forms on the same day for one group of people; and (3) administer the 
alternate forms after a 2-week time interval for the other group of people. If the cor- 
relations between raters are high for both groups, there is probably little scoring error 
due to subjectivity. If the correlation over the 2-week time interval and the same-day 
correlation are both consistently low across different raters, it is difficult to deter- 
mine the major sources of scoring errors. Such errors can be reduced by training the 
raters in using the instrument and by providing clear guidelines for scoring behav- 
iors or traits being measured. 



Reliability of Criterion-Referenced Tests 



Criterion-referenced measurements show how the examinees stand with respect to 
an external criterion. The criterion is usually some specific educational or perform- 
ance objective, such as "can apply basic algebra rules," "is able to recognize patterns," 
or even "is at risk for depression." 

Most teacher-made tests are criterion referenced because the teacher is more in- 
terested in how well students master coursework (criterion referenced) rather than 
how students did when compared with other students (norm referenced). Likewise, 
professional counselors frequently want to know whether a client has "enough" of a 
mental disorder (depression, anxiety, oppositional behavior) to warrant a diagnosis.
This is also a situation calling for criterion-referenced measurement. Because a
criterion-referenced test may cover numerous specific objectives (criteria), each
objective should be measured as accurately as possible. When the results of
criterion-referenced measurements are used for classifications related to mastery or
nonmastery of the criterion, the reliability of such classifications is often referred to as
classification consistency. This type of reliability shows the consistency with which
classifications are made, either by the same test administered on two occasions or by
alternate test forms.

Table 3.1 Contingency table for mastery-nonmastery classifications

                              Form B
                      Master        Nonmaster
Form A   Master       $p_{11}$      $p_{12}$      $P_{A1}$
         Nonmaster    $p_{21}$      $p_{22}$      $P_{A2}$
                      $P_{B1}$      $P_{B2}$

Two classical indices of classification consistency are (a) $P_o$ = the observed 
proportion of persons consistently classified as mastery versus nonmastery and (b) 
Cohen's $\kappa$ (Greek letter kappa) = the proportion of nonrandom consistent 
classifications. Their calculation is illustrated for the two-way data layout in 
Table 3.1, where the entries are proportions of persons classified as masters or 
nonmasters by two alternate test forms of a criterion-referenced test (Form A and 
Form B). Specifically, $p_{11}$ is the proportion of persons classified as "mastery" 
(those who mastered the content to the specified level) by both test forms; $p_{12}$ 
is the proportion of persons classified as "mastery" by Form A and "nonmastery" by 
Form B; $p_{21}$ is the proportion classified as "nonmastery" by Form A and "mastery" 
by Form B; and $p_{22}$ is the proportion classified as "nonmastery" by both forms of 
the test. Also, $P_{A1}$, $P_{A2}$, $P_{B1}$, and $P_{B2}$ are notations for marginal 
proportions, that is: $P_{A1} = p_{11} + p_{12}$, $P_{B1} = p_{11} + p_{21}$, 
$P_{A2} = p_{21} + p_{22}$, and $P_{B2} = p_{12} + p_{22}$. The observed proportion of 
consistent classifications (mastery/nonmastery) is 

$$P_o = p_{11} + p_{22}$$    (3.10) 



However, $P_o$ can be a misleading indicator of classification consistency, because 
part of it may occur by chance. Cohen's kappa (see Equation 3.11) takes into account 
the proportion of consistent classification that is theoretically expected to occur by 
chance, $P_e$, and provides a ratio of nonrandom consistent classifications 

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$    (3.11) 

where $P_e$ is obtained by summing the cross-products of marginal proportions in 
Table 3.1: $P_e = P_{A1}P_{B1} + P_{A2}P_{B2}$. In Equation 3.11, the numerator 
($P_o - P_e$) is the proportion of nonrandom consistent classification being detected, 
whereas the denominator ($1 - P_e$) is the maximum proportion of nonrandom consistent 
classification that may occur. Cohen's kappa indicates, then, what proportion of the 
maximum possible nonrandom consistent classifications is found with the data. 






Think About It 3.2 For a substance abuse screening test administered along 
with a DSM-IV-TR diagnostic process, let us assign specific values to the 
proportions in Table 3.1 (see Table 3.2): $p_{11} = 0.3$, $p_{12} = 0.2$, 
$p_{21} = 0.1$, and $p_{22} = 0.4$. These are nice even numbers, meaning that 30%, 
20%, 10%, and 40% of the cases (decisions) fell into each category, respectively. 
The marginal proportions are: $P_{A1} = 0.3 + 0.2 = 0.5$, $P_{A2} = 0.1 + 0.4 = 0.5$, 
$P_{B1} = 0.3 + 0.1 = 0.4$, and $P_{B2} = 0.2 + 0.4 = 0.6$. 



Table 3.2 Contingency table for mastery-nonmastery classifications for 
identifying individuals with substance abuse 

                                      Form B: DSM diagnosis 
                                      Diagnosed          Not diagnosed 
Form A:            Diagnosed          0.3                0.2                0.3 + 0.2 = 0.5 
Substance Abuse    Not diagnosed      0.1                0.4                0.1 + 0.4 = 0.5 
Test                                  0.3 + 0.1 = 0.4    0.2 + 0.4 = 0.6 





With these data, calculate the observed proportion of consistent 
classification $P_o$. You should have gotten $P_o = 0.3 + 0.4 = 0.7$ by using 
Equation 3.10. 

Next, calculate $\kappa$ using Equation 3.11. The proportion of consistent 
classifications that may occur by chance in this hypothetical example is: 
$P_e = (0.5)(0.4) + (0.5)(0.6) = 0.5$. Using Equation 3.11, the Cohen's kappa ratio 
is: $\kappa = (0.7 - 0.5)/(1 - 0.5) = 0.2/0.5 = 0.4$. 

Finally, interpret these results. For this example of using a substance 
abuse test, the initially obtained 70% of observed consistent classifications 
($P_o = 0.7$) is reduced to 40% consistent classifications after taking into 
account consistent classifications that may occur by chance. Because kappa 
provides "conservative" estimations of consistency, it is reasonable to report 
in this case that the classification consistency is between 0.40 and 0.70 (i.e., 
between $\kappa$ and $P_o$). (Note: For practical purposes, it is recommended to 
report both $P_o$ and Cohen's kappa, as the latter is very conservative, thus 
underestimating the actual rate of consistent classifications. Previous research 
[e.g., Chase, 1996; Subkoviak, 1988] provides some additional procedures for 
estimating classification consistency, including scenarios with a single test 
administration or prior to the initial application of the test.) 
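
The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration, not part of the original exercise; the function name `classification_consistency` and the nested-list layout of the table are assumptions made for the sketch.

```python
def classification_consistency(table):
    """Return (P_o, P_e, kappa) for a square table of joint proportions.

    table[i][j] is the proportion of persons placed in category i by
    Form A (rows) and in category j by Form B (columns).
    """
    k = len(table)
    row_marginals = [sum(row) for row in table]            # P_A1, P_A2, ...
    col_marginals = [sum(table[i][j] for i in range(k))    # P_B1, P_B2, ...
                     for j in range(k)]
    p_o = sum(table[i][i] for i in range(k))               # Equation 3.10
    # Chance agreement: sum of products of matching marginal proportions.
    p_e = sum(row_marginals[i] * col_marginals[i] for i in range(k))
    kappa = (p_o - p_e) / (1 - p_e)                        # Equation 3.11
    return p_o, p_e, kappa


# Proportions from Table 3.2 (rows: screening test; columns: DSM diagnosis).
table_3_2 = [[0.3, 0.2],
             [0.1, 0.4]]
p_o, p_e, kappa = classification_consistency(table_3_2)
print(round(p_o, 3), round(p_e, 3), round(kappa, 3))  # prints 0.7 0.5 0.4
```

Because the function sums over the diagonal and the matching marginals, the same sketch accepts tables with more than two categories.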






Interscorer and Interrater Reliability 



The chances of measurement error usually increase when the scores are based on sub- 
jective judgments of the person(s) doing the scoring. In general, the less objective 
the scoring procedures, the lower the interscorer reliability. Such situations occur, 
for example, with classroom assessment of essays or portfolios where the teacher is, 
in fact, the "judge" of performance. In another scenario, involving some projective 
tests of personality, the scorer (e.g., professional counselor, psychotherapist) should 
decide if the person's responses suggest normal functioning or some form of psy- 
chopathology. Subjective judgments of raters (experts, judges) are also used for clas- 
sification purposes (e.g., to determine a "minimum level of competency" in pass/fail 
decisions). In all scenarios of rater-based scoring, it is important to estimate the de- 
gree to which the scores are unduly affected by the subjective judgments of the raters. 
Such estimation is provided by coefficients of interrater reliability (also called coef- 
ficients of interrater agreement). 

Depending on the context of measurement, there are different methods of 
estimating interrater reliability. Frequently used classical measures of interrater 
reliability are the Pearson correlation coefficient, the observed proportion of 
consistent classification ($P_o$), and Cohen's kappa coefficient. The Pearson r is by 
far the most commonly used measure of interscorer reliability when scores are on an 
interval scale (as most test scores, such as standard scores, are) or a ratio scale. 
Otherwise, the two indices of classification ($P_o$ and Cohen's kappa) can be used as 
estimates of interrater reliability when two raters (instead of two test forms) 
classify persons as mastery or nonmastery. When more than two categories are used by 
two raters to classify persons (or their products), one can still use Equation 3.11 
for Cohen's kappa, but $P_o$ and $P_e$ should be calculated with a contingency table 
for the respective number of categories. For example, with three classification 
categories (e.g., low, medium, and high performance), $P_o$ and $P_e$ are calculated 
as follows: $P_o = p_{11} + p_{22} + p_{33}$ and 
$P_e = P_{A1}P_{B1} + P_{A2}P_{B2} + P_{A3}P_{B3}$. 
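
When raters record category labels rather than a ready-made table of proportions, the same quantities can be tallied directly from the two lists of ratings. The sketch below assumes a hypothetical helper, `kappa_from_ratings`, and ten invented portfolio ratings; neither comes from the text.

```python
from collections import Counter


def kappa_from_ratings(rater_a, rater_b):
    """Return (P_o, P_e, kappa) for two raters' category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of persons")
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: share of persons given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of ten portfolios (invented for illustration only).
rater_a = ["low", "low", "medium", "medium", "medium",
           "high", "high", "high", "high", "low"]
rater_b = ["low", "medium", "medium", "medium", "high",
           "high", "high", "high", "medium", "low"]
p_o, p_e, kappa = kappa_from_ratings(rater_a, rater_b)
print(f"P_o = {p_o:.2f}, P_e = {p_e:.2f}, kappa = {kappa:.2f}")
```

With these invented ratings the raters agree on 7 of 10 portfolios, and kappa is noticeably lower than the raw agreement once chance agreement is removed.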

Interrater reliability is also sometimes used to refer to two independent observers 
who rate another individual, such as when sets of mothers and teachers rate children 
on a behavior rating scale and the results are correlated. This type of relationship is 
better described as a type of criterion-related validity (see Chapter 4). In this in- 
stance, one set of scores (e.g., teachers) serves as the criterion for the other set of 
scores (e.g., mothers). If two raters independently assign scores (say, to portfolios) of 
students, then the Pearson correlation coefficient for the two sets of scores can be 
used as an estimate of interrater agreement. The higher the correlation coefficient, 
the lower the error variance due to scorer differences, and the higher the interrater 
agreement. 

When scoring of alternate forms of a measurement instrument is done by two 
or more raters, one can check for measurement error due to subjectivity of scoring 
by administering the alternate forms (a) on the same day for one group of subjects 
and (b) with a 2-week delay for another group of subjects. If the correlations between 
raters are high for both groups, there is probably little error due to subjectivity of 
scoring. If, however, the correlation over the 2-week time interval and the same-day 
correlation are both consistently low across different raters, it is difficult to deter- 
mine the major source of unreliability (subjectivity of scoring or, say, differences in 
content for the two alternate forms of the instrument). The interrater reliability can 
be improved by training the raters in the use of the instrument and providing clear 
guidelines for scoring (e.g., a more specific rubric or more specific criteria). 

Overall, researchers and test users can reduce measurement error and improve 
reliability by (1) writing items clearly, (2) providing complete and understandable 
test instructions, (3) administering the instrument under prescribed conditions, (4) 
reducing subjectivity in scoring, (5) training raters and providing them with clear 
scoring instructions, (6) using heterogeneous respondent samples to increase the 
variance of observed scores, and (7) increasing the length of the test by adding items 
that are (ideally) parallel to those that are already in the test. The general principle 
behind improving reliability is to maximize the variance of relevant individual differ- 
ences and minimize the error variance. 



THE IMPORTANCE OF RELIABILITY 
Reliability in Validation 



ATTENUATION 



The most important characteristic of any measurement is its validity — a concept re- 
ferring to the meaningfulness, appropriateness, and usefulness of the inferences 
made from the measurement scores. Validation is an ongoing process of gathering 
evidence to support such inferences. It is essential to understand that it is the infer- 
ences made from measurement scores that are being validated, not the instrument 
(e.g., test, survey, or questionnaire) being used to obtain such scores. 

The score reliability is an important (necessary, but not sufficient) condition in 
the validation process. For example, as noted earlier in this chapter, the reliability of 
scores predetermines a "ceiling" for their criterion-related validity, but how closely 
this ceiling will be approached depends on other factors as well. The validation of 
measurements in counseling usually deals with constructs (e.g., proficiency, motiva- 
tion, anxiety, empathy, and beliefs) and involves different types of evidence. The 
quality of such evidence depends, among other things, on the reliability of the data 
collected from different sources. The reliability also affects the results from correla- 
tional analyses and other statistical procedures used in the validation process. The 
term attenuation is used to indicate the reduction of the magnitude of such results 
due to unreliability of scores. 



If the reliability of the scores on two variables X and Y is not perfect (i.e., 
$r_{XX} \neq 1$ and/or $r_{YY} \neq 1$), the observed correlation between X and Y, 
$r_{XY}$, is attenuated (i.e., lower than the "actual" correlation between the 
person's true scores on the two variables, $T_X$ and $T_Y$). One can estimate the 
correlation between the true scores $T_X$ and $T_Y$ by using Equation 3.12, referred 
to as the correction for attenuation formula (Spearman, 1904): 

$$r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}}$$    (3.12) 



Think About It 3.3 The correlation between two variables, Self-esteem 
(X) and Persistence decisions (Y), in a study on academic persistence for 
college undergraduates was found to be $r_{XY} = 0.35$. Professional counselors 
involved in this study found also that the reliability of the two measures, X and 
Y, for the study data was relatively low: $r_{XX} = 0.68$ and $r_{YY} = 0.71$, 
respectively. To estimate what the correlation between the two variables would be if 
their measurements were perfectly reliable, the professional counselors used 
Equation 3.12, thus obtaining a much higher correlation (0.50) between the 
students' true scores (i.e., no error involved) on Self-esteem and Persistence 
decisions: 

$$r_{T_X T_Y} = \frac{0.35}{\sqrt{(0.68)(0.71)}} = 0.50$$ 



Importantly, because perfect reliability is generally not obtainable, one 
cannot observe the corrected-for-attenuation correlation values. Such values 
indicate the highest correlation coefficients for perfectly reliable scores. 
Important conditions for using Equation 3.12 are (1) the reliability estimates, 
$r_{XX}$ and $r_{YY}$, should be accurate and (2) the components in the 
right-hand side of Equation 3.12 ($r_{XY}$, $r_{XX}$, and $r_{YY}$) should be 
affected by the same measurement error, for example, if $r_{XY}$ is estimated when X 
and Y are measured during one testing session and their internal consistency 
estimates are used for $r_{XX}$ and $r_{YY}$ in Equation 3.12. However, if $r_{XX}$ 
and $r_{YY}$ are alternate form reliabilities, error of measurement involved in their 
estimation (due to time lapse and change of test form) would not be involved in the 
estimation of the correlation between X and Y ($r_{XY}$). Then Equation 3.12 will 
produce an overestimated true score correlation between X and Y ($r_{T_X T_Y}$). 
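
The correction in Equation 3.12 is simple enough to check by hand, but a short sketch makes the dependence on the two reliabilities explicit. The function name `correct_for_attenuation` is an assumption introduced here for illustration.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true-score correlation from Equation 3.12."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Values from Think About It 3.3.
print(round(correct_for_attenuation(0.35, 0.68, 0.71), 2))  # prints 0.5

# With perfectly reliable measures the correction leaves r_xy unchanged.
print(correct_for_attenuation(0.35, 1.0, 1.0))  # prints 0.35
```

The lower the two reliabilities, the larger the upward correction, which is why inaccurate reliability estimates can distort the corrected value.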



Attenuation effects due to unreliability of data occur also in hypothesis testing 
with statistical methods. It should be noted, for example, that although the Pearson 
correlation coefficient between an independent variable X and a dependent variable 
(criterion) Y is attenuated by error of measurement, the regression coefficient (slope) 
in the regression of Y on X is attenuated by measurement errors in X but not in Y 
(Bohrnstedt, 1983). Therefore, particular attention should be paid to the reliability 
of the pretest scores when they are used as a covariate (X), say, in the comparison of 
treatment groups, using the statistical method analysis of covariance (ANCOVA). 
The power of statistical tests is also attenuated by unreliability of the measurement 
data (recall that the power of a statistical test of a null hypothesis is the 
probability that the test will lead to the rejection of the null hypothesis when it is 
in fact false). Specifically, unreliability shrinks the observed effect size (e.g., 
produced by a specific treatment), thus reducing the power of the statistical test 
(for more details, see, e.g., Cohen, 1988; Maxwell, 1980; Zimmerman & Williams, 1982). 
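
The asymmetry between errors in X and errors in Y can be seen in a small simulation. The sketch below is illustrative only: the sample size, the true slope of 2, and the error variances are arbitrary assumptions, and `slope` is a helper written for this example.

```python
import random


def slope(x, y):
    """Least-squares slope of the regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(0)                               # fixed seed for repeatability
n = 20000
true_x = [rng.gauss(0, 1) for _ in range(n)]
true_y = [2.0 * t for t in true_x]                   # true slope is 2

noisy_x = [t + rng.gauss(0, 1) for t in true_x]      # reliability of X about 0.5
noisy_y = [t + rng.gauss(0, 1) for t in true_y]      # error added to Y only

slope_error_in_y = slope(true_x, noisy_y)            # stays close to 2
slope_error_in_x = slope(noisy_x, true_y)            # shrinks toward 2 * 0.5 = 1
print(round(slope_error_in_y, 2), round(slope_error_in_x, 2))
```

In this setup the slope is roughly halved when X has a reliability near 0.5, while random error in Y leaves the slope essentially unchanged and only adds scatter around the regression line.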



RELIABILITY OF COMPOSITE SCORES 



In many situations, scores from two or more scales are combined into composite scores 
to measure and interpret a more general dimension (trait, ability, or proficiency) re- 
lated to these scales. Composite scores are often used with test batteries for achieve- 
ment, aptitude, intelligence, depression, or eating disorders, as well as with local 
school measurements such as performance and portfolio assessments. One frequently 
reported composite score, for example, is the sum of verbal and quantitative scores 
of the Graduate Record Examination (GRE). Another example is the WISC-IV's 10 
core subtests, which yield four index scores (i.e., Verbal Comprehension Index 
[VCI], Perceptual Reasoning Index [PRI], Working Memory Index [WMI] and 
Processing Speed Index [PSI]), which are subsequently combined to yield the full- 
scale IQ (FSIQ). The scores on nine scales of the Symptom Checklist-90-Revised 
(SCL-90-R) (Derogatis, 1990) are combined into three "global" (composite) scores 
in measuring current psychological symptom status. A Total Aggressive Expression 
score with the Driving Anger Expression Inventory (DAX) (Deffenbacher, Lynch, 
Oetting, & Swaim, 2002) is also obtained as a sum of three scales: Verbal Aggressive 
Expression, Personal Physical Aggressive Expression, and Using the Vehicle to 
Express Anger. Thus, composite scores are frequently encountered in psychological 
and educational testing. 

Although the composite score may be simply the sum of several scale scores, its 
reliability is usually not just the mean of the reliabilities for the scales being com- 
bined. The issue of reliability estimation for composite scores is addressed in this sec- 
tion when the composite score is (a) the sum of two scale scores (e.g., GREs, SATs); 
(b) the difference score (e.g., gain score for pretest to posttest measurements or the 
difference between two independent scorers of a single set of portfolios); and (c) the 
sum of three or more scale scores (e.g., WISC-IV, SCL-90-R). 



Reliability of Sum of Scores 



Let the composite score Y be the sum of two scale scores, $X_1$ and $X_2$: 
$Y = X_1 + X_2$. With the GRE scoring, for example, the composite score is the sum of 
the verbal and quantitative scores. The reliability of the sum of two scores, 
$r_{YY}$, can be estimated as 

$$r_{YY} = 1 - \frac{\sigma_1^2 (1 - r_{11}) + \sigma_2^2 (1 - r_{22})}{\sigma_Y^2}$$    (3.13) 

where $\sigma_1^2$ is the variance of $X_1$, that is, $\sigma_1^2 = VAR(X_1)$; 
$\sigma_2^2$ is the variance of $X_2$, that is, $\sigma_2^2 = VAR(X_2)$; $\sigma_Y^2$ 
is the variance of the composite score Y, that is, $\sigma_Y^2 = VAR(Y)$; $r_{11}$ is 
the reliability of $X_1$; and $r_{22}$ is the reliability of $X_2$. 






Think About It 3.4 The estimation of the reliability for a composite 
score, $Y = X_1 + X_2$, is illustrated in this example with data from a study on 
attitudes and behaviors of students related to their sexual activities. 
Specifically, $X_1$ is the score on a scale labeled "Love as Justification for Sexual 
Involvement," and $X_2$ is the score on a scale labeled "Sex for Approbation." 
With the notations adopted in Equation 3.13, the following results were 
obtained from the study data for (a) the variances of $X_1$, $X_2$, and Y: 
$\sigma_1^2 = 13.750$, $\sigma_2^2 = 10.433$, $\sigma_Y^2 = 38.592$; and (b) the 
reliabilities of $X_1$ and $X_2$: $r_{11} = 0.8334$, $r_{22} = 0.8217$. 

Substituting these values into Equation 3.13, we obtain: 

$$r_{YY} = 1 - \frac{13.750(1 - 0.8334) + 10.433(1 - 0.8217)}{38.592} = 0.892.$$ 



Thus, the reliability estimate of the composite score Y (0.892) in this 
example is higher than the reliability estimates of its components, $X_1$ (0.8334) 
and $X_2$ (0.8217). While this frequently occurs, it is not always the case. In 
reality, the larger the difference between $r_{11}$ and $r_{22}$, and the lower the 
correlation between the two components ($r_{12}$), the less likely that $r_{YY}$ will 
exceed each individual component's reliability. 

Although not explicitly present, the correlation between $X_1$ and $X_2$, 
denoted $r_{12}$, affects the reliability of the composite score. When $X_1$ and 
$X_2$ do not correlate ($r_{12} = 0$) and have equal variances (as standardized 
scores do), the reliability of their sum ($Y = X_1 + X_2$) is simply the average of 
their reliabilities: $r_{YY} = (r_{11} + r_{22})/2$. 
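
Equation 3.13 can be checked against the example with a short sketch. The function names below are assumptions made for illustration, and the composite variance of 38.592 is the value used in the worked computation. The second helper recovers the correlation between the two scales that is implied by the three variances, which is the quantity the text describes as implicitly present.

```python
import math


def sum_score_reliability(var_1, var_2, var_sum, r_11, r_22):
    """Reliability of Y = X1 + X2 from Equation 3.13."""
    error_variance = var_1 * (1 - r_11) + var_2 * (1 - r_22)
    return 1 - error_variance / var_sum


def implied_correlation(var_1, var_2, var_sum):
    """Correlation r_12 implied by Var(Y) = Var(X1) + Var(X2) + 2 Cov(X1, X2)."""
    covariance = (var_sum - var_1 - var_2) / 2
    return covariance / math.sqrt(var_1 * var_2)


# Values from Think About It 3.4.
r_yy = sum_score_reliability(13.750, 10.433, 38.592, 0.8334, 0.8217)
print(round(r_yy, 3))  # prints 0.892
print(round(implied_correlation(13.750, 10.433, 38.592), 2))  # prints 0.6
```

The implied correlation of about 0.60 between the two scales is what pushes the composite variance, and therefore the composite reliability, above what uncorrelated scales would give.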



In many cases, the scores that are combined into a composite score come from 
scales with different units of measurement (e.g., 3-point and 5-point survey scales). 
Therefore, to present the measurements on a common scale (and for some technical 
reasons), the raw scores are often converted into standard scores (z-scores) before 
being summed (this is done, for example, with the raw scores of the primary psycho- 
logical symptoms measured with the self-report symptom inventory SCL-90-R). For 
the special case of standard (z-) scores, Equation 3.13 is converted into a simpler 
form (Equation 3.14): 



$$r_{YY} = 1 - \frac{2 - (r_{11} + r_{22})}{\sigma_{Y_z}^2}$$    (3.14) 

where $\sigma_{Y_z}^2$ is the variance of the sum of the z-scores for $X_1$ and $X_2$ 
(i.e., $Y_z = z_1 + z_2$), $r_{11}$ is the reliability of $X_1$, and $r_{22}$ is the 
reliability of $X_2$. Assume that $\sigma_{Y_z}^2 = 3.203$ and that $r_{11} = 0.8334$ 
and $r_{22} = 0.8217$. With this, using Equation 3.14, we obtain the value for the 
reliability of the composite score $Y = X_1 + X_2$ (or, equivalently, for 
$Y_z = z_1 + z_2$): 

$$r_{YY} = 1 - \frac{2 - (0.8334 + 0.8217)}{3.203} = 0.892.$$ 






Note that Equation 3.14 follows directly from Equation 3.13, taking into 
account that the variance of the standard (z-) scores for any variable is 1 and, thus, 
$\sigma^2(z_1) + \sigma^2(z_2) = 2$. 

Equations 3.13 and 3.14 can be readily extended for cases where the composite 
score is a sum of more than two scale scores (e.g., Nunnally & Bernstein, 1994). 
For the sum of three scores, for example, the reliability of the composite score 
$Y = X_1 + X_2 + X_3$ can be estimated by extending Equation 3.14 to form 
Equation 3.15 as follows: 

$$r_{YY} = 1 - \frac{3 - (r_{11} + r_{22} + r_{33})}{\sigma_{Y_z}^2}$$    (3.15) 

where $\sigma_{Y_z}^2$ is the variance of the sum of the standard (z-) scores for 
$X_1$, $X_2$, and $X_3$, that is, $Y_z = z_1 + z_2 + z_3$ ($r_{11}$, $r_{22}$, and 
$r_{33}$ are the reliabilities for $X_1$, $X_2$, and $X_3$, respectively). 
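
The pattern behind Equations 3.14 and 3.15 extends to any number of standardized components: the numerator is the number of components minus the sum of their reliabilities. A minimal sketch follows, with a hypothetical function name and, for the three-component call, invented input values that do not come from the text.

```python
def standardized_sum_reliability(reliabilities, var_sum_z):
    """Reliability of a sum of k standardized scores.

    Covers Equation 3.14 (k = 2) and Equation 3.15 (k = 3): each z-score
    has variance 1, so the error variance of the sum is k minus the sum
    of the component reliabilities.
    """
    k = len(reliabilities)
    return 1 - (k - sum(reliabilities)) / var_sum_z


# Two components: values from the worked example for Equation 3.14.
print(round(standardized_sum_reliability([0.8334, 0.8217], 3.203), 3))  # prints 0.892

# Three components: hypothetical reliabilities and composite variance,
# invented only to show the call for Equation 3.15.
print(round(standardized_sum_reliability([0.80, 0.75, 0.85], 5.4), 3))  # prints 0.889
```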



Reliability of Difference Scores 



The difference between two observed scores for the same person, called a difference 
score, is widely used in behavioral research primarily (a) to measure the person's 
growth across time points and (b) to compare the person's scores on academic, psy- 
chological, or personality variables. For example, measurement of change using the 
person's difference (or gain) score from pretest to posttest is used to assess the effect 
of specific educational programs, counseling treatments, and rehabilitation services 
or allied health interventions, all important facets of outcomes research in the men- 
tal health field. Clearly, the quality of the results and the validity of interpretations 
in studies on change and profile analysis depend, among other things, on the relia- 
bility of difference scores. 



Think About It 3.5 The data in this example also come from the study on 
attitudes and behaviors of students related to their sexual activities. However, 
instead of summing the scores on two scales, the composite score is now the 
difference (gain) from pretreatment to posttreatment measurements on a 
scale labeled "Self-affirmation"; that is, $Y = X_2 - X_1$, where $X_1$ is the 
pretreatment score and $X_2$ the posttreatment score on this scale. With the study 
data, the variance of the difference $Y_z = z_2 - z_1$ (where $z_1$ and $z_2$ are the 
standard score values for $X_1$ and $X_2$) was found to be $\sigma_{Y_z}^2 = 0.786$. 

The reliability coefficients (alpha coefficients) for $X_1$ and $X_2$ were 
$r_{11} = 0.8282$ and $r_{22} = 0.8374$, respectively. Using Equation 3.14, the 
reliability of the difference scores is 

$$r_{YY} = 1 - \frac{2 - (0.8282 + 0.8374)}{0.786} = 0.575$$ 



Evidently, the reliability of the difference score (0.575) is smaller than 
the reliability of the scores entering the difference (0.8282 and 0.8374). As 
noted earlier, the reliability of the difference score, $r_{YY}$, is (implicitly) 
influenced by the correlation between $X_1$ and $X_2$ (in this case, 
$r_{12} = 0.606$), because this correlation affects the value of $\sigma_{Y_z}^2$ in 
Equation 3.14. 
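
The role of the correlation can be made explicit. For standardized scores the variance of the difference is $2 - 2r_{12}$, so the reliability of the difference can be written in terms of $r_{12}$ directly. The sketch below uses a hypothetical function name; with the rounded correlation of 0.606 it gives about 0.576, which differs slightly from the 0.575 obtained from the reported variance of 0.786 because of rounding in the reported values.

```python
def difference_score_reliability(r_11, r_22, r_12):
    """Reliability of a difference of two standardized scores.

    Var(z2 - z1) = 2 - 2 * r_12, so Equation 3.14 becomes
    1 - (2 - (r_11 + r_22)) / (2 - 2 * r_12).
    """
    var_diff_z = 2 - 2 * r_12
    return 1 - (2 - (r_11 + r_22)) / var_diff_z


# Reliabilities and correlation reported in Think About It 3.5.
print(round(difference_score_reliability(0.8282, 0.8374, 0.606), 3))  # prints 0.576

# Same reliabilities with a lower and a higher pre-post correlation
# (both correlations are hypothetical): the more the two scores
# correlate, the less reliable their difference.
for r_12 in (0.2, 0.8):
    print(r_12, round(difference_score_reliability(0.8282, 0.8374, r_12), 3))
```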






The use of difference (gain) scores in measurement of change has been 
criticized because of the (generally false) assertion that the difference between 
scores is less reliable than the scores themselves (e.g., Cronbach & Furby, 
1970; Linn & Slinde, 1977; Lord, 1956). This assertion is true, however, if 
the pretest scores and the posttest scores have equal variances and equal 
reliability. When this is not the case, which may happen in many measurement 
situations, the reliability of the gain score can be reasonably high (e.g., 
Overall & Woodward, 1975; Zimmerman & Williams, 1982). The relatively low 
reliability of gain scores does not preclude valid testing of the null hypothesis of 
zero mean gain score in a population of examinees, but it is not appropriate 
to correlate the gain score with other variables for these examinees. An 
important practical implication is that, without ignoring the caution urged by 
some authors, researchers should not always discard gain scores and should be 
aware of when gain scores are useful. 



Reliability of Weighted Sums 



When different components are of varying importance, but need to be combined 
into a composite score, the components must first be "weighted" before being 
combined. Let the scores from two tests, $X_1$ and $X_2$, have different "weights" 
($w_1$ and $w_2$, respectively) in a composite score, $Y = w_1 X_1 + w_2 X_2$. To 
estimate the reliability of the composite score, Y, given the reliabilities of $X_1$ 
and $X_2$, one can (for simplicity) use the weighted composite score, $Y_z$, of the 
standardized variables $Z_1$ and $Z_2$, which are obtained by transforming the raw 
scores of $X_1$ and $X_2$ into z-scores. That is, 

$$Y_z = w_1 Z_1 + w_2 Z_2.$$ 

With this, the reliability of the composite score, Y (or $Y_z$), is given by 
Equation 3.16: 

$$r_{YY} = 1 - \frac{(1 - r_{11})w_1^2 + (1 - r_{22})w_2^2}{\sigma_{Y_z}^2}$$    (3.16) 

where $r_{YY}$ is the reliability of the composite score Y (or $Y_z$), $r_{11}$ is 
the reliability of $X_1$, $r_{22}$ is the reliability of $X_2$, and 
$\sigma_{Y_z}^2$ is the variance of the composite score $Y_z$ (the weighted sum of 
$Z_1$ and $Z_2$). 



Think About It 3.6 The examination score of counseling students in a 
lifespan development course is obtained as a composite score of midterm and 
final examinations, with 40% importance assigned to the midterm and 60% 
importance to the final examination. The task is to estimate the reliability of 
the composite score. 

The reliability estimates (Cronbach's alpha coefficients) for the scores on 
the first test, $X_1$ (midterm), and the second test, $X_2$ (final), are 
$r_{11} = 0.72$ and $r_{22} = 0.80$, respectively. Given that the weight for $X_1$ is 
$w_1 = 0.4$ (40% importance) and the weight for $X_2$ is $w_2 = 0.6$ (60% 
importance), the composite score is: $Y = (0.4)X_1 + (0.6)X_2$. After transforming 
the scores on $X_1$ and $X_2$ into z-scores to obtain the standardized variables 
$Z_1$ and $Z_2$, respectively, the variance of $Y_z = (0.4)Z_1 + (0.6)Z_2$ is found 
to be $\sigma_{Y_z}^2 = 1.27$. Using Equation 3.16, the reliability of the composite 
score Y is then 

$$r_{YY} = 1 - \frac{(1 - 0.72)(0.4)^2 + (1 - 0.80)(0.6)^2}{1.27} = 0.908.$$ 



Equation 3.16 can be easily extended to estimate the reliability of a 
weighted sum of the scores on more than two tests. In the case of three tests, 
for example, the reliability of the composite score 
$Y = w_1 X_1 + w_2 X_2 + w_3 X_3$ can be obtained by extending Equation 3.16 to 
Equation 3.17: 

$$r_{YY} = 1 - \frac{(1 - r_{11})w_1^2 + (1 - r_{22})w_2^2 + (1 - r_{33})w_3^2}{\sigma_{Y_z}^2}$$    (3.17) 

where $\sigma_{Y_z}^2$ is the variance of $Y_z = w_1 Z_1 + w_2 Z_2 + w_3 Z_3$. 
Equations 3.16 and 3.17 (as well as their extensions for more than three tests) 
apply equally well when some of the weights are negative numbers. 
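
Equations 3.16 and 3.17 share one pattern, which the sketch below expresses for any number of weighted standardized components. The function names are assumptions for illustration. The composite variance of 1.27 is taken as printed in Think About It 3.6 to reproduce its arithmetic; note that for two standardized components the composite variance equals $w_1^2 + w_2^2 + 2 w_1 w_2 r_{12}$, which cannot exceed 1 with weights of 0.4 and 0.6, so in practice the variance should be computed from the data (the second helper shows this for two components).

```python
def weighted_sum_reliability(weights, reliabilities, var_composite_z):
    """Reliability of a weighted sum of standardized scores (Equations 3.16, 3.17)."""
    if len(weights) != len(reliabilities):
        raise ValueError("one weight is needed per component")
    error_variance = sum(w ** 2 * (1 - r) for w, r in zip(weights, reliabilities))
    return 1 - error_variance / var_composite_z


def composite_variance_two(w_1, w_2, r_12):
    """Variance of w_1 * Z1 + w_2 * Z2 for standardized Z1 and Z2."""
    return w_1 ** 2 + w_2 ** 2 + 2 * w_1 * w_2 * r_12


# Values from Think About It 3.6, with the composite variance as printed there.
print(round(weighted_sum_reliability([0.4, 0.6], [0.72, 0.80], 1.27), 3))  # prints 0.908

# Same weights and reliabilities with a hypothetical correlation of 0.5
# between midterm and final, using the variance implied by that correlation.
var_z = composite_variance_two(0.4, 0.6, 0.5)
print(round(weighted_sum_reliability([0.4, 0.6], [0.72, 0.80], var_z), 3))
```

Because the weights enter only as squares in the numerator, the same function handles negative weights, as in a weighted difference of two scores.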



SUMMARY/CONCLUSION 



This chapter has introduced the concept of reliability, types of reliability, different 
methods of estimating reliability, and principles in interpreting and comparing reli- 
ability coefficients. Generally, reliability of measurements (e.g., test scores and sur- 
vey ratings) indicates their accuracy and consistency under random variations in 
measurement conditions, such as a person's conditions (e.g., fatigue or mood) and/or 
external sources (e.g., noise, temperature, different raters, and different test forms). 

In classical test theory, the true score of a person is defined as the theoretical 
mean of the observed scores that this person may have under numerous independ- 
ent testings with the same test. A basic assumption is that the examinee's observed 
score is a sum of the person's true score and an error ($X = T + E$). Tests with equal 
true scores and equal error variances, for any population of examinees, are referred 
to as parallel tests. The reliability of test scores is equivalently defined as (a) the 
correlation between observed scores on parallel tests, (b) the ratio of true score 
variance to observed score variance for the same test, or (c) the squared correlation 
between observed and true scores. Standard error of measurement (SEM) is the standard deviation 
of the (assumed normal) distribution of the difference between examinees' observed 
scores and their true scores. 

Five types of classical reliability were discussed in this chapter: internal 
consistency, test-retest reliability, alternate form reliability, classification 
consistency, and interrater reliability. 

Internal consistency estimates of reliability are based on the average correlation 
among items within an instrument. If the instrument consists of different scales, 
internal consistency should be estimated for each scale. Widely used estimates of 
internal consistency are the split-half reliability coefficient and Cronbach's 
coefficient alpha (or its equivalent version, KR-20, for dichotomously scored items). 
It is always useful to report the internal consistency of test scores even when other 
types of reliability are of primary interest. With speed tests, however, it would be 
misleading to report estimates of internal consistency. 

Test-retest reliability indicates the extent to which persons consistently respond to 
the same test, inventory, or questionnaire administered on more than one occasion. 
It is estimated by the correlation between the observed scores of the same people 
taking the same test twice (coefficient of stability). The major problem with 
test-retest reliability estimates is the potential for carryover effects between the 
two test administrations (e.g., due to biological maturation, cognitive development, 
changes in information, experience, and/or moods). Thus, test-retest reliability 
estimates are most appropriate for measurements of traits that are stable across the 
time period between the two test administrations (e.g., personality or work values). 

Alternate form reliability relates to the consistency of scores on two alternate test 
forms administered to the same group of individuals. It is estimated by the 
correlation between observed scores on two alternate test forms, referred to also as 
coefficient of equivalence. Estimates of alternate form reliability are also subject 
to carryover effects. A recommended rule of thumb is to have a 2-week time period 
between administrations of alternate test forms. 

Criterion-referenced reliability shows the consistency with which decisions about 
mastery-nonmastery of a specific objective (criterion) are made, using either the 
same test administered on two occasions or alternate test forms. Widely used 
classical indices of classification consistency are the observed proportion of 
consistent classifications, $P_o$, and Cohen's kappa coefficient, which takes into 
account consistent classifications that may occur by chance. 

Interrater (or interscorer) reliability refers to the consistency (agreement) in 
subjective judgments of raters (experts, judges) used for classification purposes 
(e.g., to determine a "minimum level of competency" in pass-fail decisions) or scoring 
rubrics in alternative assessments (e.g., portfolios, projects, and products). 
Depending on the measurement case, frequently used estimates of interrater reliability 
are correlation coefficients, $P_o$, and Cohen's kappa coefficient (or kappa-like 
coefficients). 

Often the person's scores from two or more scales of some instruments are 
combined into composite scores to measure and interpret a more general dimension 
(trait or proficiency) related to these scales (e.g., achievement, intelligence, 
aptitude, depression). Although the composite score may be simply the sum of several 
scale scores, its reliability is usually not just the mean of the reliabilities for 
the scales being combined. In this chapter, the reliability for composite scores is 
addressed for cases when the composite score is a sum (or difference) of scale scores 
or a weighted sum of scores. 



KEY TERMS 

alternate form reliability 
confidence interval 
internal consistency 
interrater reliability 
interscorer reliability 
normal distribution 
observed score 
random error 
reliability 
speed test 
split-half reliability 
standard error of measurement 
systematic error 
test-retest reliability 
true score 



CHAPTER 4 

Validity 

by Alan Basham and Bradley T. Erford 



This chapter focuses on the concept of validity of scores in testing and 
assessment. It examines how reliability and validity are related and distinct, the 
different methods by which evidence for validity can be established, and key 
principles professional counselors should apply in determining whether a test is 
appropriate for use with a client or group. Methods for making accurate decisions 
using a single test or multiple tests are also discussed. 



VALIDITY DEFINED 



While reliability indicates the degree to which scores on an instrument are measured 
consistently, validity considers the degree to which test scores measure what the test 
claims to measure. In both cases, test developers attempt to amass evidence that in- 
dicates, either logically or through probability, that test scores are trustworthy. In re- 
liability, test scores are trustworthy to the degree that they reflect an accurate assess- 
ment of some trait or ability, minimizing randomly occurring testing error. Evidence 
for validity, however, is concerned with verifying exactly what the test is measuring. 
Test results can be trusted, not just because they can be measured consistently, but 
because they measured what they were supposed to measure. 

Suppose you and a group of friends had an opportunity to demonstrate your 
skills at an archery range. Supplied with a bow and several arrows, you each fired at 
targets the same distance away. Some people's arrows hit the target, others careened 
off nearby trees and rocks, and one person nearly skewered the instructor with a sin- 
gularly wild shot. Only you hit the bulls-eye five shots in a row. The instructor, 
thinking you might have been just lucky, gives you five more arrows, all of which
you calmly sink into the center of the target. Clearly, you are the most reliable archer
in the group, because you keep getting the same result over and over. That's reliability,
of course. However, if the amazed instructor asks you how you came to be so academically
gifted, you might be well advised to question the instructor's judgment.
Why? Because your demonstrated consistency at archery has little or nothing to do
with the concept of academic giftedness. Imagine a scholarship program that
awarded grants for tuition in counselor education based on consistency and proficiency
of archery scores. Your archery score may be consistent (and therefore reliable)
but is probably not a reasonable measure of academic potential. The meaning
of the consistent, repetitive bulls-eyes, then, has become a question of validity.

So, the validity of test scores is about two things: (1) what the test actually measures
and (2) how well the test scores measure it (Anastasi & Urbina, 1997). Some
common methods for establishing evidence for validity are described in the
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999). The
Standards identified three major types of evidence for validity: content-related,
criterion-related, and construct-related. While each of these types is distinct in its
approach to demonstrating score validity, it is important not to assume that they are
unrelated to each other. In fact, many test authors use more than one of these techniques
to support the validity of test scores. Much of this chapter is devoted to outlining
these techniques and providing examples. Although face validity is no longer
generally accepted as a legitimate form of validity assessment, a brief discussion of
this type of validity follows.



FACE VALIDITY

Face validity is derived from the obvious appearance of the measure itself and its test 
items. Items in instruments marked by face validity ask directly for information that 
is expected and wanted by the test user. Face validity is quite appropriate for survey 
instruments in which the person being queried is responding to questions such as 
"What is your age?" or "What is the highest level of education you completed?" 

A major problem with self-report tests with high face validity is that when the 
trait or behavior in question is one that many people will not want to reveal about 
themselves, the likelihood of a truthful (and therefore valid) answer is minimal. A 
well-known example of the problem with face validity is that of the Woodworth 
Personal Data Sheet (Woodworth, 1920). This first structured personality test was 
developed during World War I for use in screening applicants for the military. 
Designed to standardize the psychiatric interview, it was based on the incorrect as- 
sumption that the content of an item and people's truthful response to it could be 
taken at face value. The assessment device included questions such as "I wet the bed" 
and "I drink a quart of whiskey every day," to which the person was asked to respond 
yes or no. The false assumption that people would answer such questions truthfully 
and that they interpreted the questions the same as everyone else essentially made 
the test results untrustworthy (Kaplan & Saccuzzo, 2001). Because of these limitations,
the Standards (AERA et al., 1999) does not include face validity as a legitimate
type of validity in psychological assessment. However, the above information is noteworthy
because professional counselors may see the term in other documents. Even
so, for tests in certain domains (e.g., achievement, intelligence), face validity can add
credibility or acceptance to the assessment process.



CONTENT-RELATED VALIDITY 



Content-related validity is widely used in educational testing (Kaplan & Saccuzzo, 
2001) and in tests of aptitude or achievement. It is used in achievement tests to de- 
termine how well an individual has mastered a skill or the content of a course of 
study (Anastasi & Urbina, 1997). The main focus in content-related validity is on 
how the instrument was constructed and how the content of the test was determined 
(Whiston, 2005). The focus on content reflects the examiner's concern with how 
well the test items reflect the domain of the material being tested. The term domain 
refers to the total informational field from which the items are drawn. 

For example, a teacher of U.S. history could write an exam to assess students' 
knowledge of the Civil War. The domain of information from which test items 
would be drawn is composed of the dates, battles, important persons, sociopolitical 
and economic factors, and causes of the war itself. The test would have validity to the
extent that its content reflected all the important aspects of the domain of Civil War
knowledge. A test that asked only about specific battles but ignored persons, causes, 
and political outcomes would hardly yield valid test scores of one's comprehensive 
knowledge of the U.S. Civil War. 

Determining the content validity of a test requires a systematic evaluation of the 
test items to determine whether adequate coverage of a representative sample of the 
content domain was measured (Anastasi & Urbina, 1997). Obviously, the test can- 
not ask questions about all the information in the domain, but it should contain 
some items that assess knowledge of each of the domain's areas or categories. The domain
itself should be examined to make sure that all major aspects are covered by
the test, and the test should be constructed so that the number of items from each 
category within the domain is consistent with the size and importance of that cate- 
gory. Demonstrating how the test is constructed to represent the content of the do- 
main provides evidence of content-related validity. The following is another exam- 
ple illustrating the concept. 

Most professional counselors have taken a graduate course in counseling theo- 
ries. Imagine an exam (much like one you have probably come across yourself in your 
academic journey) that covered the counseling theories of Freud, Adler, Jung, Ellis, 
and Rogers. (Please note that the number of theorists is limited here for the sake of a 
manageable example). To create a test that assessed knowledge of these progenitors 
and their contributions to the field, the professor would first analyze the important 
content areas of the domain. Let's assume the professor divided the overall domain of 
each of these five therapeutic pioneers into five subcategories identifying the salient 
content of each, including theoretical underpinnings of the model, therapeutic tech- 
niques, history of the founder, differences between each model and the others, and 
important terms and concepts unique to the model. The professor's organized analy- 
sis of the domain would look something like that contained in Table 4.1. 



Table 4.1 Content analysis of important information regarding five counseling theorists

Freud                 Adler                 Jung                  Ellis                 Rogers

Theory                Theory                Theory                Theory                Theory
Techniques            Techniques            Techniques            Techniques            Techniques
History               History               History               History               History
Differences           Differences           Differences           Differences           Differences
Terms, i.e.,          Terms, i.e.,          Terms, i.e.,          Terms, i.e.,          Terms, i.e.,
Id, ego, superego     inferiority complex   archetypes, shadow    catastrophizing,      unconditional
                                                                  A-B-C-D-E             positive regard




The professor would then write items reflecting each of the 25 categories listed 
above and select items from each category. If the items of the test adequately assessed 
some knowledge of each area of the domain, the test would have content-related validity.
However, if the professor asked questions only about the terms of Jungian psychology,
the history of Sigmund Freud's life, and the techniques of Rational Emotive
Behavior Therapy (REBT) (Ellis), the test items probably would not be
valid measures of the content under study because the questions did not adequately 
reflect knowledge of the domain being considered. 
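
The coverage check described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the blueprint is the 25 cells of Table 4.1, and the two exams (`narrow_exam`, `broad_exam`) and the function name `coverage` are hypothetical, invented for the example. It ignores the weighting of items by the size and importance of each category.

```python
from itertools import product

# Blueprint from Table 4.1: 5 theorists x 5 content areas = 25 cells.
THEORISTS = ["Freud", "Adler", "Jung", "Ellis", "Rogers"]
AREAS = ["Theory", "Techniques", "History", "Differences", "Terms"]
BLUEPRINT = set(product(THEORISTS, AREAS))


def coverage(item_tags):
    """Return (share of blueprint cells covered, sorted list of uncovered cells).

    item_tags is a list of (theorist, area) pairs, one pair per test item.
    """
    covered = set(item_tags) & BLUEPRINT
    uncovered = sorted(BLUEPRINT - covered)
    return len(covered) / len(BLUEPRINT), uncovered


# Hypothetical exam that samples only three cells, like the narrow test in the text.
narrow_exam = [("Jung", "Terms"), ("Freud", "History"),
               ("Ellis", "Techniques"), ("Jung", "Terms")]
# Hypothetical exam with one item for every cell of the blueprint.
broad_exam = list(BLUEPRINT)

narrow_share, narrow_gaps = coverage(narrow_exam)
broad_share, broad_gaps = coverage(broad_exam)
print(f"Narrow exam: {narrow_share:.0%} of cells covered, {len(narrow_gaps)} untested")
print(f"Broad exam: {broad_share:.0%} of cells covered, {len(broad_gaps)} untested")
```

A tally like this supports, but does not replace, the expert judgment about whether the items in each cell are good ones.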



CRITERION-RELATED VALIDITY 



Criterion-related validity is derived from comparing scores on the test to scores on 
a selected criterion. What is a criterion? It is a person's performance score on activities
the test is designed to predict. Specifically, a sample of participants in the validation
study has two scores that may be correlated with each other. One is the person's
score on the test being studied, and the other is a score indicating the person's actual 
level of ability in the skill or behavior under question as measured by some criterion. 
The Scholastic Assessment Test (SAT) and Graduate Record Examination (GRE), for example,
are used to predict performance in college and graduate school, respectively. The
criterion measure for each of these tests is actual academic performance as measured 
by grade point average at some point later in the students' academic career. Similarly, 
the Armed Services Vocational Aptitude Battery (ASVAB) (USMEPCOM, 2005) is de- 
signed to identify the occupational specialties in which military personnel will be 
most skilled, given the proper level of training. Job performance in the military is 
the criterion measure for the ASVAB. 

Anastasi and Urbina (1997) delineate several sources of criterion scores: 

■ Academic achievement, such as school grades and achievement test scores. 

■ The amount of education a person has. 

■ Performance in specialized training, such as music, accounting, or flying airplanes. 

■ Job performance, including in business, industry, and the military. 

■ Psychiatric diagnosis, which is used especially in development of tests measuring 
personality and psychopathology. 



■ Ratings by job supervisors, teachers, and others in a position to evaluate the per- 
formance effectiveness of subordinates. 

■ Correlations with a previously available test, especially when the new test is a sim- 
pler form of the original test. 

There are two forms of criterion-related validity: predictive criterion-related
validity and concurrent criterion-related validity. The main difference between
the two is when the criterion measure is taken. In predictive criterion-related va- 
lidity, the test is administered first, and scores on the criterion measure are col- 
lected on the same sample of persons at a later date (i.e., some time in the future). 
In concurrent criterion-related validity, the scores on the test and criterion meas- 
ure are collected at the same point in time. Let's consider examples of each form 
of criterion-related validity. 

Suppose that a professional counselor has been asked by a local business owner, 
Ms. Schmidlapp, to help her make more accurate hiring decisions at her factory, the 
Schmidlapp Widget Company. Ms. Schmidlapp wants the professional counselor to 
construct a test that will enable her to select those job applicants who will be most ef- 
fective at widget assembly. The professional counselor develops a test believed to help 
her make the right choices and conducts the necessary studies to determine that, in 
fact, the test scores are quite reliable. However, the professional counselor does not 
yet know whether the test scores are valid measures of one's potential as a widget as- 
sembler. The professional counselor gives the next 100 job applicants the test, Ms. 
Schmidlapp hires them all on a three-month probationary status, and three months 
later each new employee is observed to identify the number of flawless widgets assem- 
bled in one week. Each employee's score on the test (predictor) is correlated with the 
employee's score on widget assembly proficiency (criterion). The direction and magnitude
of the correlation between predictor and criterion variables tell the professional
counselor the degree to which the test is associated with assembly skill. Because
the criterion measure was collected some time later than the predictor, this study 
measured the test's predictive criterion-related validity. 
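
As a rough sketch of the final step in this study, the correlation between predictor and criterion can be computed directly from the paired scores. The eight pairs below are invented for illustration (the study in the text would have 100), and the helper name `pearson_r` is an assumption of this example, not something from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Invented data for eight new hires.
test_scores = [35, 42, 48, 50, 55, 61, 66, 72]        # predictor, collected at hiring
widgets_per_week = [28, 36, 37, 41, 42, 49, 50, 57]   # criterion, three months later

validity_coefficient = pearson_r(test_scores, widgets_per_week)
print(f"Predictive validity coefficient r = {validity_coefficient:.2f}")
print(f"Coefficient of determination r^2 = {validity_coefficient ** 2:.2f}")
```

Real validity coefficients are usually far lower than the one these tidy made-up numbers produce.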

Of course, there are some problems with using this form of validity assessment. 
First, the employer has to hire all the applicants in the pool of 100, regardless of abil- 
ity or test scores, so that the predictor test accuracy will not be compromised by re- 
stricted range. If Ms. Schmidlapp hires only "qualified applicants," she will have cri- 
terion scores only on qualified applicants. How will she know if the test will identify 
unqualified applicants if the sample will contain no unqualified applicants? 
However, hiring everyone can create some major costs in terms of lost productivity 
and dissatisfied customers who receive faulty widgets. Thus, the delay between col- 
lection of predictor and criterion measures means that the problem the test was de- 
signed to resolve continues for that length of time. Second, conducting a time-de- 
layed study creates the risk of attrition, in which one may lose some of the original 
sample (and their criterion scores) because they quit the job, go on sick leave, or are 
rapidly promoted to management. 

Concurrent criterion-related validity solves some of these problems but creates oth- 
ers. In this scenario, the professional counselor creates the test and conducts reliabil- 
ity studies, just as explained above. Then the professional counselor administers the
test to all current employees who assemble widgets and assesses their level of pro-
ductivity at the same time. Finally, the professional counselor correlates the scores to 
determine the relationship between scores on the test and concurrent efficiency of 
widget assembly. As before, if high scores on the test are associated with high efficiency 
at widget assembly and low scores with low proficiency, the professional counselor has 
established criterion-related evidence for validity, and Ms. Schmidlapp will probably 
give the professional counselor a bonus. However, the major problem with the test 
scores is that they are likely afflicted with a restricted range in the sample of current 
employees. If the test is supposed to identify which job applicants have an aptitude for 
widget assembly and which do not, how do we know it will do so when the criterion 
measure is derived only from those who can assemble widgets, as evidenced by their 
employment? That is, Ms. Schmidlapp has probably already rid her employee pro- 
duction line of inefficient widget assemblers, some, no doubt, reassigned to manage- 
ment. The advantage of the concurrent method, though, is that there is no long delay 
in the construction of the test, with all its real-world adverse effects, and no risk of 
attrition. 
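
The effect of restricted range can be seen in a small simulation. Everything here is an assumption made for illustration: the applicant pool is generated at random, the numbers (means, spreads, the 0.7 slope) are invented, and "current employees" are simulated simply as the more productive half of the pool.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


random.seed(0)  # fixed seed so the illustration is repeatable

# Simulated applicant pool: weekly output depends partly on the test score
# and partly on unrelated noise.
pool = []
for _ in range(2000):
    test_score = random.gauss(50, 10)
    widgets = 5.0 + 0.7 * test_score + random.gauss(0, 8)
    pool.append((test_score, widgets))

# "Current employees": only the more productive half is still on the line.
median_output = sorted(w for _, w in pool)[len(pool) // 2]
employees = [(t, w) for t, w in pool if w >= median_output]

r_full = pearson_r([t for t, _ in pool], [w for _, w in pool])
r_restricted = pearson_r([t for t, _ in employees], [w for _, w in employees])
print(f"Correlation in the full applicant pool: {r_full:.2f}")
print(f"Correlation among current employees only: {r_restricted:.2f}")
```

The same test looks weaker in the restricted sample even though nothing about the test has changed, which is why a concurrent study on current employees tends to understate validity.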

Perhaps you can readily see how this same scenario would apply to the con- 
struction of aptitude and achievement tests as predictors of future performance. 
With the SAT, for example, one could give the test to a group of high school stu- 
dents and later correlate each student's score to the student's college grade point 
average. To be most accurate, though, all the high school students should be ad- 
mitted to college, preferably the same college. This presents obvious problems. One 
could also give the test to a group of current college students and compare their 
test scores to their college grade point averages. The problem, of course, is that of 
restricted range, again; only college students are in the sample, but the test is in- 
tended to be used with those who are still in high school. The above examples are 
of predictive and concurrent criterion-related validity, respectively. 

To determine with even greater certainty how valid test scores are and how ac- 
curately each predicts future behavior, one can develop a prediction equation repre- 
senting the relationship between the predictor and criterion measures and then cal- 
culate the standard error of estimate in one's predictions. 



Standard Error of Estimate 



A correlation coefficient represents the relationship between two variables (X and
Y). A correlation of 0.87 means the same thing, no matter what variables X and Y
are. Their relationship in this case is a positive one; as X increases, Y increases. High
scores on X are associated with high scores on Y; low scores on X are associated with
low scores on Y. Squaring the correlation coefficient produces the coefficient of determination
(r²), the amount of variability in X that is accounted for by the variability in Y.

Recall also that the relationship between X and Y can be represented by a regression
equation, in which Y' is the value of Y predicted from X:

Y' = a + bX (4.1)



This equation is the algebraic formula that indicates both the slope and intercept
of the line that is closest to all the data points in a scatter diagram or bivariate
data plot. The intercept (a) is the point at which the line crosses the vertical Y axis.
The intercept may also be defined as the value of Y when X = 0. The slope (b) is the
amount Y increases when X increases by 1.0. Like the correlation coefficient, the
slope and intercept are calculated using the scores in the X and Y distributions. Once
these statistical values are determined, the equation for the regression line can be
used to predict the value of Y when we have a value of X for a given person.

For example, consider a prediction equation derived from two variables, the predictor
(X) and criterion (Y): Y' = 5.0 + 0.7X. We can use this quantified relationship
between X and Y to predict a person's eventual performance on Y (Y') using their
score on our test, X. A person whose score on X = 30 would have a predicted value
of 26 on Y [Y' = 5.0 + (0.7 × 30) = 26].
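
For readers who want to see where the slope and intercept come from, here is a minimal sketch of a least-squares fit. The five score pairs are hypothetical and were chosen to fall exactly on the line used in the text, so the fit should recover an intercept of 5.0 and a slope of 0.7; the function names `fit_line` and `predict` are assumptions of this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y' = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


def predict(x, a, b):
    """Predicted criterion score Y' for a test score x."""
    return a + b * x


# Hypothetical predictor and criterion scores lying exactly on Y = 5.0 + 0.7X.
xs = [10, 20, 30, 40, 50]
ys = [12.0, 19.0, 26.0, 33.0, 40.0]

a, b = fit_line(xs, ys)
print(f"Intercept a = {a:.1f}, slope b = {b:.1f}")
print(f"Predicted Y for X = 30: {predict(30, a, b):.1f}")
```

With real data the points scatter around the line, and the size of that scatter is what the standard error of estimate summarizes.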

This process of prediction is widely used in education, business, and industry. 
Standards called cutoff scores are often set by those making decisions about hiring, 
promotion, and admission to educational and occupational training opportunities. 
Those whose predicted performance on the criterion variable is below the cutoff 
score are not likely to be selected, while those attaining the highest scores are. 
Because decisions affecting people's lives are made using their scores on predictor 
variables, it is imperative to have the most accurate tests we can and to know just 
how accurate a given test is. The method used to determine the accuracy of predic- 
tion is the standard error of estimate. 

The standard error of estimate (SE_est) is derived from examining the difference
between our predicted value of the criterion (Y') and the person's actual score on the
criterion (Y). This difference is known as prediction error or residual. Recall that all
test scores contain some degree of random error, and that a reliable test is one that
produces scores that are mostly truth with little error. However, there is no such
thing as a perfectly reliable test. Further, both our predictor and criterion measures
are imperfect, despite our best efforts. Knowing this, it is certain that, even with the
best of criterion and predictor measures, we are destined to be inaccurate to some
degree when estimating future scores using a prediction equation. Fortunately, SE_est
enables us to determine how accurate test scores are likely to be.

The easiest way to understand SE_est is to reflect on the concept of the standard
deviation of a sample of scores. The standard deviation is the average amount of dis- 
tance between a given score and the sample mean in a distribution of scores. Using 
the standard deviation, we can determine how far away from the mean the scores 
tend to be, and thus how accurately the mean represents the sample scores as a meas- 
ure of central tendency. A large standard deviation indicates that the scores are spread 
widely around the mean; a small standard deviation indicates less variability because 
the scores tend to be clustered around the mean. 

The standard error of estimate operates in a similar fashion, quantifying the av- 
erage distance between predicted scores and persons' actual scores on the criterion. 
A large SE_est indicates that we are typically not very accurate in our predictions of a
person's eventual performance on the criterion measure. This means that, however 
noble our intentions, our test is not a very good one, at least for this purpose.
However, if the SE_est is small, our predictions, though not perfect, on average are
coming close to the person's eventual performance on the criterion measure. The formulas
for the standard error of estimate are

\[ SE_{est} = \sqrt{\frac{\sum (Y - Y')^2}{N - 2}} \tag{4.2} \]

or

\[ SE_{est} = s_Y \sqrt{1 - r_{XY}^2} \tag{4.3} \]


In Equation 4.2, each person's predicted score (Y') is subtracted from the person's
criterion score (Y). This residual is then squared for each person, and all the
squared residuals are added up. This numerator is divided by a denominator, the
value of which is the number of persons in the sample minus two (N - 2). Take the
square root to obtain SE_est. Equation 4.3 multiplies the square root of 1 minus the
square of the validity coefficient (r_XY^2) times the standard deviation of the criterion
scores (s_Y) in the validity study.
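
The two formulas can be compared on a small data set. This is a sketch under assumptions: the eight score pairs are invented, the function name is hypothetical, and s_Y is computed with N - 1 in its denominator. Under that last assumption the two results are close but not identical; they differ by a factor of the square root of (N - 1)/(N - 2), which shrinks toward 1 as the sample grows.

```python
from math import sqrt


def standard_error_of_estimate(xs, ys):
    """Return (SE_est by Equation 4.2, SE_est by Equation 4.3) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Prediction equation Y' = a + bX fitted by least squares.
    b = sp_xy / ss_x
    a = mean_y - b * mean_x

    # Equation 4.2: square root of the summed squared residuals over N - 2.
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_eq_4_2 = sqrt(residual_ss / (n - 2))

    # Equation 4.3: s_Y times the square root of (1 - r squared).
    r = sp_xy / sqrt(ss_x * ss_y)
    s_y = sqrt(ss_y / (n - 1))  # assumption: sample SD with N - 1
    se_eq_4_3 = s_y * sqrt(1 - r ** 2)
    return se_eq_4_2, se_eq_4_3


# Invented predictor and criterion scores for eight people.
test_scores = [35, 42, 48, 50, 55, 61, 66, 72]
criterion_scores = [28, 36, 37, 41, 42, 49, 50, 57]

se_4_2, se_4_3 = standard_error_of_estimate(test_scores, criterion_scores)
print(f"SE_est from Equation 4.2: {se_4_2:.3f}")
print(f"SE_est from Equation 4.3: {se_4_3:.3f}")
```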

The SE_est can be used to identify the overall level of accuracy of predictions
by referring to a table of areas under the normal curve. For example, the area of
the normal curve that lies between z-scores of -1.96 and 1.96 is 0.95, or 95% of
the area. In a normal distribution of scores, 95% of the scores fall between these
points. Similarly, if errors of prediction are random and normally distributed, 95% of
criterion scores lie within 1.96 standard errors of estimate (1.96 × SE_est) of their
predicted value. Consider a distribution of predicted scores with an SE_est of 2.0. If a
person's predicted score (Y') on the criterion measure was 30, simple arithmetic would
indicate a 95% probability that the person's actual eventual criterion score would be
somewhere between 26.08 [30 - (1.96 × 2.0) = 30 - 3.92 = 26.08] and 33.92
[30 + (1.96 × 2.0) = 30 + 3.92 = 33.92]. Whether this is accurate enough is a judgment
call, with ethical ramifications, made by those using the test against cutoff scores.
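
The arithmetic above amounts to a one-line calculation, sketched here. The function name `prediction_band` is hypothetical, and the default multiplier of 1.96 assumes normally distributed prediction errors, as in the text.

```python
def prediction_band(predicted, se_est, z=1.96):
    """Return the (low, high) band of predicted score plus or minus z * SE_est."""
    margin = z * se_est
    return predicted - margin, predicted + margin


# Values from the example in the text: predicted score 30, SE_est 2.0.
low, high = prediction_band(30, 2.0)
print(f"95% band: {low:.2f} to {high:.2f}")
```

Passing a different `z` (for example, 1.0 for roughly 68%) gives narrower or wider bands for other levels of confidence.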

To conclude this section, let's return to the Schmidlapp Widget Company with
the prediction equation (Y' = 5.0 + 0.7X) and standard error of estimate (SE_est =
2.0). Ms. Schmidlapp has informed you that, on average, her employees must be
able to assemble 40 widgets per week to keep the company solvent. To be on the safe
side, Ms. Schmidlapp determines that no applicants should be hired whose predicted
score on the criterion (Y') is less than 40. What cutoff score on the test should the
professional counselor recommend? Substituting the available values into the prediction
equation, 40 = 5.0 + 0.7X, and using simple algebra procedures, it is determined
that X = 50. Thus, the professional counselor recommends that Ms. Schmidlapp hire
only those applicants who score 50 or higher on the test, understanding that some
will produce fewer than 40 widgets and some will produce more than 40. In fact,
Ms. Schmidlapp can be 95% certain that an applicant with a test score of 50 will
produce somewhere between 36.08 and 43.92 widgets each week. Of course, because
the test is imperfect (as all tests are), Ms. Schmidlapp will hire a few applicants
who will not perform adequately and will not hire others who could have made numerous
magnificent widgets. Also, if Ms. Schmidlapp needs to boost profits at some
point, she always can raise the present minimum acceptable score of 50 to a higher
score.
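
The cutoff calculation, and the consequence of raising the cutoff, can be sketched as follows. The function names are hypothetical, and the probability calculation rests on an assumption that goes slightly beyond the text: that prediction errors are normally distributed with a standard deviation equal to SE_est.

```python
from statistics import NormalDist

A, B = 5.0, 0.7   # intercept and slope of the prediction equation in the text
SE_EST = 2.0      # standard error of estimate in the text


def cutoff_score(required_output, a=A, b=B):
    """Solve required_output = a + b * X for the test score X."""
    if b == 0:
        raise ValueError("slope must be nonzero to solve for a cutoff")
    return (required_output - a) / b


def chance_below(required_output, test_score, a=A, b=B, se_est=SE_EST):
    """Probability that actual output falls below the requirement.

    Assumes prediction errors are normal with standard deviation se_est.
    """
    predicted = a + b * test_score
    return NormalDist(mu=predicted, sigma=se_est).cdf(required_output)


cutoff = cutoff_score(40)
print(f"Recommended cutoff: X = {cutoff:.0f}")
for score in (50, 55, 60):
    p = chance_below(40, score)
    print(f"Test score {score}: chance of fewer than 40 widgets = {p:.2f}")
```

Under these assumptions, an applicant scoring exactly at the cutoff is as likely to fall short of 40 widgets as to exceed it, which is one reason an employer might choose a higher minimum score.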




Think About It 4.1 As an example of how to calculate the Standard
Error of Estimate (SEE), assume that for the T score scale (M = 50 and
SD = 10) of the Conners' Adult ADHD Rating Scales (CAARS) Diagnostic
and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV)
Inattention subscale, the score reliability for a sample of clients is 0.91.
Using Equation 3.5 for the SEM and Equation 4.3 for the SEE, we obtain:

\[ SEM = 10\sqrt{1 - 0.91} = 3.00 \quad \text{and} \quad SEE = 10\sqrt{(0.91)(1 - 0.91)} = 2.86 \]
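
The two values in the box can be checked with a short calculation. The function names are hypothetical, and the SEE function follows the expression exactly as it is written in the box, with the score reliability appearing under the square root.

```python
from math import sqrt


def sem(sd, reliability):
    """Standard error of measurement: SD times the square root of (1 - reliability)."""
    return sd * sqrt(1 - reliability)


def see_as_in_box(sd, reliability):
    """SEE as computed in the box: SD times the square root of reliability * (1 - reliability)."""
    return sd * sqrt(reliability * (1 - reliability))


# T score scale (SD = 10) with score reliability 0.91, as in the box.
print(f"SEM = {sem(10, 0.91):.2f}")
print(f"SEE = {see_as_in_box(10, 0.91):.2f}")
```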



CONSTRUCT VALIDITY 



Evidence for construct validity is established by defining the construct being meas- 
ured and by gradually collecting information over time to demonstrate or confirm 
the meaning of what the test measures (Kaplan & Saccuzzo, 2001). Construct valid- 
ity is widely used in assessment of theoretically defined domains, such as personality 
traits, psychological disorders, and intelligence. In each case, the test author carefully 
defines the construct under consideration, then designs a test to measure it, and col- 
lects evidence supporting the validity of the test as a measure of the construct. The 
principal means by which construct validity is established include convergent evi- 
dence, discriminant evidence, factor analysis, meta-analysis, developmental changes, 
and distinct groups (Whiston, 2005). 

Convergent validity evidence is gathered by correlating the scores on a test with 
scores on other tests believed to measure the same or very similar constructs. High 
positive correlations are evidence of convergent validity, in that scores on the two 
tests converge on each other, pointing toward the same psychological characteristic. 
For example, both the Minnesota Multiphasic Personality Inventory— Second Edition 
(MMPI-2) (Butcher et al., 2001) and the California Psychological Inventory (CPI) 
(Gough & Bradley, 1996) have scales that measure the construct "Dominance." 
A high positive correlation between scores on the two scales would be convergent validity evidence. A strong negative
correlation with a scale that measures the same trait using a reversed scaling method 
or measures an opposite trait would also indicate convergence. For example, valid 
scores on a scale on dominance should be expected to correlate negatively with a 
scale that measures passivity. 

Discriminant validity evidence is derived by demonstrating that test scores are 
not highly correlated with measures of other, unrelated constructs. A personality 
scale that accurately measures self-esteem should not correlate highly with a measure 
of extraversion, though high levels of self-esteem may be associated with social par- 
ticipation in some people. Introverts with high self-esteem, however, will not be as 
likely to engage in social activity with people they do not know well. To be a distinct 
measure of self-esteem, scores on the instrument in question should not be impacted 
by the introversion or extraversion of the test taker, theoretically speaking. Low cor- 
relations between measures of these unrelated constructs provide evidence of dis- 
criminant validity. Discriminant and convergent techniques are especially important 
in the validation of personality tests (Anastasi & Urbina, 1997). 
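
The pattern of correlations that supports convergent and discriminant validity can be illustrated with a toy example. All four sets of scores below are invented, and the scale names are hypothetical labels rather than scores from any published instrument.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


# Invented scores for six people on four hypothetical scales.
new_dominance_scale = [1, 2, 3, 4, 5, 6]
established_dominance_scale = [2, 3, 5, 4, 6, 7]   # same construct
passivity_scale = [7, 6, 4, 5, 3, 2]               # opposite construct
unrelated_scale = [3, 5, 1, 1, 5, 3]               # unrelated construct

r_same = pearson_r(new_dominance_scale, established_dominance_scale)
r_opposite = pearson_r(new_dominance_scale, passivity_scale)
r_unrelated = pearson_r(new_dominance_scale, unrelated_scale)

print(f"Same construct (convergent): r = {r_same:.2f}")
print(f"Opposite construct (convergent, negative): r = {r_opposite:.2f}")
print(f"Unrelated construct (discriminant): r = {r_unrelated:.2f}")
```

A strong positive correlation with the same construct, a strong negative correlation with its opposite, and a correlation near zero with an unrelated construct together make a more persuasive case than any one of the three alone.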




Think About It 4.2 How is a combination of evidence of convergent 
and discriminant validity useful in determining the overall validity of test 
scores? 



Factor analysis conducts a complex statistical evaluation to determine the degree 
to which the items contained in two separate instruments tend to group together 
along factors that mathematically indicate similarity, and thus a common meaning. In 
addition, factor analysis can determine to what degree the subscales of two tests are 
similar to each other, as indicated by their lining up together on factorial vectors 
(Whiston, 2005). For example, subscales measuring dominance, sensitivity, or toler- 
ance should line up with similar scales on another test if the evaluated scores are valid. 

Meta-analysis considers the results of a number of validation studies, combining 
the results to identify an overall effect, if one exists. Synthesizing the results of numer- 
ous validity studies can demonstrate strong evidence for the validity of a given test. 

Developmental changes indicate support for the construct validity of a test 
when the test measures changes that are expected to occur over time. For example, 
we may be interested in measuring the thinking processes of children in light of 
Piaget's model of cognitive development. A valid test would discriminate between 
concrete operations and formal operations and would show increased levels of for- 
mal operations thought among young people as they moved from childhood to ado- 
lescence, as is expected developmentally. More generally, older children would be ex- 
pected to obtain higher raw scores on intelligence or achievement tests than younger 
children. Note that developmental age or grade changes are necessary but not suffi- 
cient conditions for establishing construct validity; that is, achievement test scores 
had better become higher as children get older, or the test developer has some real ex- 
plaining to do. 

Distinct groups can provide evidence of construct validity if their scores are dif- 
ferent in an expected direction from scores of people in other groups or the general 
population. If we had a test designed to measure leadership, we would expect a group 
of military officers to score higher, on average, than the general population. Because 
the identified distinct group is logically assumed to possess the characteristic in ques- 
tion, one expects them to score high on the test. The degree to which they do indi- 
cates the extent to which the test measures leadership. 

In conclusion, it is important to remember the crucial step of defining the 
construct carefully before attempting to demonstrate the validity of an assessment 
instrument. There are many tests that measure intelligence, self-esteem, depres- 
sion, and marital compatibility, to name just a few constructs. No two tests are 
necessarily measuring the same construct just because they use the same name for 
that construct. Referring back to an earlier example, both the MMPI-2 and CPI
have scales measuring the personality trait of "Dominance" (Duckworth &
Anderson, 1995; Gough & Bradley, 1996). The CPI's scale defines a dominant
person as "being strong in face-to-face situations and as being able to influence 
others, to gain their automatic respect, and, if necessary, to control them" (Gough 



Validity 1 33 

& Bradley, 1996, p. 76). The MMPI-2 identifies its Dominance scale as "a fairly 
simple measure of a person's ability to take charge of his/her own life" (Duckworth 
& Anderson, 1995, p. 340) and as measuring "poise, self-assurance, resourceful- 
ness, efficiency, and perseverance" (p. 341). Note that the MMPI-2 Dominance 
scale has no indication of a desire to influence or control others, while the CPI's 
scale does. In fact, the MMPI-2 scale indicates the desire to influence others only 
when other scales are elevated. Both scales carry the name "Dominance," but they 
do not measure identical constructs. 

Finally, keep in mind that the definitions of various constructs change as soci- 
ety evolves and knowledge changes over time. Consider the emergence of emotional 
intelligence (Goleman, 1995), a construct derived from research in behavior and the 
processes of the brain, but not specifically measured by any of the major intelligence 
tests currently in use. 



THE INTERACTION OF RELIABILITY AND VALIDITY 



Quite simply, a test can never be more valid than it is reliable. Recall that a reliable 
test score is a mostly true estimate of a person's actual ability or characteristic, with 
only a little error contained in the test. If a test score is mostly composed of testing 
error, it cannot possibly be mostly composed of accurate assessment of the construct 
or ability in question. Stated another way, because unreliable test scores do not meas- 
ure accurately and/or consistently, it is difficult to demonstrate that they measure 
any particular construct or ability accurately and consistently. It is possible to have a 
reliable test without knowing exactly what it measures. Whatever it measures, a reli- 
able test does so consistently. Logically, though, it is not possible to have valid test 
scores that are unreliable. 

The reliability of predictor and criterion measures in criterion-related assess- 
ment is also an important factor in determining test score validity. Equally important 
is the reliability of comparison instruments used in convergent and discriminant 
construct-related validity and in factor analysis. Using instruments with low reliabil- 
ity in an effort to compile validity data on a test of interest inevitably introduces 
error into the resultant validity coefficients. 
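
The ceiling that reliability places on validity can be illustrated numerically. Under classical test theory (a standard result that this chapter does not derive), an observed validity coefficient cannot exceed the square root of the product of the predictor and criterion score reliabilities. The following is a minimal sketch under that assumption; the function name and the reliability values are hypothetical and chosen only for illustration.

```python
import math


def max_validity(predictor_reliability: float, criterion_reliability: float = 1.0) -> float:
    """Upper bound on an observed validity coefficient under classical test theory.

    Both arguments are score reliability coefficients between 0 and 1.
    """
    for r in (predictor_reliability, criterion_reliability):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability coefficients must lie between 0 and 1")
    return math.sqrt(predictor_reliability * criterion_reliability)


# Hypothetical values: a screening test with score reliability .81
# validated against a criterion measure with score reliability .64.
print(round(max_validity(0.81, 0.64), 2))
```

With these illustrative values the bound is about .72, so even a well-built test cannot show a higher validity coefficient against that criterion. Lowering either reliability lowers the ceiling.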



VALIDITY AND TESTING PRACTICE 



Test validity is important because decisions about which test to use and conclusions 
as to what scores indicate about clients are derived from our understanding of what 
the test measures. Following are some important considerations when using a partic- 
ular test with clients: 

■ Because a test cannot be more valid than it is reliable, always become familiar 
with the reliability of test scores, including the methods by which evidence for re- 
liability was established. 

■ Consider the size and makeup of the samples used in reliability and validity stud- 
ies. As in other forms of research, smaller samples make it less reasonable to gen- 
eralize results of the study to the population (Harris, 1998). If at all possible, the 






norming samples should be representative of the client(s) with whom you plan
to use the test. If they are not, interpret with caution, keeping in mind that
factors other than those the test is designed to measure may be affecting your
client's score.

■ Examine any test you use for biased items. Items may be more familiar to some 
identifiable groups of people than others. For example, a test item picturing a 
winter snow scene may be perceived differently by those who grew up in tropi- 
cal climates than by those whose winters were routinely snowy. 

■ Language is a significant contributor to potential bias, especially if the test is 
written in a language in which the test taker is not proficient. Use caution in ap- 
plying scores from tests that place the client at a disadvantage due to linguistic 
differences. 

■ Ethnicity can be a source of response variation in testing. Cultural differences 
can lead to different outcomes on a personality test, for example, even when lan- 
guage difference is not an issue. One culture's definition of appropriate behavior 
can be very different from another's, leading to erroneous assumptions about an 
individual's personality that actually emerge from cultural norms. 

■ Do not assume that the name of a test or scale accurately reflects the actual mean- 
ing of the test score. Always read the test manual to determine the exact defini- 
tion of the skill or construct being measured. 

■ Where possible, use more than one test or scale to increase the accuracy of as- 
sessment. Using more than one predictor increases the likelihood of correctly 
predicting a client's outcome score. Using more than one personality assessment 
provides more complete information about the trait under consideration, espe- 
cially if the tests purport to measure the same construct. 

■ Tests are not proven to be valid. The validity of a particular test score for use with 
a particular client under the circumstances at hand is a judgment call made by 
the professional counselor based on the amassed evidence supporting the test's 
validity and defining its meaning. Because professional counselors should use 
tests only with the intent of being helpful to the client, ask if this is the right test 
for the right client for the right reasons. 

THE APPLICATION OF VALIDITY: 
DECISION MAKING USING TEST SCORES 

The primary purpose behind administering psychological and educational tests is to 
help make accurate decisions that will benefit clients and students. Psychometricians 
and statisticians have developed a number of procedures for making decisions using 
a single test and multiple tests. 

Decision Making Using a Single Score 

By definition, decision making using a single test is relegated to the realm of a screen- 
ing procedure. There are three popular procedures for single-score decisions: deci- 
sion theory, linear regression, and setting a cutoff score. 




Decision theory 

Decision theory (Anastasi & Urbina, 1997) involves the collection of a screening 
test score and a criterion score, either at the same point in time (i.e., concurrent de- 
cision) or at some point in the future (i.e., predictive decision). Some common ex- 
amples of concurrent decisions would be virtually any clinical or diagnostic study in 
which a screening test for a mental or emotional disorder (e.g., depression, anxiety,
Attention-Deficit/Hyperactivity Disorder [AD/HD], dementia) would be adminis- 
tered concurrently with a clinical diagnosis from a qualified mental health profes- 
sional (sometimes called diagnostic validity), or the administration of an academic 
achievement test to a group of children and concurrent identification of low-per- 
forming students or students "at risk" for academic failure by a teacher or diagnosti- 
cian (sometimes called decision reliability). Examples of predictive decisions would 
involve any of these previous examples, but with the criterion of diagnosis or deter- 
mination of "at risk" status being collected months or years after the screening test 
was administered. In this way, the screening test would be used to predict future 
problems, usually allowing professional counselors and educators to put prevention 
or early intervention programs in play to lower the incidence of future problems. 
Whether used for concurrent or predictive purposes, the goal of the procedure is to 
maximize the likelihood of accurate decisions (sometimes called hits) while minimiz- 
ing inaccurate decisions (sometimes called misses or errors). Remember, the ultimate 
purpose of a screening procedure is to identify clients or students in need of deeper- 
level diagnostic assessment. 

As an example of applying decision theory, assume that a professional counselor 
has been asked to develop an accurate screening procedure to identify adults at risk 
for depression. The professional counselor first explores the literature and selects a 
published, efficient screening device for depression whose scores have previously 
demonstrated sufficient reliability and validity for screening-level purposes. To deter- 
mine the adequacy of the depression inventory for the requested service, the profes- 
sional counselor arranges for each new adult referral to several area clinics to com- 
plete the depression inventory and undergo a diagnostic evaluation with a qualified 
mental health professional. Selection of the criterion is critical. It is often viewed as 
the "gold standard" and should have the qualities of excellent score reliability and 
validity. This diagnostic evaluation would normally serve to identify mental and 
emotional disorders related to the clients' presenting problems and to aid in estab- 
lishing goals for counseling but because of the study's focus will also result in a clin- 
ical determination regarding the degree of clinical depression in the clients on a 5- 
point scale (e.g., 1 = Absence of Depressive Symptoms, 2 = Slightly Depressed, 3 = 
Mildly Depressed, 4 = Moderately Depressed, 5 = Severely Depressed). (Note:
Admittedly, the diagnosis of depressive disorders is complex; for the sake of this
example, the process has been simplified.) The professional counselor then collects two
pieces of data for each of the next 50 adult clients to the area clinics: (1) the screen- 
ing test score and (2) the clinical decision of the presence of clinical depression on 
the 5-point scale. The results of these 50 participants are presented in Figure 4.1. 
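
Once every participant has been classified against both cutoffs, the accuracy of the screening decisions can be tallied. The following is a minimal sketch that uses the four quadrant counts printed in Figure 4.1 (21 valid acceptances, 20 valid rejections, 3 false acceptances, 6 false rejections). The function name is hypothetical, and "sensitivity" and "specificity" are standard terms for two of the ratios rather than labels taken from this passage.

```python
def decision_summary(valid_accept: int, valid_reject: int,
                     false_accept: int, false_reject: int) -> dict:
    """Summarize screening accuracy from the four decision-theory quadrants."""
    total = valid_accept + valid_reject + false_accept + false_reject
    hits = valid_accept + valid_reject          # accurate decisions
    misses = false_accept + false_reject        # inaccurate decisions
    return {
        "total": total,
        "hit_rate": hits / total,
        "miss_rate": misses / total,
        # Share of criterion-identified clients that the test also identified.
        "sensitivity": valid_accept / (valid_accept + false_reject),
        # Share of criterion-cleared clients that the test also cleared.
        "specificity": valid_reject / (valid_reject + false_accept),
    }


# Quadrant counts as printed in Figure 4.1.
summary = decision_summary(valid_accept=21, valid_reject=20,
                           false_accept=3, false_reject=6)
for key, value in summary.items():
    print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

With these counts, 41 of the 50 decisions are hits, a hit rate of .82. Moving either cutoff line shifts cases between quadrants, which is why the choice of cutoffs affects decision accuracy.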

As can be seen in Figure 4.1, the distribution of scores is somewhat broad,
ranging from 0 to 50 on the depression screening test (0 indicates the Absence of






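[Figure 4.1: scatterplot of the 50 clients. Horizontal axis: Score on the Depression Screening Test (0 to 50). Vertical axis: clinical depression rating (1 to 5). A vertical Test Score Cutoff line and a horizontal Criterion Cutoff line (Identified above, Not Identified below) divide the plot into four regions: (6) False Rejections, (21) Valid Acceptances, (20) Valid Rejections, and (3) False Acceptances.]

Figure 4.1 An application of decision theory using a criterion cutoff score of 3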











































II 












• 






























(6) False Rejections 


















1 










































(21) Valid Acceptances 








4 














• 




• 




• 




• 


• 


• • 




• 




• 


















































































































































3 


























^ 




















* 








9 








* 




* 




* 






* 
















































































































III 
















IV 
























2 


• • 


»• 


• • 


» 


• * 


• 














• a 






















(20) Valid Rejections 










(3) False Acceptances 










































































































1 


















• 











































































































































































10 




20 




30 




40 




50 






Identified 



Criterion Cutoff 




Not Identified 



I 
Test Score Cutoff -^ 

Score on the Depression Screening Test 

Figure 4.1 An application of decision theory using a criterion cutoff score of 3 



Depression; 50 is the highest score possible and indicates Severe Depression), and 
from 1 to 5 on the clinical diagnostic rating (1 indicates the Absence of Depressive 
Symptoms; 5 indicates Severe Depression). The professional counselor now needs to 
use judgment in applying the decision-making model. How this judgment is applied 
may vary and, as will be seen below, has implications for the accuracy of decisions 
(i.e., the hit rate). One can see from Figure 4.1 that a criterion score cutoff line has 
been placed at scores of 3 or higher, and a test score cutoff line at scores of 20 or 
higher. The criterion cutoff have the teacher, mother, and father complete the respective versions 
of the DBRS, then plug their scores into the regression formula. For Juanita, 
assuming X x = 73, X 2 = 67, and X 5 = 55, the prediction formula would be: Y' = 
1.21 + (0.031)(73) + (0.024)(67) + (0.017X55) = 1.21 + 2.263 + 1.608 + 0.935 = 
6.016. Thus Juanita would be identified as having fulfilled the diagnostic criteria for 
AD/HD-PIT. For another, more distractible child, Nakita, presenting with scores of 



142 Chapter 4 



Table 4.2 T Scores on the DBRS Distractible Subscale for Three Students and Criterion Cutoff Scores 



Student name 



Teacher score (X,) Mother score (X 2 ) Father score (X 3 ) 



Decision 



Juanita 


70 


Nakita 


78 


Susanna 


37* 


Cutoff score required 


65 



67 
88 
49* 
65 



55 
90 
40* 
65 



No 
Yes 

No 



Note: ' designates a scote falling below the tequited cutoff scote of T = 65. 



X x = 78, X 2 = 88, and X 3 = 90, the prediction formula would be: Y' = 1.21 + 
(0.031X78) + (0.024)(88) + (0.017)(90) = 1.21 + 2.418 + 2.112 + 1.530 = 7.270. 
Thus, Nakita would be identified as having fulfilled the diagnostic criteria for 
AD/HD-PIT. For a third, less distractible child, Susanna, presenting with scores of 
X x = 37, X 2 = 49, and X 3 = 40, the prediction formula would be: Y' = 1.21 + 
(0.031X37)"+ (0.024)(49) + (0.017)(40) = 1.21 + 1.147 + 1.176 + 0.68 = 4.213. 
Thus, Susanna would not be identified as having fulfilled the diagnostic criteria for 
AD/HD-PIT. 

The primary advantage of the multiple regression technique is that it allows 
some scores to compensate for other scores. For instance, while the results were not 
in doubt in either Nakita's or Susanna's case, in Juanita's case, her father viewed her 
level of distractibility to be more or less normal (T = 55), while her teacher's and 
mother's scores were elevated (T = 73 and 67, respectively). These scores compen- 
sated for the low score of the father and put Juanita in the "diagnose" category. A 
primary disadvantage of the multiple regression technique is the necessity of labor- 
intensive preliminary data collection, data analysis, and standard setting. It is a lot of 
work to collect the several hundred protocols necessary to yield a reliable multiple re- 
gression equation. 

Multiple cutoff method 

The multiple cutoff method is far simpler to set up and implement than the multi- 
ple regression procedure. Basically, multiple cutoff means that the professional coun- 
selor must establish a minimally acceptable score on each measure under considera- 
tion, then analyze the scores of a given client or student and determine whether each 
of the scores meets the given criterion. Importantly, failure to meet even one of the 
cutoff scores will eliminate an examinee from consideration. As an example, consider 
the scores on the DBRS for the three girls, which are now presented in Table 4.2 for 
ease of comparison. 

The criterion score standard-setting decision is of critical importance in the 
multiple cutoff technique because criterion scores set too low will overidentify indi- 
viduals who do not have the condition, and criterion scores set too high will under- 
identify individuals who do have the condition. In the context of this multiple cut- 
off technique example, Nakita would be identified with AD/HD-PIT because each 
of her T scores on the DBRS exceeded the minimum criterion T score of 65. 
Likewise, Susanna would not be identified because none of her T scores on the 



Validity 1 43 

DBRS was high enough to warrant diagnosis. Interestingly, Juanita, who did qualify 
under the multiple regression procedures explained in the preceding section, would 
not be identified with AD/HD-PIT using these criterion scores because her father's 
rating of her did not meet the specified criterion (i.e., his rating of Juanita was a T 
score of 55, and a minimum score of 65 was required). 

It is important to understand that multiple cutoff techniques use hard-and-fast 
criteria, and violations are not allowed. Thus a low score on one test can effectively 
eliminate someone from consideration; other scores are not allowed to compensate 
for deficient scores, such as was the case in the multiple regression model. Therefore, 
a less than optimal administration for any reason (i.e., low motivation, response bias, 
faking bad or good) could result in a selection error. Because the multiple cutoff 
method is easier to set up and manage than the multiple regression method, it is 
more widely used. However, most clinicians use a third method, clinical judgment 
and diagnosis using a test battery. 



Think About It 4.3 How could you apply the multiple regression or 
multiple cutoff models to a decision-making problem in your area of coun- 
seling specialty? 



Clinical judgment and diagnosis using a test battery 

Clinical judgment relies on the experiences, information processing capability, the- 
oretical frameworks, and reasoning ability of the professional counselor to make 
sense out of sometimes-conflicting information, to arrive at a rational decision about 
the disposition of a client or student. Clinical judgment is not a statistics-driven de- 
cision-making method per se. Test results, interview information, behavioral obser- 
vations, and other data are interpreted and integrated, leading to a reasoned judg- 
ment or decision. Clinical decision making using a test battery can be a very complex 
undertaking, depending on the presenting problem(s), and requires a good deal of 
education, supervised training and experience, and analytical capabilities. It is also 
subject to theoretical differences and examiner bias; that is, the same information 
often leads to different conclusions based on a professional counselor's theoretical 
orientation(s) and personal or professional biases. A clinical case of a young girl
evaluated for problems with distractibility is presented in Box 4.1 to demonstrate how
data can be interpreted and integrated so that a clinical decision can be made. 



Box 4.1 Clinical Judgment Using a Battery of Tests: 
Case Study of Nakita 

Identifying Information 

Name: Nakita 

Chronological Age: 12 years, 2 months
Grade Placement: 6.6 


Reason for Referral and Initial Case Conceptualization 

Nakita was referred for psychoeducational evaluation by her mother. The 
primary referral concerns were distractibility, difficulty understanding and/or 
following directions, and poor school performance in the academic areas of 
reading, science, and written expression. No significant emotional issues 
were reported by the parents or school. Initially, this evaluator sought to ex- 
plore the existence of a significant learning disorder in reading and writing 
and significant degrees of inattention commonly associated with AD/HD. A 
general emotional and behavioral screening was also undertaken to rule in or 
rule out conditions that mask and mimic the symptoms of inattention, as 
well as determine Nakita's general level of emotional adjustment. 

Assessment Techniques 

Because the referral concern was both behavior (inattention) and academic 
(language arts, science), the examiner chose instruments that would be useful 
in the identification of potential learning problems and behavior disorders, 
such as AD/HD, and would also screen for emotional adjustment. The fol- 
lowing assessments were intentionally selected at the outset of the evaluation: 

■ Wechsler Intelligence Scale for Children — Fourth Edition (WISC-IV) (as an
intellectual assessment to establish an anchor score for expected achieve- 
ment levels and to determine learning strengths and weaknesses) 

■ Beery's Developmental Test of Visual-Motor Integration (VMI-3, Motor, and
Visual) (as a gross screen for visual perception, fine-motor coordination, 
and visual-motor integration) 

■ Woodcock-Johnson Tests of Achievement — Third Edition (WJ-III ACH) (to 
establish achievement levels in the major academic subject areas and deter- 
mine whether a learning disorder is evident) 

■ Conners' Parent and Teacher Rating Scale — Revised: Long Versions (CPRS-R:L
and CTRS-R:L) (to screen for inattention and other behavioral/emotional
concerns) 

■ Clinical interview (exploration of developmental history and clinical con- 
ditions using structured protocols found in Appendixes A and C of Erford, 
2006). 

The following tests were also administered as a result of additional questions 
and hypotheses that came up during the evaluation: 

■ Test of Auditory Perceptual Skills-Revised (TAPS-R) — Word Discrimination
and Auditory Processing subtests (to rule out auditory perceptual and pro- 
cessing deficiencies) 

■ Jebsen Writing Speed subtest (to assess for handwriting speed, sometimes 
deficient in clients with fine-motor coordination and processing speed 
difficulties) 

■ Stanford-Binet Intelligence Scale — Fourth Edition: Memory for Sentences
subtest (to assess for language-loaded short-term auditory memory skills)




■ Wide Range Achievement Test — Third Revision (WRAT-3): Spelling subtest 
(as a validating spelling test) 

■ Slosson Written Expression Test (SWET) (for further exploration of writing 
mechanics) 

■ Visual Aural Digit Span Test (VADS) (for further exploration of short-term
auditory and visual memory difficulties) 

Background Information 

Clinical interviewing using a structured protocol and reports from the 
teachers provided a wealth of helpful background information. Nakita is a 
12-year, 2-month-old African American girl currently attending grade 6 at 
XYZ Middle School. Her mother reports the primary concerns to be age- 
inappropriate inattention and difficulty in the academic areas of language 
arts and sciences. Nakita is reported to be easily distracted by the slightest 
sound and easily frustrated. She is very artistic and enjoys drawing. She has 
struggled with reading since the first grade. Currently, reading comprehen- 
sion appears to be problematic, as well as understanding word problems in 
math. Recently, Nakita has begun to struggle in science, and this difficulty 
appears to result from a complex interaction of reading comprehension, 
conceptual difficulties, and teaching style. Nakita also reportedly has diffi- 
culty following multistep directions, although it is unclear whether this 
difficulty is due to a lack of understanding or to a lack of motivation. She 
has a wonderful sense of humor, but is becoming more temperamental 
when it comes to academic tasks. 

Previous group-administered testing indicated Average to High Average 
school ability on the Otis-Lennon School Ability Test (OLSAT). Her 5th-grade 
achievement testing indicated Average math achievement (46th percentile), 
reading comprehension (58th percentile), and writing mechanics (30th per- 
centile). Mr. Trig, Nakita's math and social studies teacher, is concerned 
about Nakita's weak skill retention in math. Nakita reportedly needs a lot of 
practice and relearning to keep her grades in the passing range. He also re- 
ports that Nakita is very distractible and impulsive. Socially and emotionally, 
Mr. Trig describes Nakita as a very pleasant and kind student who is always 
smiling. Mrs. Bookworm, Nakita's language arts teacher, reports that Nakita 
often becomes talkative and "clowns around" during inappropriate moments 
in class — often when answering questions or presenting in front of the class. 
Because of being behaviorally off-task, Nakita often misses important infor- 
mation and displays inconsistent comprehension. Mrs. Bookworm also re- 
ports that Nakita has a wonderful zeal for learning and a sense of humor that 
often energizes classroom activities. She is a hard worker and frequently par- 
ticipates in classroom discussions. She is also very loyal and supportive of 
friends. Although Nakita struggles with higher-order thinking skills, compre- 
hension, and writing mechanics, Mrs. Bookworm believes that she is a 
bright, tenacious, and capable student. 


Nakita attended XYZ Elementary from kindergarten through grade 5. 
Reading has always been an area of academic difficulty. She has traditionally 
displayed a poor sight-word vocabulary and reading comprehension. She has 
not displayed letter-number reversals since grade 1. Nakita is currently
placed in the "low" math group, according to her mother. Her math calcula- 
tion skills appear satisfactory, but Nakita is struggling with the story prob- 
lems. Nakita's short-term memory (both auditory and visual) is reportedly 
poor. Written language has also been an area of consistent difficulty. Her 
spelling, capitalization, and punctuation skills are reportedly deficient. She 
has excellent penmanship, and is a fast keyboarder. Nakita taught herself to 
keyboard and is very proud of her ability in this regard. 

Nakita's parents divorced five years ago. Nakita has an older sister who is 
a very strong student. Nakita does engage in periodic day visits with her fa- 
ther, but no overnight stays. Nakita's birth and developmental history was 
normal, and she met all developmental milestones either on time or ahead of 
time. Her medical history is unremarkable. Nakita is reportedly a happy, so- 
ciable child. She is very outgoing and popular with peers. Her mother and 
teachers report that Nakita's social and emotional development is within nor- 
mal limits and not of primary concern at this time. 

Maternal family history reportedly is negative for learning and emotional 
problems. Her mother reports she was a straight-A student and not at all dis- 
tractible. She completed one year of college and is currently employed in real 
estate management. Nakita's birth father was not available for interview. 
Nakita's mother reports seeing many similarities in learning styles between 
Nakita and her father. She indicated that Nakita's father was a strong math 
student, but struggled academically — although no specific details were pro- 
vided. He did not finish high school and is currently a construction worker. 
She indicated that Nakita's father enjoyed reading and was very artistic but 
had poor writing skills. He reportedly had great difficulty focusing his atten- 
tion on task and was easily distracted. A paternal grandmother reported that, 
as a child, Nakita's father was very overactive. A paternal brother has been di- 
agnosed with depression and, reportedly, is aggressive and possesses a temper. 
Nakita's father also reportedly has difficulty controlling his temper. 

The formal evaluation was conducted over two mornings in consecutive 
weeks. Formalized evaluation centered on the areas of intellectual, percep- 
tual, achievement, behavioral, and emotional development. Nakita was a 
well-mannered child and was very cooperative during the evaluation. 
Rapport was easily established, and she attempted all items presented to her. 
Nakita displayed a quite high interest level throughout the evaluation. She 
displayed no obvious physical or sensory deficits, nor did she appear anxious. 
Therefore, the obtained results are considered to be an accurate representa- 
tion of Nakita's current level of functioning. Her test results, briefly inter- 
preted, are given in Tables 4.3 through 4.6. 

Nakita was administered the Wechsler Intelligence Scale for Children — 
Fourth Edition (WISC-IV) to establish a level of expectation for scholastic






Table 4.3 What Nakita's scores mean

Standard score   Scale score   T score   Interpretive range meaning
130+             16+           70+       Very Superior
120-129          14-15         63-69     Superior
110-119          12-13         57-62     High Average
90-109           9-11          43-56     Average
80-89            6-8           37-42     Low Average
70-79            4-5           30-36     Borderline
55-69            3             20-29     Mildly Deficient
40-54            2             10-19     Moderately Deficient
<40              0-1           <10       Severe and Profoundly Deficient



Wechsler Intelligence Scale for Children — Fourth Edition (WISC-IV)

Index                        IQ; Range      Percentile rank; Range   Interpretive range
Verbal Comprehension Index   119; 111-125   90; 77-95                High Average to Superior
Perceptual Reasoning Index   117; 108-123   87; 70-94                Average to Superior
Working Memory Index         74; 68-84      4; 2-14                  Mildly Deficient to Low Average
Processing Speed Index       75; 69-87      5; 2-19                  Mildly Deficient to Low Average
Full Scale IQ                100; 95-105    50; 37-63                Average

Verbal Comprehension Index subtests      Perceptual Reasoning Index subtests
Similarities               14 S*         Block Design       11
Vocabulary                 12            Picture Concepts   13
Comprehension              14 S          Matrix Reasoning   14 S

Working Memory Index subtests            Processing Speed Index subtests
Digit Span                 5 W*          Coding             5 W
Letter-Number Sequencing   5 W           Symbol Search      6 W

Note: * S = Intrapersonal strength; W = Intrapersonal weakness.



achievement and identify her learning strengths and weaknesses. Nakita's 
Verbal Comprehension Index (VCI) score was measured to lie in the High 
Average to Superior range (percentile rank = 90; percentile rank range = 
77-95), commensurate with her Perceptual Reasoning Index (PRI) score, 
which fell in the Average to Superior range (percentile rank = 87; percentile 
rank range = 70-94). While Nakita currently performs in the Average range 
of general cognitive ability (Full Scale percentile rank = 50; percentile rank 
range = 37-63), her true educational potential is probably much closer to 
her VCI and PRI capabilities (standard score of approximately 118; High
Average to Superior capabilities), and it is this score that will serve as the an- 
chor score for determining intrapersonal weaknesses and achievement areas 
in need of improvement. Nakita's Working Memory Index (WMI) score fell 


in the Mildly Deficient to Low Average range (percentile rank = 4; percentile 
rank range = 2-14), as did her Processing Speed Index (percentile rank = 5; 
percentile rank range = 2-19). Both the WMI and PSI were significantly 
below current ability estimates and are considered significant intrapersonal 
weaknesses. Subtest analysis indicates that Nakita displayed intrapersonal 
strengths on tasks requiring verbal abstract reasoning (Similarities subtest 
percentile rank = 90); social comprehension (Comprehension subtest per- 
centile rank = 90); and visual analogical reasoning (Matrix Reasoning subtest 
percentile rank = 90). Significant intrapersonal weaknesses were noted on 
tasks requiring short-term auditory recall (Digit Span subtest percentile 
rank = 5); recall and organization of auditory stimuli (Letter-Number 
Sequencing percentile rank = 10); short-term visual recall and psychomotor 
speed (Coding subtest percentile rank = 5); and speed in processing visual 
information (Symbol Search subtest percentile rank = 10). Thus Nakita 
presents as a bright child with potential weaknesses in processing speed and 
in short-term auditory and visual memory. 
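
The percentile ranks reported for the index scores follow closely from the normal curve. The following is a minimal sketch, assuming the conventional score metrics (standard scores with mean 100 and SD 15, scaled scores with mean 10 and SD 3, T scores with mean 50 and SD 10) and an exactly normal distribution. Published norm tables are built from actual standardization samples and may differ by a point or two. The names used here are hypothetical.

```python
from statistics import NormalDist

# (mean, standard deviation) for each conventional score metric.
SCALES = {"standard": (100, 15), "scaled": (10, 3), "t": (50, 10)}


def percentile_rank(score: float, scale: str = "standard") -> float:
    """Percentage of the normative group expected to score at or below `score`."""
    mean, sd = SCALES[scale]
    return 100 * NormalDist(mean, sd).cdf(score)


# Index standard scores as reported in the WISC-IV table for Nakita.
indexes = {"VCI": 119, "PRI": 117, "WMI": 74, "PSI": 75, "FSIQ": 100}
for name, score in indexes.items():
    print(f"{name} = {score}: percentile rank of about {round(percentile_rank(score))}")
```

Under these assumptions the index scores of 119, 117, 74, 75, and 100 convert to percentile ranks of about 90, 87, 4, 5, and 50, which matches the values reported in the table.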



Stanford-Binet Intelligence Scale — Fifth Edition: Sentence Memory subtest

Standard Score = 92    Percentile Rank = 29

Test of Auditory Perceptual Skills-Revised (TAPS-R)

Auditory Word Discrimination subtest   Scaled Score = 11   Percentile Rank = 63
Auditory Processing subtest            Scaled Score = 12   Percentile Rank = 75



Because a presenting concern had to do with Nakita's ability to under- 
stand directions, it was important to explore the possible existence of a lan- 
guage processing disorder and central auditory processing disorder. The 
above-mentioned WISC-IV VCI subtest results do not support the existence
of a language processing disorder because they all fell in the above-average 
ranges. To rule out the existence of a central auditory processing disorder, 
two subtests from the Test of Auditory Perceptual Skills-R were administered. 
Nakita performed in an Average to High Average capacity on each subtest. 
She scored at the 63rd percentile rank on a task requiring auditory word dis- 
crimination and the 75th percentile rank on a task purporting to measure 
auditory processing. Thus little support was garnered for the existence of a 
central auditory processing disorder. 

To further assess Nakita's short-term auditory recall, the Memory for 
Sentences subtest of the Stanford-Binet Intelligence Scale — Fourth Edition was 
administered. Nakita performed at the 29th percentile on this task, commen- 
surate with WMI estimates and significantly below intellectual estimates. 



Visual Aural Digit Span Test (VADS)

Visual Memory      10th percentile
Auditory Memory    25th percentile



Next, the VADS was administered to validate weaknesses in short-term 
auditory and visual memory observed during administration of the WISC-IV.
On this administration of the VADS, Nakita scored at the 10th and 25th
percentiles on the visual and auditory memory components, respectively. 
Both performances were significantly below expected levels and validate the 
weaknesses observed during administration of the WISC-IV. Thus the exis- 
tence of significant distractibility in the auditory and visual channels remains 
as a primary explanation for Nakita's difficulty in successfully performing in 
class and carrying out multistep directions. 

Notice that each "hypothesis" generated from the presenting problem is 
being systematically explored through clinical interviewing and results from 
selected tests. 

Jebsen Writing Speed Subtest
Trial 1 = 22 seconds (approximately the 15th percentile)
Trial 2 = 23 seconds (approximately the 15th percentile)

To validate the apparent weakness in processing speed, the Jebsen Writing 
Speed subtest was administered and resulted in deficient writing speed
performances. The 15th percentile is approximately one standard deviation
below the mean, indicating that about 85 percent of same-aged girls can write faster than
Nakita. This slow motor speed was commensurate with the deficient 
Processing Speed Index scores reported above. These results are extraordinar- 
ily important when trying to understand the academic difficulties that 
Nakita is currently facing. These results indicate that Nakita's processing and 
writing speed are substantially slower than expected for a child of her ability. 
This is likely to be evidenced in the classroom through slower writing, note- 
taking, and task completion speeds. 
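
The link between standard scores and percentile ranks used throughout this report can be illustrated with a short sketch. It assumes scores are normally distributed with a mean of 100 and a standard deviation of 15; the function names are illustrative only, and published norm tables, not this formula, remain the authority for any actual test.

```python
from statistics import NormalDist


def percentile_rank(standard_score, mean=100.0, sd=15.0):
    """Percentage of a normal reference group scoring below standard_score."""
    z = (standard_score - mean) / sd
    return 100.0 * NormalDist().cdf(z)


def standard_score_at(percentile, mean=100.0, sd=15.0):
    """Standard score that corresponds to a given percentile rank."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return mean + sd * z


# One standard deviation below the mean falls near the 16th percentile.
print(round(percentile_rank(85), 1))

# Standard scores of 120, 122, and 90 map to roughly the 91st, 93rd,
# and 25th percentile ranks under the normal-curve assumption.
for score in (120, 122, 90):
    print(score, round(percentile_rank(score)))
```

Under these assumptions a score exactly one standard deviation below the mean sits at about the 16th percentile, which is why the 15th percentile is described as approximately one standard deviation below the mean.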

Test of Visual Motor Integration

VMI       Standard Score = 120    Percentile Rank of 91
Visual    Standard Score = 122    Percentile Rank of 93
Motor     Standard Score = 90     Percentile Rank of 25

Nakita's performance on Beery's Developmental Test of Visual-Motor
Integration — Third Edition (VMI-3) exceeded that of 91% of other children 
her age, falling in the High Average to Superior range of performance. This 
edition of the Beery also allows exploration of visual-perceptual and motor ca- 
pabilities. Nakita's fine-motor coordination performance exceeded only 25% 
of age-mates (Low Average to Average), while her performance on the visual- 
perceptual task of the VMI-3 was High Average to Superior (93rd percentile 
rank). Altogether, Nakita's visual-motor and visual discrimination capabilities 
appear well developed at this time, actually exceeding current intellectual abil- 
ity estimates. However, her fine-motor coordination is poorly developed. 

In an effort to explore Nakita's current educational achievement and de- 
termine whether significant learning problems are occurring in the areas of 
reading and writing, selected subtests of the Woodcock-Johnson: Tests of 
Achievement — Third Edition (WJ-III), the Wide-Range Achievement Test — 
Third Edition (WRAT-3), and the Slosson Written Expression Test (SWET) 
were administered. 




Table 4.4 Woodcock-Johnson Tests of Achievement-Third Edition (WJ-III) (Conversions based on age norms)

Subtest                  Standard score    Percentile rank    Range
Word identification      105; 97-113       64; 41-80          Average to High Average
Passage comprehension    114; 102-126      83; 56-96          Average to Superior
Reading fluency          90; 85-95         25; 16-37          Low Average to Average
Math calculation         103; 93-113       59; 32-80          Average to High Average
Applied problems         96; 84-108        39; 15-70          Low Average to Average
Math fluency             78; 74-82         8; 4-12            Borderline to Low Average
Spelling                 86; 76-96         18; 6-39           Borderline to Average
Writing samples          111; 97-125       77; 43-95          Average to Superior
Writing fluency          88; 83-93         21; 17-32          Low Average to Average



Table 4.5 Slosson Written Expression Test (SWET)

Subscale/Scale                  Scaled/Standard score    Percentile rank    Interpretive range
Writing maturity                100; 90-110              50; 25-75          Average
  Type-token Ratio              11; 9-13                 63; 37-84          Average to Above Average
  Av. Sentence Length           9; 6-12                  37; 10-75          Below Average to Average
Writing mechanics               81; 76-90                10; 5-25           Deficient to Average
  Spelling                      7; 5-9                   16; 5-37           Deficient to Average
  Capitalization                6; 4-8                   10; 1-25           Very Deficient to Average
  Punctuation                   8; 6-10                  25; 10-50          Below Average to Average
Written expression total SS*    89; 83-97                23; 13-42          Below Average to Average

Note: * SS = Standard Score (M = 100; SD = 15)




Wide-Range Achievement Test — Third Revision

Spelling Subtest    Standard Score = 88    Percentile Rank = 21

Nakita was administered the Woodcock-Johnson Tests of Achievement —
Third Edition (WJ-III) to explore reported weaknesses in language arts
content areas. On the tests of reading, some task variability was noted as
her passage comprehension skills (percentile rank range = 56-96; Average to
Superior) were slightly better developed than her sight-word vocabulary
(percentile rank range = 41-80; Average to High Average). Both of these areas
were commensurate with current ability estimates. However, her reading
fluency was significantly below expected levels given current ability
estimates (percentile rank range = 16-37; Low Average to Average). Reading
fluency is a function of processing speed, reading speed, and attentional
control, and this performance represented a 28-point discrepancy below ability.

In mathematics, Nakita's calculation skills were Average to High Average,
exceeding approximately 59% of age-mates (percentile rank range = 32-80),
while her problem-solving capabilities were Low Average to Average,
exceeding approximately 39% of age-mates (percentile rank range = 15-70). Her
math problem-solving skills were slightly to significantly below current
ability estimates (a 22-standard-score-point discrepancy). However, her math
fluency score was very significantly below expected levels given current
ability estimates (percentile rank range = 4-12; Borderline to Low Average).
Math fluency is a function of processing speed, computational speed, and
attentional control, and this performance represented a
40-standard-score-point discrepancy below ability.

Nakita's written expression in context (Writing Samples subtest) was
significantly better developed (percentile rank range = 43-95; Average to
Superior) than her spelling skills in isolation (Spelling subtest percentile
rank range = 6-39; Borderline to Average). Her written expression was
commensurate with ability estimates, while her spelling skills were
significantly deficient. The WRAT-3 Spelling subtest was administered to
further explore Nakita's spelling skills and she performed at the 21st
percentile (Low Average to Average), confirming deficient spelling skills.
She appears to struggle substantially with nonconventional spelling patterns.
Interestingly, her Writing Samples responses were frequently inappropriately
punctuated and capitalized and consisted of simple vocabulary and sentence
structure. To further explore the nature of suggested writing difficulties,
the Slosson Written Expression Test (SWET) (Hofler, Erford, & Amoriell, 2001)
was administered. The SWET requires the student to compose a story about a
picture cue, and the product is scored for writing maturity and mechanics. On
this administration of the SWET, Nakita's Writing Maturity Index was
slightly below expected levels, but her Writing Mechanics Index was 37
standard-score points below current ability estimates. Importantly, her
capitalization, punctuation, and spelling were consistently poorly developed.
Thus, a Disorder of Written Expression (mechanics) is evident to a
significant degree. In addition, Nakita's Writing Fluency subtest score was
very significantly deficient in comparison with current ability estimates
(percentile rank range = 17-32; Low Average to Average). Writing fluency is a
function of processing speed and attentional control, and this performance
represented a 30-point discrepancy below ability.
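
The discrepancy figures quoted in the preceding paragraphs can be reproduced with simple subtraction. The sketch below assumes that the ability estimate is the deviation IQ estimate of 118 given in the final conceptualization of this report; the dictionary keys and function name are illustrative, and the scores are taken from Tables 4.4 and 4.5.

```python
# Ability estimate (deviation IQ) reported in the final conceptualization.
ABILITY_ESTIMATE = 118
STANDARD_DEVIATION = 15  # standard-score metric: M = 100, SD = 15

# Obtained standard scores from Tables 4.4 (WJ-III) and 4.5 (SWET).
achievement_scores = {
    "Reading fluency (WJ-III)": 90,
    "Applied problems (WJ-III)": 96,
    "Math fluency (WJ-III)": 78,
    "Writing fluency (WJ-III)": 88,
    "Writing mechanics (SWET)": 81,
}


def discrepancy(ability, achievement):
    """Standard-score points by which achievement falls below ability."""
    return ability - achievement


discrepancies = {
    name: discrepancy(ABILITY_ESTIMATE, score)
    for name, score in achievement_scores.items()
}

for name, gap in discrepancies.items():
    # Express the gap in points and in standard deviation units.
    print(f"{name}: {gap} points ({gap / STANDARD_DEVIATION:.1f} SD) below ability")
```

Whether a given gap counts as significant is a separate judgment that depends on the criteria the examiner or jurisdiction applies; the sketch only recovers the arithmetic.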

Because a referral question was whether Nakita possessed significant 
problems with inattention, clinical and behavioral assessments focused on 
the presence of age- and ability-inappropriate levels of distractibility, primary 
symptoms of an Attention-Deficit/Hyperactivity Disorder (AD/HD). 

Mr. Trig and Mrs. Bookworm, teachers who have instructed Nakita and
who are well acquainted with her academic and behavioral performance,
completed the Conners' Teacher Rating Scale — Revised, Long Version
(CTRS-R:L). Nakita's mother completed the Conners' Parent Rating Scale —
Revised, Long Version (CPRS-R:L). All respondents indicated substantial
concerns regarding Nakita's inattentive behaviors, indicating that Nakita frequently



Table 4.6 Nakita's results from the Conners' Rating Scales — Revised

Columns: CPRS-R:L = Conners' Parent Rating Scale Revised: Long Version;
CTRS-R:L = Conners' Teacher Rating Scale Revised: Long Version. All values are T scores.

Scale                          CPRS-R:L           CTRS-R:L    CTRS-R:L
                               Nakita's mother    Mr. Trig    Mrs. Bookworm
A. Oppositional                61                 46          50
B. Cognitive Problems          67*                76*         74*
C. Hyperactivity               49                 54          48
D. Anxious/shy                 49                 46          46
E. Perfectionism               42                 41          49
F. Social Problems             45                 46          46
G. Psychosomatic               51
L. DSM-IV: Inattentive         67*                76*         68*
M. DSM-IV: Hyper-Impulsive     47                 55          46

Note: * designates a score falling above the required cutoff score of T = 65.




avoids engaging in tasks requiring sustained mental effort; fails to give close 
attention to details; has difficulty sustaining attention on tasks; is easily dis- 
tracted by sights and sounds; loses things needed for tasks; and has difficulty 
concentrating. Each of these items loads heavily on inattention, a core com- 
ponent of AD/HD — Predominantly Inattentive Type. All other personality 
and behavioral functioning was reported to be well within normal limits. 
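
The cutoff rule stated in the note to Table 4.6 can be sketched as a simple screen. The example assumes that "above the cutoff" means a T score strictly greater than 65; the respondent labels, dictionary layout, and function names are illustrative and are not part of the Conners' scoring materials.

```python
CUTOFF_T = 65  # cutoff given in the note to Table 4.6

# T scores from Table 4.6, keyed by respondent and then by scale.
ratings = {
    "mother (CPRS-R:L)": {
        "Oppositional": 61, "Cognitive Problems": 67, "Hyperactivity": 49,
        "Anxious/shy": 49, "Perfectionism": 42, "Social Problems": 45,
        "Psychosomatic": 51, "DSM-IV: Inattentive": 67,
        "DSM-IV: Hyper-Impulsive": 47,
    },
    "Mr. Trig (CTRS-R:L)": {
        "Oppositional": 46, "Cognitive Problems": 76, "Hyperactivity": 54,
        "Anxious/shy": 46, "Perfectionism": 41, "Social Problems": 46,
        "DSM-IV: Inattentive": 76, "DSM-IV: Hyper-Impulsive": 55,
    },
    "Mrs. Bookworm (CTRS-R:L)": {
        "Oppositional": 50, "Cognitive Problems": 74, "Hyperactivity": 48,
        "Anxious/shy": 46, "Perfectionism": 49, "Social Problems": 46,
        "DSM-IV: Inattentive": 68, "DSM-IV: Hyper-Impulsive": 46,
    },
}


def elevated_scales(t_scores, cutoff=CUTOFF_T):
    """Return the scales whose T score falls above the cutoff."""
    return [scale for scale, t in t_scores.items() if t > cutoff]


elevated = {who: elevated_scales(scores) for who, scores in ratings.items()}

# Scales flagged by every respondent (cross-informant agreement).
agreed = sorted(set.intersection(*(set(v) for v in elevated.values())))

for who, scales in elevated.items():
    print(who, "->", scales)
print("Flagged by all respondents:", agreed)
```

Agreement across informants is only one strand of evidence; as the report goes on to show, the rating-scale elevations are weighed together with test results and the clinical interview.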

A clinical interview involving both Nakita and her mother confirmed 
much of the evidence substantiating a mild to moderate attentional defi- 
ciency without the associated hyperactive features. However, because it has 
been well documented in research literature that myriad conditions exist that 
mask and/or mimic the symptoms associated with AD/HD, an exhaustive 
interview was conducted to rule out more than two dozen clinical and cogni- 
tive disorders that often lead to misdiagnosis (see Appendix C of Erford, 
2006). Upon concluding this interview, Nakita was determined to not dis- 
play substantial symptoms associated with disruptive behavior, anxiety, or 
depressive disorders. No medical history of lead poisoning, hyperthyroidism, 
or allergies in Nakita or family members was reported. Nakita does not ex- 
hibit a visual or auditory processing disorder, and her history is reportedly 
negative for physical or sexual abuse and abuse of alcohol or other drugs. No 
tic or seizure disorders, hallucinations, or delusions were reported or evi- 
denced, and Nakita displayed a history of positive social relationships and in- 
teractions. Thus myriad conditions shown to mask and/or mimic AD/HD 
were ruled out. 

In conclusion, behavior rating scales, cognitive-perceptual information, 
and clinical interview confirm that Nakita fulfills the diagnostic criteria for

Table 4.7 DSM-IV-TR diagnostic summary for Nakita 

Axis I — 314.00 AD/HD — Predominantly Inattentive Type 

315.4 — Developmental Coordination Disorder (fine-motor)

315.2 — Disorder of Written Expression (mechanics, spelling) 

315.9 — Learning Disorder — NOS (processing speed) 
Axis II — None 
Axis III — None 

Axis IV — Academic/testing problems 
Axis V — Global Assessment of Functioning (GAF) (current) = 69 



AD/HD — Predominantly Inattentive Type, Developmental Coordination 
Disorder, and Disorder of Written Expression. These conditions are 
presently mild to moderate in severity and are affecting her schoolwork 
production and performance. Also of concern is a deficiency in processing 
speed [Learning Disorder — Not Otherwise Specified (NOS)] that ad- 
versely impacts her motivation to engage in written expression and other 
academic activities and affects the quality of written expression and other 
academic output. 

Final Conceptualization and Recommendations 

Nakita is a 12-year-old girl currently attending grade 6 at XYZ Middle 
School. She currently performs in the Average range of general intellectual 
ability, but her VCI and PRI index scores indicate her intellectual capabilities 
are much higher (deviation IQ estimate = 118). Deficiencies in processing 
speed and short-term auditory and visual recall were noted. A significant 
achievement deficiency was noted in written expression and spelling 
(Disorder of Written Expression). This inconsistency is often apparent in 
children with deficient processing speed because the speed of their written 
expression cannot keep up with the flow of ideas they are trying to commu- 
nicate. Frequently inattentive and disorganized, Nakita fulfills the diagnostic 
criteria for Attention-Deficit/Hyperactivity Disorder — Predominantly 
Inattentive Type. In addition, Nakita displays a Developmental 
Coordination Disorder (fine motor). At this time, the extent of these disor- 
ders appears mild to moderate in severity and affects Nakita's schoolwork 
production and performance. 

The following recommendations are offered: 

1. Nakita's mother is encouraged to share the results of this evaluation with 
Nakita's physician and to seek the physician's guidance in developing a 
treatment plan that addresses Nakita's inattentiveness and disorganization. 

2. Nakita may benefit from short-term remedial tutoring in written
expression and mechanics. In particular, this course of action should address
a review of written-language mechanics rules (punctuation, capitalization,
and grammar in context), as well as composition construction strategies and
skills.

3. Nakita can be helped to better understand task directions when she and 
her teachers and parents break down multistep directions into a sequence 
of ordered steps. It will help to: 

■ Write them down and number the steps so Nakita can complete the 
steps one at a time. 

■ Have Nakita check with an adult after completing each step and be- 
fore moving on to the next step. She currently is experiencing a good 
deal of frustration by making mistakes and misunderstanding direc- 
tions in the early steps of a multistep task. Having an adult check her 
progress at each step before moving on will help eliminate some of this 
frustration. 

■ Be sure Nakita is on the right track when beginning the assignment. 

■ Give an example of what she is to do. 

■ Check her progress frequently. 

■ Have Nakita rephrase directions in her own words to be sure she un- 
derstands them. 

■ Have a well-organized student help Nakita transition from step to step. 

■ Have Nakita do two or three examples under the supervision of a 
teacher, parent, or student helper to be sure she understands the 
process before beginning to complete items independently. 

■ Make sure multistep directions are written down, whether on the paper, 
a chalkboard, or an index card. 

4. Classroom and home-study modifications that may facilitate Nakita's aca- 
demic performance include: 

■ Consider creating compositions with a written outline and verbally 
constructing the composition on audiotape. A transcription of the au- 
diotape can then be made, embellished on, and proofread. This proce- 
dure will capitalize on Nakita's verbal strengths and minimize the frus- 
tration that ensues by her forgetting good ideas when trying to 
construct compositions from memory. 

■ Encourage Nakita to further develop keyboarding skills to facilitate her 
typing. She should strive to type at a rate of greater than 40 words per 
minute by the beginning of her 9th-grade year. 

■ Allow Nakita to compose compositions and other written work on a 
word processor. She should immediately begin to type and edit her 
written work using the word processor. 

■ Cut back repetitive homework assignments beyond the point of 
mastery. 

■ Give Nakita preferential seating near the primary area of instruction, 
with her back facing any distracting students or stimuli. 

■ Surround her with focused role models who will not distract her and 
who will not allow Nakita to distract them. 




Classroom and home-study modifications that may facilitate Nakita's be- 
havioral and work habit adjustments include: 

■ A daily assignment notebook that allows daily or at least weekly com- 
munication between the parents and teachers. 

■ Praise and encouragement that emphasize Nakita's accomplishments 
and successes (no matter how small). 

■ Brief verbal reprimands addressing behaviors, not perceived motiva- 
tions, followed by praise and encouragement for successes. 

■ Behavioral contracts that identify specific academic and behavioral 
goals. 

■ The use of a timer to break assignments into smaller time units of more 
intense focus. For a preteenager with a short attention span, timed 
units should not generally exceed 15 to 30 minutes. After a short break
with plenty of performance feedback and encouragement, as well as 
some physical movement or exercise, the next timed task can ensue. 

■ Appropriate home and school study spaces, with set times, no distrac- 
tions, and a recognized routine. 

Treatment of AD/HD can be addressed best through a combination of: 

■ Parent, teacher, and student education on the nature and treatment of 
AD/HD. 

■ Behavior modification to address educational and behavioral issues. 

■ Educational modifications to make Nakita more successful in the 
classroom. 

■ Medical intervention as determined by Nakita's attending physician.

Because of Nakita's slow processing speed, she will benefit from extra time
given to complete standardized tests, particularly timed, group-administered
tests of achievement. Extra time should also be given, as needed, on
in-school tests so that Nakita's grades will reflect mastery of content, rather
than suppression due to time constraints.



A primary strength of using clinical judgment is the flexibility it affords the de- 
cision maker. A seasoned examiner will be quick to admit that the various data ac- 
cumulated during an evaluation do not always agree. There are times when two tests 
purporting to measure a similar construct may yield dissimilar results. There are 
times when teachers, spouses, mothers, and fathers who are asked the same set of 
questions about the same client will vary widely in their responses, sometimes due to 
varying perceptions, response bias, or the intent to deceive. In fact, it is more often 
the case that some data do conflict, thus requiring great skill and judgment on the 
part of the examiner to realize what to focus on and what not to focus on. In these 
instances, clinical judgment is indispensable as a tool for reckoning divergent infor- 
mation from diverse data sources. However, this same flexibility can also lead to ex- 
aminer bias and a decision-making process that lacks reliability (i.e., consistency) and 
usefulness. Indeed, some have demonstrated that statistical models, compared to 
clinical judgment models, lead to more reliable and valid decisions. 






Table 4.8 Example of a multiple regression/multiple cutoff 
hybrid decision-making model 

In the following scenario, a decision must be made to select the three most qualified applicants.
$X_1$, $X_2$, and $X_3$ are the scores on the selection tests. $Y'$ is the predicted criterion score based on
the multiple regression equation $Y' = a + b_1X_1 + b_2X_2 + b_3X_3$. The minimum cutoff scores for
each selection variable are $X_1 = 20$, $X_2 = 15$, and $X_3 = 25$. The "All met" column indicates
whether the client's scores on each of the selection tests ($X_1$, $X_2$, and $X_3$) met or exceeded the
minimum cut score and, therefore, can be considered for final selection. "Final Rank" indicates
the final ranked position of the "surviving" candidates. The top three ranked candidates (marked
by an asterisk) will be deemed most qualified and offered the positions. (Note that Candidate H
was selected even though Candidate J had a higher $Y'$, because $X_2$ for Candidate J was below the
minimum criterion, effectively eliminating Candidate J from consideration.)



Participant    X_1    X_2    X_3    Y'       All met    Final rank
A              22     20     20     22.65    Yes        5
B              18     19     25     23.17    No         X
C              14     12     21     18.74    No         X
D              20     15     25     22.99    Yes        4
E              24     19     30     27.20    Yes        1*
F              17     16     25     22.21    No         X
G              22     19     28     25.72    Yes        2*
H              21     18     25     23.95    Yes        3*
I              11     11     12     13.85    No         X
J              23     14     28     25.00    No         X

Note: * = the top three ranked candidates.



Combining decision-making models 

Sometimes a combination of these three methods can lead to greater accuracy. For
example, strict adherence to a multiple cutoff method may at times be softened by
clinical judgment that takes into account a client's or student's extenuating
circumstances — circumstances not accounted for by the multiple cutoff method, but
nonetheless important. This happens frequently with educational decisions (e.g.,
grade retentions, exceptions to course requirements, college admission and
scholarship applications) and clinical decisions (e.g., use of the designation "Not
Otherwise Specified"). Alternatively, multiple regression and multiple cutoff
methods can be used in conjunction to select the "cream of the crop" in a two-stage
process. Stage 1 involves applying the multiple regression equation to client scores
and rank ordering the clients according to the magnitude of each client's predicted
criterion score (Y'). Stage 2 involves standard setting to determine the minimal
cutoff for each selection test score and then applying these multiple cutoff criteria
to the same scores analyzed in stage 1. This process, an example of which is provided
in Table 4.8, may eliminate from final selection some of the individuals who
benefited from the multiple regression process, which allowed high scores on one
test to compensate for low scores on another. Such a procedure is particularly
helpful when the cost of selecting an unqualified person may be prohibitive or the
risk of failure too detrimental. Of course, any decision-making method has strengths
and weaknesses, and will virtually never be foolproof. Selection of an appropriate
decision-making model must be undertaken with great care to ensure the rights and
protection of clients and students.
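
The two-stage process described above can be sketched in a few lines, using the scores and predicted criterion values printed in Table 4.8. The class and function names are illustrative. The sketch applies the stated minimums of 20, 15, and 25 literally, so Candidate A (whose third score is 20) is screened out here even though the table lists A as meeting all cutoffs; the top three candidates are the same either way.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    x1: float
    x2: float
    x3: float
    y_pred: float  # predicted criterion score Y' as printed in Table 4.8


# Minimum cutoff scores for X1, X2, and X3 as stated in the table caption.
CUTOFFS = (20, 15, 25)

CANDIDATES = [
    Candidate("A", 22, 20, 20, 22.65),
    Candidate("B", 18, 19, 25, 23.17),
    Candidate("C", 14, 12, 21, 18.74),
    Candidate("D", 20, 15, 25, 22.99),
    Candidate("E", 24, 19, 30, 27.20),
    Candidate("F", 17, 16, 25, 22.21),
    Candidate("G", 22, 19, 28, 25.72),
    Candidate("H", 21, 18, 25, 23.95),
    Candidate("I", 11, 11, 12, 13.85),
    Candidate("J", 23, 14, 28, 25.00),
]


def meets_all_cutoffs(candidate, cutoffs=CUTOFFS):
    """True when every selection score meets or exceeds its minimum."""
    scores = (candidate.x1, candidate.x2, candidate.x3)
    return all(score >= minimum for score, minimum in zip(scores, cutoffs))


def select(candidates, n=3, cutoffs=CUTOFFS):
    # Stage 1: rank everyone by predicted criterion score, highest first.
    ranked = sorted(candidates, key=lambda c: c.y_pred, reverse=True)
    # Stage 2: drop anyone who misses a minimum cutoff, then take the top n.
    survivors = [c for c in ranked if meets_all_cutoffs(c, cutoffs)]
    return survivors[:n]


selected = select(CANDIDATES)
print([c.name for c in selected])
```

Candidate J illustrates the point of the second stage: a predicted score of 25.00 would rank third on regression alone, but the second score of 14 falls below its minimum of 15, so Candidate H is selected instead.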



SUMMARY/CONCLUSION

Test validity is about whether or not (and to what degree) a test score measures what
it claims to measure. Validity is closely related to, and dependent on, test reliability.
Evidence for test score validity is determined in several ways. Content validity
considers the degree to which a test adequately represents the breadth of content of
the domain being examined. Criterion-related validity correlates scores on the
predictor variable (test score) with those on the criterion or outcome measure.
Criterion-related validity may be predictive, in which the predictor and criterion
measures are gathered at different times, or concurrent, in which both are gathered
at the same time. A prediction equation can be used to predict a person's score on the
outcome measure from the individual's score on the predictor variable. The standard
error of estimate indicates the degree of accuracy of predictions. Construct validity
uses convergent and discriminant forms of validity assessment. Convergent construct
validity is established by showing high correlations between the new test and other
established measures of the same or similar constructs. Discriminant construct
validity is evidenced by low correlations between the new test and measures of
unrelated constructs. It is important to establish a clear definition of the construct in
order to know what the test score means. Finally, the use of a given test is based on
informed judgment to be made by a competent counselor for the benefit of the client.

Decision making using a single test score is generally done through one of three
processes: setting a cutoff score, linear regression, and application of decision theory.
Decision making using multiple tests frequently makes use of clinical judgment,
multiple cutoff, or multiple regression methods. Each of these methods has strengths
and weaknesses, and each requires varying degrees of expertise and sophistication.
Most professional counselors use clinical judgment methods based on a theoretical
framework and previous experience.

KEY TERMS

clinical judgment
concurrent criterion-related validity
construct
construct validity
content-related validity
convergent validity
criterion
criterion-related validity
cutoff scores
decision theory
discriminant validity
domain
face validity
false acceptance
false rejection
intercept
linear regression
multiple cutoff method
multiple regression
negative predictive power
positive predictive power
predictive criterion-related validity
restricted range
sensitivity
slope
specificity
standard error of estimate
total predictive value
valid acceptance
valid rejection
validity




CHAPTER 5

Selecting, Administering, Scoring, and Interpreting
Assessment Instruments and Techniques

by R. Anthony Doggett, Carl J. Sheperis, Susan Eaves,
Michael D. Mong, and Bradley T. Erford

This chapter begins with issues related to proper test selection, administration,
and scoring, followed by discussion of proper interpretation of test scores from
both norm-referenced and criterion-referenced tests. A section regarding the
appropriate sources for obtaining information about assessment instruments has
been included to assist the reader in proper test selection. Finally, common errors
committed during the assessment process are discussed, along with
recommendations for addressing these issues.

TEST SELECTION



Appropriate test selection is crucial in the assessment process. Before selecting in- 
struments, the professional counselor must first determine the purpose for engaging 
in assessment activities. As discussed in Chapter 1, sometimes clinicians administer 
different tests to determine if the individual meets criteria for a particular diagnosis, 
to develop interventions or treatments for clients, to evaluate the integrity of services, 
or to evaluate the outcome of receiving treatment. In any of these cases, the
professional counselor must ensure that the instrument being used is adequate for the
stated purpose of the assessment. As such, the instrument must be normed (or
criterion-referenced) on a representative population, contain items that are appro- 
priate for evaluating the current referral concern, have adequate psychometric prop- 
erties, and provide scores that lend themselves to appropriate outcome comparisons. 
Choosing instruments that are not linked to the original purpose of the assessment, 
lack technical adequacy, or are not appropriate for the referred problem or individ- 
ual will reduce the professional counselor's ability to meet the client's needs and 
could potentially expose the client to harmful and unwarranted experiences. Table 
5.1 offers summary suggestions that professional counselors should consider when 
selecting an instrument. 



TEST ADMINISTRATION 



After determining the purpose of the assessment, the professional counselor must 
determine the best way to obtain the information needed from the client. While as- 
sessments are designed to yield meaningful information about a client, the quality of 
the information obtained is closely linked to the skills and abilities of the clinician 
administering the test. 



Administrator Requirements 



It is important to mention that each assessment instrument requires a certain level 
of training and/or education by the administrator. In other words, legally and ethi- 
cally, professional counselors can select instruments only from the category of instru- 
ments available for use according to their level of training. In clinical practice, these 
requirements are often determined by state licensure laws. In schools and agency 
work, state certifications or exemptions often exist that allow examiners to adminis- 
ter tests they would otherwise be unable to use in the private sector. For example, 
unlicensed professional counselors working in a correctional institution or for a
nonprofit agency may be allowed to administer clinical or intelligence tests as a
condition of employment. But, because they are not licensed by a state counseling
board, they may not be able to administer those same tests to the public for a fee in a
private practice. The same is often true for professional school counselors and school
psychologists. They may be able to administer the Wechsler Intelligence Scale for
Children — Fourth Edition (WISC-IV) or Woodcock-Johnson Tests of Achievement —
Third Edition (WJ-III ACH) during school hours to students as a condition of
employment, but be prohibited from administering these same tests in private practice
for a fee. While some instruments can be used with knowledge gained from the
manual, others require in-depth supervised training. To assist in the process of
delineating which instruments require which level of training, a majority of publishers
use a level system similar (albeit not identical) to that described below.

Level A 

Level A instruments can be administered, scored, and interpreted after studying the
manual, with no additional training or education required. However, employment 




Table 5.1 Guide to proper test selection 

Test information 

■ What is the name of the test? 

■ Who are the test authors? 

■ What company published the test? 

■ When was the test published? 

■ Are alternative forms of the test available? 

■ How much does the test cost? 

■ How long does it take to administer the test? 

■ Is the test manual comprehensive (i.e., includes information on psychometrics, norms, item 
development, etc.)? 

■ Does the test have current norms and items? 

■ Who is included in the standardization sample? 

Test interpretation aids 

■ Does the manual provide clear descriptions of the purposes and applications of the test? 

■ Does the manual provide clear information regarding the training and qualifications needed 
to administer the test? 

■ Does the manual include example cases to aid in interpretation of the results? 

Examinee considerations 

■ What skills are needed by the examinee to take the test? 

■ In what language are the test items written? 

■ What is the reading/vocabulary level of the test items? 

■ How are the test items presented? 

■ How is the examinee expected to respond to the test items? 

■ What adaptations can be made to the test items or test presentation to accommodate any 
examinee disabilities? 

■ Is the test free from bias? 

■ Is the test administered to individuals or groups? 

Technical adequacy 

■ What types of reliability studies have been performed on the test scores? 

■ What types of validity studies have been performed on the test scores? 

■ Are the reliability and validity estimates adequate for the intended purpose? 

Administration and scoring 

■ Are the directions for administering the test appropriate and clear? 

■ Are the directions for scoring the test appropriate and clear? 

■ What options are available for scoring the test? 

Interpretive scores and norms 

■ Are the scales used for reporting test scores adequately presented and described? 

■ Are the normative scores presented in an appropriate format (e.g., standard scores, percentile 
ranks)? 

■ Is the standardization sample appropriate and clearly described? 

■ If more than one form of the test is available, are equivalent scores on the different forms 
provided? 

■ Does the test manual provide guidance on establishing local norms? 






or affiliation with an institution or organization is sometimes required before the 
publisher will agree to send the instrument. The Self-Directed Search (Holland,
Fritzsche, & Powell, 1994) is a Level A test.

Level B 

Level B instruments require specialized knowledge of psychometric issues and test 
score properties, usually obtained by taking a graduate-level course in assessment. To 
qualify for this level's criteria, the professional counselor administering the test must 
have a master's degree in counseling, psychology, or a related field. In addition, the 
professional counselor must have specific training and/or licensure or certification 
recognized by the test publisher. The Reynolds Adolescent Depression Scale–Second
Edition (RADS-2), the WJ-III ACH, and the Slosson Intelligence Test–Revision 3 (SIT-R3) are
examples of Level B tests.

Level C 

Level C instruments require substantial knowledge about the construct being meas- 
ured and about the instrument being used. Often, a doctorate in counseling, psy- 
chology, or a related field and/or appropriate licensure or certification is required. In 
addition, the professional counselor should have specific coursework or training re- 
lated to assessment (generally) and to the instrument (specifically) or class of instru- 
ments (e.g., intelligence, personality, projectives). Test publishers commonly use the 
general levels described, although the designations sometimes vary. In addition, there 
are often exceptions and variations due to state laws or regulations that the profes- 
sional counselor should check prior to selecting an instrument. The Rorschach 
Inkblot Test, the Wechsler Adult Intelligence Scale–Third Edition (WAIS-III), and the
Minnesota Multiphasic Personality Inventory–Second Edition (MMPI-2) are examples
of Level C tests. 

Finally, it is good practice to administer, score, and interpret a test
under the supervision of a highly trained practitioner a number of times and on
volunteer participants prior to using the test for decision-making purposes with clients.
How many "practice administrations" are needed depends largely on the complexity of the
test. Practice administrations allow professional counselors to hone their skills on a
new instrument under competent supervision and in no-risk situations to enhance 
the ultimate competence of the examiner. 



Examinee Preparation 



The first step in any assessment is to prepare the test takers for the test they are about 
to take. Because many standardized tests are administered in school settings or to 
school-aged youth in agency or private settings, professional counselors who work 
with children and adolescents must be able to adapt their assessment skills and 
knowledge toward younger age groups. 

The professional counselor's job is to familiarize clients and students with the 
type of assessment they will be taking. This may seem like common sense, but many 
professional counselors fail to take into account the client's familiarity with the test 







and the testing procedure. People should be informed of the type of test (e.g., math 
achievement, career interest inventory, personality inventory), whether the test is 
timed, and what the test is designed to measure. Professional counselors should ap- 
proach the assessment positively while helping clients and students to ease and man- 
age their test anxiety. 



Environmental Concerns 



Another important aspect of preparing students for assessments is preparing and
maintaining the proper testing environment. The proper assessment environment is
one that is distraction-free, provides proper space for working, and discourages
cheating. While the ideal assessment environment can be difficult to provide, test
administrators should strive to ensure there are relatively few distractions during the
testing process. Minimizing distractions can be accomplished by not allowing
examinees to wander around the room, make unnecessary noise, or have materials
unrelated to the test with them in the testing session. Likewise, the examiner should
ensure the testing environment provides adequate lighting, a comfortable temperature,
and sufficient work space for the task at hand.



Testing Procedures



The actual process of test administration is often very straightforward. Standardized 
test administration is a very rigid and scripted process. The primary requirement for 
administering a published test is that the examiner strictly follows testing procedures 
described in the test's manual. Due to the many variations in testing procedures 
found in different published tests, it is critically important that the test administra- 
tor be familiar with the specific test procedures and materials used. Testing proce- 
dures can include, but are not limited to, test directions, time limits, and registration 
and identification procedures. 

The majority of test manuals stress the importance of the manner in which test 
directions are given. In most cases, test directions are to be read word for word fol- 
lowing a script that is laid out in the manual. Any deviation from the protocol may 
result in invalid test results. The primary function of verbatim instructions is to en- 
sure that uniform testing conditions are present for all test takers. Whenever possi- 
ble, professional counselors should memorize directions for administering and scor- 
ing test items. Even though the test manual is still referred to, memorization tends 
to help the administration flow more seamlessly, significantly reducing pauses by the 
administrator to locate a needed passage or judge the accuracy of a client response. 
Thus the professional counselor's demonstrated knowledge of and comfort with the 
test helps to establish a relational rapport and projects administrator competence and 
confidence. 

While not all tests employ time limits, time limits are frequently a vital part of 
the testing procedure. Test administrators should be familiar with the time limits for 
different items or subtests. Administrators should also carry some sort of timing 




device (e.g., stopwatch, wristwatch, clock, egg timer) with them so that they are
aware of the time limits at all times. Ending a testing session too early or too late
may result in invalid test results.

Many published tests also have specific procedures for examinee registration and 
identification, particularly high-stakes aptitude or achievement tests (e.g., SATs, 
graduate record examinations, advanced placement examinations). Sometimes ex- 
aminees must identify themselves through means such as their names or Social 
Security numbers. In an attempt to discourage cheating, some tests also require ex- 
aminees to present one or more forms of identification, both before and after a test- 
ing session. Professional counselors conducting assessments for employers, govern- 
ment program eligibility, or even community mental health services must be equally 
vigilant to ensure that client results are accurate and legitimate. 

Despite the test publisher's vigorous attempts to provide a uniform testing ex- 
perience for all examinees, there are sometimes deviations. According to the 
Responsibilities of Users of Standardized Tests (RUST-3) statement (AACE, 2003a) and 
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999), any 
deviations from the test procedure should be documented by the test administrator. 
Many test protocols contain a section in which the examiner may record and de- 
scribe problems or unusual circumstances that may occur during the testing session. 
The professional counselor should take any irregularities under consideration when 
interpreting test results. 

While deviations from standardized testing procedures are not required for the 
average examinee, test administrators should be aware of the special considerations 
given to examinees with disabilities or to very young examinees. The majority of 
published group-administered tests (particularly those administered by schools, in- 
stitutions of higher education, or licensure or other professional boards) require that 
an examinee show proof of a disability before being given special accommodations 
under the Individuals with Disabilities Education Improvement Act of 2004 
(IDEIA), the Americans With Disabilities Act of 1990 (ADA), or Section 504 of the
U.S. Rehabilitation Act of 1973. While there is no set standard on the requirement 
of proof of disability, many institutions or test publishers require that the examinee 
in question have a written report on file that documents the disability. The reports 
must come from a legitimate source, usually a licensed specialist, and must be cur- 
rent (usually less than three years, depending on the test). Common considerations 
for individuals with disabilities include extended time on tests, longer breaks, Braille 
tests, oral instructions, dictated responses, and computer-assisted technology. 

Factors Affecting Test Scores 

During the process of test administration, the test administrator should be aware of 
the many factors that can affect test scores. While the administrator should strive to 
maintain these variables at a minimum level, not all variables can be controlled. 
Table 7.1 (see Chapter 7) contains a summary of important test-related factors. A 
comprehensive treatise on these factors affecting client and student responses is
provided by Erford (2006).




TEST SCORING 

By definition, test scores are simply the numerical result of testing. Test scores sum- 
marize the information obtained through the testing process by using numbers that 
the test administrator may interpret. The use of numbers allows test administrators 
to describe and quantify examinee performance in a standardized manner. 

Assessment instruments may be scored by a wide array of people. For example, 
some instruments are designed to be self-scored (i.e., Level A tests). This type of scor- 
ing usually consists of adding columns of scores or counting the number of items re- 
sponded to. Some tests can also be scored by persons other than the client or exam- 
iner (e.g., clerical staff, interns). While having others score assessments may save the 
test administrator time on the front end, the test administrator should always 
recheck the test scores to minimize the chance of error. Under most circumstances 
in clinical practice, the professional counselor will score the protocols for Level B 
and Level C tests. As stated above, the reason for this practice is that Level
B and C tests require advanced education and training.

While most assessment instruments may be scored by hand, computer-assisted 
scoring programs are becoming increasingly common. Some tests may also provide 
templates to aid the examiner in scoring the test by hand. Despite the aid offered by 
test templates, hand scoring for many tests is tedious to even the most experienced 
examiner. Because hand scoring takes considerably more time, many
examiners prefer to use computerized scoring programs and services for the longer or 
more complicated tests. For example, for the MMPI-2, computer scoring time is vir- 
tually instantaneous after the items are entered into the scoring program (which usu- 
ally requires 5 to 10 minutes). Depending on the scoring program used, many to 
most of the MMPI-2 scales listed in Table 7.10 (see Chapter 7) can be obtained in a
matter of seconds. In contrast, using the scoring stencils may require well more than 
an hour to obtain the same set of scores. Of course, both methods have risks of in- 
accuracies due to human error. Thus, when using computerized scoring programs, it 
is essential to double-check all score entries; when using scoring forms or stencils, it 
is equally important to double-check the derived scores. 
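
The double-check described above can be made systematic by entering the item responses twice and comparing the two entries before scoring. The following Python sketch illustrates one possible approach; the function name and the sample responses are hypothetical and are not drawn from any published scoring program.

```python
def find_entry_mismatches(first_entry, second_entry):
    """Return the 1-based item numbers where two independent entries disagree."""
    if len(first_entry) != len(second_entry):
        raise ValueError("Both entries must cover the same number of items.")
    return [
        item_number
        for item_number, (first, second) in enumerate(
            zip(first_entry, second_entry), start=1
        )
        if first != second
    ]


# Hypothetical true/false responses keyed in twice from the same answer sheet.
first_entry = ["T", "F", "T", "T", "F"]
second_entry = ["T", "F", "F", "T", "F"]

mismatches = find_entry_mismatches(first_entry, second_entry)
print(mismatches)  # [3]
```

Any item number returned should be checked against the examinee's original answer sheet before scores are computed.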

Wise and Plake (1990) conducted a study in which computer scoring was com- 
pared to hand scoring. The researchers concluded that computer scoring is more ac- 
curate, faster, and more thorough than hand scoring. An added advantage of com- 
puter scoring is the fact that computers are completely unbiased. Unless modified 
by the examiner, computers will not discriminate against examinees on the basis of 
individual differences such as sex, religion, race, sexual preference, or socioeconomic 
status. Computers can also aid examiners in complex test interpretations that can 
take human interpreters days. Of course, this does not mean that the interpretations 
derived by the computer are more accurate than those of a skilled clinician. 

While computerized scoring procedures are a useful aid to clinicians, they are 
not infallible due to their reliance on human programmers. In an attempt to mini- 
mize computer scoring errors, the Standards for Educational and Psychological Testing 
(AERA et al., 1999) requires test scoring services to provide documentation of their
programming procedures. 






Despite the increasing availability of computer scoring programs, some types of 
tests require human interpretation. For example, projective personality tests usually 
require a professional counselor to make interpretations that computers are unable
to perform, although recent efforts have resulted in attempts to standardize scoring
and interpretations of some techniques (Exner, 2002; McArthur & Roberts, 2005). 
Professional judgment may also be required for some individually administered in- 
telligence, aptitude, achievement, personality, and clinical tests. 

It is always important for the professional counselor to remember that test scores 
serve a wide variety of functions in a variety of different settings. School personnel 
can use test scores to determine student placement. Teachers use test scores to ana- 
lyze their lesson plans and teaching methods. Professional counselors can use test 
scores to communicate examinee performance to clients, parents, or other stakehold- 
ers. The common link among all the above examples is that professionals use test 
scores to guide them in their decision-making responsibilities. 



Professional Standards in Testing 



Although each test publisher generally includes a set of minimum standards for the 
examiner to follow, several professional organizations provide additional ethical 
guidelines or standards for proper test administration and scoring. For example, the 
American Counseling Association's Code of Ethics (ACA, 2005a), the RUST-3 state- 
ment (AACE, 2003a), the Standards for Educational and Psychological Testing (AERA 
et al., 1999), and the National Board for Certified Counselors' Code of Ethics (1989)
all encourage professional counselors administering tests to use appropriate proce- 
dures, techniques, and strategies related to the consideration of individual differences 
in sex, gender, ethnicity, and socioeconomic status of the examinee. Table 5.2 is of- 
fered as an amalgamated guide for the proper administration and scoring of tests. 

Table 5.2 Summary guidelines for administering and scoring tests 

Examiner preparation 

1. Administer only tests for which you have been thoroughly trained. 

2. Read and learn all instructions. 

3. Adhere to standardization procedures. 

a. Cite instructions to examinees exactly as the test manual prescribes. 

b. Present test items according to prescribed time limits. 

c. Follow scoring guidelines rigidly. 

d. Document any deviations from standardized procedures or testing irregularities. 

4. Administer the test in an objective manner. 

a. Reinforce participation but give no indication of accuracy or inaccuracy of examinee's 
responses (e.g., "You're doing fine. Keep trying your best."). 

b. Remember that you are testing, not teaching. Pay close attention to verbal (e.g., 
intonation of voice) and nonverbal cues (e.g., eye glances, head nods). 

5. Administer the test in a natural manner. 

a. Achieve rapport with the examinee before administering any test items.

b. Use standardized wording in a normal and nonthreatening manner.

6. Prepare the testing environment by removing distractions and avoiding clutter.

a. Have the examinee face away from doors, windows, or other areas that may distract
attention from the test.




b. Have the examinee complete the test in a quiet area. 

c. When possible, avoid testing the examinee when he or she presents as hurried, worried, 
or ill (unless these are the conditions that prompted the evaluation or the client's 
normal state). 

7. Provide optimum testing conditions. 

a. Provide the examinee with comfortable seating and make sure he or she can see the test 
materials clearly. 

b. Provide a well-lit room with a comfortable temperature. 

c. Provide instructions in a clear, audible voice at a moderate rate of speed. 

d. Help the examinee maintain interest through enthusiastic presentation of the items and 
attention for effort. 

e. Provide social attention and encouragement for general performance, not for specific 
items. 

f. For maximum performance tests, let the examinee know that you want to see how well 
he or she can do on this test administration. 

Test administration 

1. Administer the test in an efficient manner. Have an efficient system for

a. Recording answers. 

b. Viewing the manual without distracting the examinee. 

c. Bringing out test materials and storing them away after use. 

d. Avoiding delays. 

2. Make smooth transitions from (sub)test to (sub)test. 

3. Know test administration guidelines and test materials well enough to avoid overextending 
the test experience for the examinee. 

a. Always begin at designated starting points. 

b. Score each item correctly and efficiently. 

4. Learn how to appropriately handle distractions from the examinee. 

a. Avoid attending to inappropriate remarks. 

b. Ignore inappropriate movements if they are not distracting to the examinee's test 
performance. 

c. Redirect the examinee to the task at hand if remarks or movements become too 
distracting. 

Scoring the test items 

1. Know the scoring standards well, so you thoroughly understand the intent behind each 
item. 

2. Remember that scoring standards provide guidelines for scoring items. When in doubt, 
score examinee answers in relation to the intent behind the item. 

3. Review the guidelines in the manual to verify any unclear answers provided by the 
examinee. 

4. Check and recheck every step of the scoring procedure. 

5. Check and recheck all figures and calculations. 

Test storage and care of materials 

1. Place all examinee protocols and other information in client folders in a proper storage (i.e., 
locked) cabinet to protect the confidentiality of the responses and personal information. 

2. Store all materials in a safe, secure place to prevent unwarranted wear and exposure to 
untrained personnel. 

3. Replace any materials that are worn so that these materials do not become distracting to the 
examinee. 

4. Point to pictures with a finger or eraser of the pencil to avoid placing marks on the page. 

5. Replace any materials that are lost or damaged with objects identical to the original from the 
testing company. 






NORM-REFERENCED INTERPRETATION 



Tests are usually administered to assess important domains in the examinee's life. For 
example, intelligence tests evaluate cognitive functioning; achievement tests evaluate 
academic functioning; adaptive behavior measures evaluate important daily living 
skills (e.g., communication, motor skills, social functioning); career inventories 
measure interests, skills, and values; and clinical or personality measures evaluate 
inter- and intrapersonal functioning. When these large domains of functioning are 
assessed, the examinee's raw score is usually transformed and then compared to the 
performance of other individuals with similar characteristics (e.g., age, gender, eth- 
nicity). For a norm-referenced test, this population of individuals is referred to as the 
standardization sample, normative sample, or the norm group. The comparison 
scores are called derived scores and are placed into two groups: developmental scores 
and scores of relative standing (Salvia & Ysseldyke, 2004). 



Developmental Equivalents 



One type of transformed or derived score is called a developmental equivalent. The 
two most common types of developmental equivalents are age equivalents and grade 
equivalents. Both of these equivalent scores are obtained by determining the average 
score obtained on a test by different groups of examinees who vary in age or grade 
placement. Specifically, an age equivalent means that the examinee's raw score is the 
average (mean or median) performance for a particular age group. For example, if 
the average raw score for 11-year-old children (11 years, 0 months) on a particular
test is 15 items correct out of a 30-item test, then any examinee obtaining a score of
15 would receive an age-equivalent score of 11-0 (11 years, 0 months). Therefore,
the age-equivalent score is obtained by computing the mean or median raw score on
a test for a group of children of a specific age. It is also important to note that
age-equivalent scores are expressed in years and months with a hyphen between the year
and the month (e.g., 11 years, 2 months is expressed as 11-2).

A grade-equivalent score is obtained by computing the average (mean or me- 
dian) raw score on a test obtained by examinees in a specified grade. For example, if 
the average score of 6th-graders on a mathematics test is 25, then any examinee ob- 
taining a score of 25 is reported to have math knowledge at the 6th-grade level. 
Grade-equivalent scores are expressed in grades and tenths with a decimal between
the two numbers (e.g., 6.5 refers to the average performance of children at the
middle of the 6th grade; 2.1 refers to the average performance of children during the first
month of the 2nd grade).
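
The two notations just described can be illustrated with a brief Python sketch. The function names are hypothetical, and the sketch assumes that the tenths in a grade equivalent correspond to months of a ten-month school year, as the examples in the text imply.

```python
def format_age_equivalent(total_months):
    """Express an age given in whole months as years-months (e.g., 134 -> '11-2')."""
    years, months = divmod(total_months, 12)
    return f"{years}-{months}"


def format_grade_equivalent(grade, school_month):
    """Express a grade and school month (0 to 9) as grade.tenths (e.g., 6, 5 -> '6.5')."""
    if not 0 <= school_month <= 9:
        raise ValueError("The school month must be between 0 and 9.")
    return f"{grade}.{school_month}"


print(format_age_equivalent(132))     # 11-0
print(format_age_equivalent(134))     # 11-2
print(format_grade_equivalent(6, 5))  # 6.5
print(format_grade_equivalent(2, 1))  # 2.1
```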

Salvia and Ysseldyke (2004, pp. 92-93) appropriately pointed out five concerns 
when using age- and grade-equivalent scores. These are: 

1. Systematic misinterpretation. Examinees who earn an age-equivalent score of 11-0
have answered as many questions correctly as the average for examinees who
are 11 years of age. Obtaining this score does not mean that the examinee
performed on the test in the same manner that an 11-year-old student would have
performed. In a similar fashion, a 2nd-grader and a 6th-grader may have both




earned a grade equivalent of 3.0; however, it is very probable that they did not 
attack the items on the test in the same manner. Developmentally, their thought 
processes may be quite different. In other words, it is essential to communicate 
to clients, teachers, and parents that just because a 4th-grader receives a grade 
equivalent (GE) of 8.5, this does not mean the student is as "smart" as an 
8th-grader. 

2. Interpolation and extrapolation. It is important to remember that average age- and 
grade-equivalent scores are only estimates of functioning and represent groups 
of examinees that were not actually tested. Loosely defined, interpolation means 
guessing within the bounds of what is known. Thus, if one knows that a raw 
score (RS) of 25 yields a grade equivalent of 2.5 (GE = 2.5) and a raw score of 
35 (RS = 35) yields a grade equivalent of 3.5 (GE = 3.5), it is reasonable to con- 
clude that each raw score point between 25 and 35 raises the grade equivalent by 
0.1. Thus, a RS of 27 would be a GE = 2.7, and a RS = 33 would be a GE = 3.3.
Whether this has been demonstrated empirically or not, such interpolations 
make sense because some empirical results do exist upon which to base a conclu- 
sion. Interpolation, while often somewhat inaccurate, is quite benign in compar- 
ison to extrapolation. Extrapolation involves guessing outside the bounds of what 
is known. Following with the example above, what grade equivalents might one 
assign to raw scores of less than 25, particularly if no one younger than a grade 
level of mid-2nd grade actually made up the norm group? Extrapolation provides 
these estimations. A test developer may extrapolate that the linear relationship 
noted between GEs of 2.5 and 3.5 continues in the downward direction. Thus 
the author assumes that a RS = 15 would yield a GE = 1.5, and a RS = 19 would
yield a GE = 1.9, etc. Of course, such guesswork without the benefit of empiri- 
cal support is shoddy at best, dangerous at worst. This is just one reason why de- 
velopmental equivalents should be avoided. 

3. Typological thinking. Examinees are always being compared to an average that 
does not actually exist. For example, the average American family may be re- 
ported to have 1.7 cars, with a 2.5-bedroom house, and 2.4 children. However, 
it is simply impossible to have 0.4 of a child. Therefore, the average score simply 
represents a statistical abstraction. 

4. False standards of performance. Students are expected to perform at their age and 
grade levels. Eleven-year-olds are expected to perform at the 11-0 level on a test,
and 6th-graders are expected to perform at the 6.0 level. However, equivalent 
scores are constructed in such a manner that at least half (50%) of any age group 
or grade group will perform at or below the age or grade level, because half of 
the group always earns scores at or below the median. This means that a princi- 
pal who insists that all 2nd-graders complete the year reading at a GE = 2.9 or 
higher is being statistically naive. The professional counselor should explain that 
in the average classroom, only 50% of 2nd-graders can be expected to be at 
GE = 2.9 or higher. 

5. Scales are ordinal, not equal-interval. The scales often used to obtain age and 
grade equivalents are ordinal; therefore, the intervals are not equal. As a result, 
the scores on these scales cannot be added, subtracted, or multiplied. Thus 






school systems that determine student eligibility for remedial services by requir- 
ing the student's reading or math achievement to be "two grade levels below cur- 
rent grade placement" are being statistically inappropriate. A two-grade-level dif- 
ference yields very different results at different grade levels. 
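
The difference between interpolation and extrapolation described in the second concern above can be sketched in a few lines of Python. The example assumes, as the text does, a linear relationship anchored at RS = 25 (GE = 2.5) and RS = 35 (GE = 3.5). The function name and the labels it returns are illustrative assumptions, not a procedure used by any particular test publisher.

```python
def estimate_grade_equivalent(raw_score, low_anchor=(25, 2.5), high_anchor=(35, 3.5)):
    """Estimate a grade equivalent from a straight line through two known points.

    Each anchor is a (raw score, grade equivalent) pair observed in the norm group.
    Returns the estimate and whether it was interpolated or extrapolated.
    """
    (rs_low, ge_low), (rs_high, ge_high) = low_anchor, high_anchor
    slope = (ge_high - ge_low) / (rs_high - rs_low)  # GE gained per raw score point
    estimate = round(ge_low + slope * (raw_score - rs_low), 1)
    # Estimates inside the tested range rest on some data; those outside rest on none.
    method = "interpolated" if rs_low <= raw_score <= rs_high else "extrapolated"
    return estimate, method


for rs in (27, 33, 15, 19):
    print(rs, estimate_grade_equivalent(rs))
```

The arithmetic is identical in both cases. What differs is whether any examinees in the norm group actually earned scores in that range, which is why the extrapolated values deserve far less trust.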

It is essential to note that developmental equivalents are frequently misunder- 
stood, miscommunicated, and misused. While professional counselors should be 
aware of the existence of developmental equivalents and be prepared to explain 
them, professional counselors should avoid using developmental equivalents when
explaining client or student scores.



Scores of Relative Standing 



Unlike developmental scores, scores of relative standing have equal units of meas- 
urement. As such, scores on the same test for several different examinees of different 
ages can be compared. Additionally, different scores on several different instruments 
can be compared for the same person. The major types of scores of relative standing 
used in norm-referenced measurement include standard scores and percentile ranks. 
Figure 5.1 demonstrates the relationship between these scores. 

Standard scores 

Standard scores are raw scores that have been mathematically transformed to have a 
designated mean and standard deviation. A standard score expresses how far an
examinee's score lies from the mean of the norm group, in standard deviation units. Five
commonly used standard-score distributions include: z-scores, T scores, deviation IQs,
normal-curve equivalents, and stanines.

Z-scores 

A z-score has a mean of 0 and a standard deviation of 1. As such, a z-score simply
indicates how many standard deviations above or below the mean a given score falls.
A z-score is obtained by subtracting the mean of the norm group ($M_X$) from the
examinee's raw score ($X$) and then dividing by the standard deviation ($SD_X$) of the
norm group: $z = \frac{X - M_X}{SD_X}$. Almost all z-scores (99.7%) lie between -3.0 and +3.0.
If an examinee obtains a z-score of 2.0, the examinee has performed 2.0 standard
deviations above the mean of the group. A z-score of -1.5 is 1.5 standard deviations
below the mean of the group. A z-score of 0 is at the mean performance of the group.
Z-scores are commonly used in empirical research studies.
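
The z-score computation can be sketched in Python as follows. The norm-group mean of 20 and standard deviation of 4 are hypothetical values chosen so that the results match the z-scores discussed above. The final lines use the standard library's normal distribution to check the 99.7% figure, which holds only when scores are normally distributed.

```python
from statistics import NormalDist


def z_score(raw_score, norm_mean, norm_sd):
    """Return how many standard deviations a raw score lies from the norm-group mean."""
    if norm_sd <= 0:
        raise ValueError("The norm-group standard deviation must be positive.")
    return (raw_score - norm_mean) / norm_sd


NORM_MEAN, NORM_SD = 20, 4  # hypothetical norm-group statistics

print(z_score(28, NORM_MEAN, NORM_SD))  # 2.0
print(z_score(14, NORM_MEAN, NORM_SD))  # -1.5
print(z_score(20, NORM_MEAN, NORM_SD))  # 0.0

# Proportion of a normal distribution lying between z = -3.0 and z = +3.0.
standard_normal = NormalDist()
share_within_3_sd = standard_normal.cdf(3.0) - standard_normal.cdf(-3.0)
print(round(share_within_3_sd, 3))  # 0.997
```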

T scores 

In order to remove the - and + signs, z-scores are often transformed into other
scores, such as T scores. A T score has a mean of 50 and a standard deviation of
10. Many test manuals transform raw scores directly into T scores, but a z-score
can be transformed into a T score using the following formula: T = 10(z) + 50.
Using the examples above, a z-score of 2.0 would be transformed into a T score of
70 (T = 10(2.0) + 50 = 70). A z-score of -1.5 would be transformed into a T score
of 35 (T = 10(-1.5) + 50 = 35). A z-score of 0 would be transformed into a T score
of 50 (T = 10(0) + 50 = 50). T scores are commonly reported in behavioral,
personality, and clinical inventories.

[Figure 5.1 The normal curve and related standardized scores. Horizontal axis: score on Wechsler Adult Intelligence Scale (55, 70, 85, 100, 115, 130, 145); vertical axis: number of scores. Percentage of scores in each band, from left to right: 0.1%, 2%, 14%, 34%, 34%, 14%, 2%, 0.1%; 68% of scores fall between 85 and 115, and 96% fall between 70 and 130.]


Deviation IQs 

Deviation IQs have a mean of 100 and a standard deviation of 15 or 16, depending
on the instrument used (nearly all currently use SD = 15). All of the Wechsler Scales
have a standard deviation of 15; however, the Slosson Intelligence Test–Revised
(SIT-R3) (Nicholson & Hibpshman, 1990) uses a standard deviation of 16. While most
test manuals transform raw scores directly into deviation IQs (M = 100, SD = 15), a
z-score can be transformed into a deviation IQ score using the following formula:
Dev. IQ = 15(z) + 100. Therefore, an examinee with a z-score of 2.0 would have a
deviation IQ of 130 (Dev. IQ = 15(2.0) + 100 = 130). A z-score of -1.5 would be
transformed into a deviation IQ of 78 (Dev. IQ = 15(-1.5) + 100 = 77.5, rounded to 78).
A z-score of 0 would be transformed into a deviation IQ of 100 (Dev. IQ = 15(0) + 100 = 100).
It is important to note that the formula would change if the instrument has a standard
deviation of 16. For example, a z-score of 2.0 would be transformed into a deviation IQ
of 132 (Dev. IQ = 16(2.0) + 100 = 132). Deviation IQ scores are frequently reported for tests
of intelligence, achievement, and perceptual skills.

Normal-curve equivalents 

Normal-curve equivalents (NCEs) are standard scores with a mean of 50 and a standard
deviation of 21.06. The standard deviation is set at 21.06 because this transformation
divides the normal curve into 100 equal units or intervals, with NCEs of 1, 50, and 99
coinciding with percentile ranks of 1, 50, and 99.
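
T scores, deviation IQs, and NCEs are all linear transformations of the z-score: multiply by the new standard deviation and add the new mean. The sketch below collects the three conversions in one place. The function names are illustrative assumptions, and published tables may round or derive values somewhat differently.

```python
def from_z(z: float, mean: float, sd: float) -> float:
    """Linear transformation of a z-score onto a scale with the given mean and SD."""
    return sd * z + mean


def t_score(z: float) -> float:
    return from_z(z, mean=50, sd=10)


def deviation_iq(z: float, sd: float = 15) -> float:
    # sd=15 for most instruments; pass sd=16 for instruments that use 16.
    return from_z(z, mean=100, sd=sd)


def nce(z: float) -> float:
    return from_z(z, mean=50, sd=21.06)


for z in (2.0, -1.5, 0.0):
    print(z, t_score(z), deviation_iq(z), deviation_iq(z, sd=16), round(nce(z), 1))
```

Note that deviation_iq(-1.5) returns 77.5, which is reported as 78 once it is rounded to a whole number.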






Stanines 

The term stanine is shortened from "standard nines." Stanines are standard-score
bands that divide a distribution into nine parts with a mean of 5 and a standard de- 
viation of 2. These scores are expressed as whole numbers from 1 to 9. When scores 
are converted to stanines, the shape of the original distribution changes into a nor- 
mal curve. Stanines are frequently provided by publishers of large-scale testing pro- 
grams. Their use should be limited, and caution in interpretation is warranted be- 
cause educators and parents often express concern that a client's score has dropped 
from, say, the fifth to the fourth stanine. In actuality, this "drop" could be a differ- 
ence of a single raw score point. 



Percentile Ranks 



Percentile ranks, also referred to as percentiles, are derived scores indicating the per- 
centage of individuals whose scores fall at or below a given raw score. It is impor- 
tant to note that the terms percentile rank and percentage correct are not the same. 
For example, a percentage score of 50 means 50% of the items were correct (a pro- 
portion of correct to total points), while an examinee who obtains a percentile rank 
of 50 on a standardized test has scored the same or better than 50% of the exam- 
inees in the norm group. Percentiles allow comparison of a client's score with other 
scores. Percentages only allow comparisons with some standards. Although per- 
centile ranks are fairly easy to understand, their psychometric properties limit their 
usefulness. Still, percentile ranks are essential staples in test interpretation because 
of their ease of understanding. Unlike z-scores or T scores, percentile ranks are not
evenly distributed across the normal curve. In fact, raw score differences between 
percentile ranks are smaller near the mean of the distribution and larger at the ex- 
tremes of the distribution. 

It is also essential to understand that small differences in a client's raw score 
around the mean can lead to large changes in percentile rank. It is often helpful to 
explain percentile ranks using a visualization of a line of 100 individuals of the same 
age (or grade), with the 1st individual in the line being the lowest performer (e.g., 
poorest math student, least depressed, least hyperactive) and the 100th person in the 
line being the highest performer (e.g., best math student, most depressed, most hy- 
peractive). Thus an individual scoring at the 95th percentile rank exceeded the per- 
formance of 95% of same-aged peers. A person scoring at the 5th percentile rank 
outperformed only 5% of same-aged peers. Importantly, because the normal curve 
theoretically runs in each direction to infinity, it is theoretically impossible to achieve 
the percentile rank end points of 0 or 100.
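
The uneven spacing of percentile ranks can be illustrated with Python's standard library, assuming scores are normally distributed. The sketch compares how far apart, in standard deviation units, two 5-point percentile changes are: one near the mean and one in the tail. The helper names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def percentile_rank(z: float) -> float:
    """Percentage of a normal distribution falling at or below z."""
    return 100 * STANDARD_NORMAL.cdf(z)


def z_for_percentile(pr: float) -> float:
    """z-score at a given percentile rank (0 < pr < 100)."""
    return STANDARD_NORMAL.inv_cdf(pr / 100)


# Distance in SD units covered by a 5-point change in percentile rank.
near_mean = z_for_percentile(55) - z_for_percentile(50)
in_tail = z_for_percentile(99) - z_for_percentile(94)
print(round(near_mean, 2), round(in_tail, 2))
```

The same 5-point change corresponds to a much larger raw-score difference in the tail than near the mean, which is why small raw-score changes around the mean can move a percentile rank considerably.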



Think About It 5.1 How would you interpret a percentile rank score of
84 to a client being assessed for depression? Be sure to include a good
explanation of what percentile ranks are.



Table 5.3 SEM at a given age level

Age (yr-mth)    Reliability   68% LOC (± 1 SEM)   95% LOC (± 2 SEM)   99% LOC (± 2.58 SEM)
12-0 to 12-11   0.80          ±6.7                ±13.4               ±17.3
13-0 to 13-11   0.83          ±6.2                ±12.4               ±16.0
14-0 to 14-11   0.86          ±5.6                ±11.2               ±14.5
15-0 to 15-11   0.90          ±4.7                ±9.5                ±12.2
16-0 to 16-11   0.93          ±4.0                ±7.9                ±10.2
17-0 to 17-11   0.96          ±3.0                ±6.0                ±7.7

Note: Ages presented in years and months. Confidence intervals are reported in standard scores (M = 100; SD = 15).



Quartiles 

Percentile ranks that divide a distribution into four equal parts are called quartiles.
With quartiles, each part contains 25% of the norm group. The first quartile (Q1)
contains percentile ranks of 25 and below; Q2 contains percentile ranks of 26-50; Q3
contains percentile ranks of 51-75; and Q4 contains percentile ranks above 75.



Applying Standard Error of Measurement (SEM) to Test Scores 



The score that a client or student obtains on a given test is called the observed score. 
Recall from the discussion of reliability and standard error of measurement (SEM)
in Chapter 3 that all test scores have some measurement error, and this score error 
can be expressed using a band of confidence around the observed score to indicate 
the likely presence of the true score (i.e., the client's actual score if no measurement 
error was present). This confidence band reflects the test's standard error of measure- 
ment, which is influenced by a test's reliability (see Chapter 3 for an explanation of 
how SEM is computed). SEM is essential to test score interpretation because it is 
misleading to report a score as if it is "the truth, the whole truth, and nothing but 
the truth." Realistically, the score a client receives on a test may vary up or down on 
readministration of that test — and this is normal. The more reliable a test score is, 
the less variability will be expected upon retest; conversely, the lower the reliability, 
the greater the variability. 
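
The link between reliability and SEM can be made concrete with the standard formula SEM = SD x sqrt(1 - reliability). The sketch below assumes this is the formula presented in Chapter 3 (that chapter is not reproduced here) and applies it to the reliabilities listed in Table 5.3 with SD = 15. The function name and the LEVELS dictionary are illustrative.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), the standard classical-test-theory formula."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# SEM multipliers for the three levels of confidence used in Table 5.3.
LEVELS = {"68%": 1.0, "95%": 2.0, "99%": 2.58}

for r in (0.80, 0.83, 0.86, 0.90, 0.93, 0.96):
    sem = standard_error_of_measurement(15, r)
    bands = {level: round(mult * sem, 1) for level, mult in LEVELS.items()}
    print(r, bands)
```

As reliability rises from 0.80 to 0.96, the 68% band shrinks from about ±6.7 to ±3.0 standard-score points, which matches the pattern shown in Table 5.3.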

Most test manuals and computer scoring programs provide SEM for standard 
scores (SS) obtained by students and clients. Sometimes this information is included 
in a table that indicates the SEM at a given age level and for a certain level of confi- 
dence, such as provided in Table 5.3. In these cases, the confidence interval (CI) is 
computed as CI = SS ± SEM. For example, if the observed score is a standard score 
of 105 and the SEM equals 5 standard-score points, then the confidence interval is 
105 ± 5 or a range of 100-110. However, an important consideration in determining
confidence intervals is the level of confidence to display. Recall from Chapter 3
that ± 1 SEM is the 68% level of confidence, ± 2 SEM is the 95% level of confidence,
and ± 2.58 SEM is the 99% level of confidence. Under normal circumstances,
professional counselors should interpret scores at the 95% level of confidence (± 2
SEM), meaning the client's true score will probably lie within the given range 95
times out of 100 (alternate-form administrations of the test).

Table 5.4 presents an example of several observed scores with ranges of standard
scores determined at the 95% level of confidence. Note that these scores have also
been converted into percentile ranks and interpretive ranges.

Table 5.4 Observed scores with ranges of standard scores

Test                      Standard score; range   Percentile rank; range   Interpretive range
WISC-IV IQ                111; 101-121            77; 53-92                Average to Superior
WJ-III Math Calculation   92; 82-102              29; 12-55                Low Average to Average
WJ-III Applied Problems   77; 67-87               6; 1-19                  Deficient to Low Average
VMI-4                     95; 85-105              37; 16-63                Low Average to Average

Note: For the purpose of this example, it is assumed that 1 SEM = 5 standard score points for all four measures. Note that this is not usually the
case. All scores are interpreted at the 95% level of confidence (i.e., ±10 standard score points).

Notice that the observed WISC-IV IQ score is 111. Interpreting this score at the
95% LOC (level of confidence) with 1 SEM equal to 5 standard score points means
the range of scores surrounding the score is 111 ± 10, or 101-121 (i.e., if 1 SEM = 5
SS points, then 2 SEM = (2 x 5) = 10 SS points; thus 111 - 10 = 101 and 111 + 10 =
121). Next, these standard scores should be converted to percentile ranks to make
them easier to explain to clients, students, parents, teachers, or other stakeholders.
This can be easily accomplished by using Table 5.5. In this case, a deviation IQ of 111
converts to a percentile rank of 77. Also, a SS of 101 is a percentile rank of 53, and a
SS of 121 is a percentile rank of 92. Finally, the standard score range of 101-121 is
converted to the appropriate interpretive ranges (i.e., brief verbal descriptors), which
can also be found in Table 5.5. In this case, a SS of 101 is in the Average range, and a
SS of 121 is in the Superior range. Thus the interpretive range is Average to Superior.
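
The WISC-IV example above can be traced step by step in code. This is a minimal sketch that assumes normally distributed standard scores (M = 100, SD = 15) and computes percentile ranks from the normal curve instead of looking them up in a conversion table. The function names are hypothetical, and a test manual's own tables should take precedence in practice.

```python
from statistics import NormalDist


def confidence_band(observed: float, sem: float, multiplier: float = 2.0):
    """Return (low, high) around an observed score; multiplier 2.0 is the 95% level."""
    margin = multiplier * sem
    return observed - margin, observed + margin


def percentile_rank(standard_score: float, mean: float = 100.0, sd: float = 15.0) -> int:
    """Percentile rank of a standard score under a normal curve, rounded to a whole number."""
    return round(100 * NormalDist(mean, sd).cdf(standard_score))


low, high = confidence_band(111, sem=5)
print(low, high)  # 101.0 121.0
print(percentile_rank(111), percentile_rank(low), percentile_rank(high))  # 77 53 92
```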

Table 5.5 Score conversion table

IQ    Percentile rank  Scaled score  Stanine  z-score  T score  NCE  Interpretive range
155   99.99            19            9        +3.67    87       99   Very Superior
154   99.98            19            9        +3.60    86       99   Very Superior
153   99.98            19            9        +3.53    85       99   Very Superior
152   99.97            19            9        +3.47    85       99   Very Superior
151   99.97            19            9        +3.40    84       99   Very Superior
150   99.96            19            9        +3.33    83       99   Very Superior
149   99.95            19            9        +3.27    83       99   Very Superior
148   99.93            19            9        +3.20    82       99   Very Superior
147   99.91            19            9        +3.13    81       99   Very Superior
146   99.89            19            9        +3.07    81       99   Very Superior
145   99.87            19            9        +3.00    80       99   Very Superior
144   99.83            19            9        +2.93    79       99   Very Superior
143   99.79            19            9        +2.87    79       99   Very Superior
142   99.74            18            9        +2.80    78       99   Very Superior
141   99.69            18            9        +2.73    77       99   Very Superior
140   99.62            18            9        +2.67    77       99   Very Superior
139   99.53            18            9        +2.60    76       99   Very Superior
138   99               17            9        +2.53    75       99   Very Superior
137   99               17            9        +2.47    75       99   Very Superior
136   99               17            9        +2.40    74       99   Very Superior
135   99               17            9        +2.33    73       99   Very Superior
134   99               17            9        +2.27    73       99   Very Superior
133   99               17            9        +2.20    72       99   Very Superior
132   98               16            9        +2.13    71       93   Very Superior
131   98               16            9        +2.07    71       93   Very Superior
130   98               16            9        +2.00    70       93   Very Superior
129   97               16            9        +1.93    69       90   Superior
128   97               16            9        +1.87    69       90   Superior
127   96               15            9        +1.80    68       87   Superior
126   96               15            9        +1.73    67       87   Superior
125   95               15            8        +1.67    67       85   Superior
124   95               15            8        +1.60    66       85   Superior
123   94               15            8        +1.53    65       83   Superior
122   93               14            8        +1.47    65       81   Superior
121   92               14            8        +1.40    64       80   Superior
120   91               14            8        +1.33    63       78   Superior
119   90               14            8        +1.27    63       77   High Average
118   88               14            8        +1.20    62       75   High Average
117   87               13            7        +1.13    61       74   High Average
116   86               13            7        +1.07    61       73   High Average
115   84               13            7        +1.00    60       71   High Average
114   82               13            7        +0.93    59       69   High Average
113   81               13            7        +0.87    59       68   High Average
112   79               12            7        +0.80    58       67   High Average
111   77               12            7        +0.73    57       66   High Average
110   75               12            6        +0.67    57       64   High Average
109   73               12            --       +0.60    56       63   Average
108   70               12            --       +0.53    55       61   Average
107   68               11            --       +0.47    55       60   Average
106   66               11            --       +0.40    54       59   Average
105   63               11            --       +0.33    53       57   Average
104   61               11            --       +0.27    53       56   Average
103   58               11            --       +0.20    52       54   Average
102   55               10            --       +0.13    51       53   Average
101   53               10            --       +0.07    51       52   Average
100   50               10            --       0.00     50       50   Average
99    47               10            --       -0.07    49       48   Average
98    45               10            --       -0.13    49       47   Average
97    42               9             --       -0.20    48       46   Average
96    39               9             --       -0.27    47       44   Average
95    37               9             --       -0.33    47       43   Average
94    34               9             --       -0.40    46       41   Average
93    32               9             --       -0.47    45       40   Average
92    30               8             --       -0.53    45       39   Average
91    27               8             --       -0.60    44       37   Average
90    25               8             --       -0.67    43       36   Average
89    23               8             --       -0.73    43       34   Low Average
88    21               8             --       -0.80    42       33   Low Average
87    19               7             --       -0.87    41       32   Low Average
86    18               7             --       -0.93    41       31   Low Average
85    16               7             --       -1.00    40       29   Low Average
84    14               7             --       -1.07    39       27   Low Average
83    13               7             --       -1.13    39       26   Low Average
82    12               6             --       -1.20    38       25   Low Average
81    10               6             --       -1.27    37       23   Low Average
80    9                6             --       -1.33    37       22   Low Average
79    8                6             --       -1.40    36       20   Borderline
78    7                6             --       -1.47    35       19   Borderline
77    6                5             --       -1.53    35       17   Borderline
76    5                5             --       -1.60    34       15   Borderline
75    5                5             --       -1.67    33       15   Borderline
74    4                5             --       -1.73    33       13   Borderline
73    4                5             --       -1.80    32       13   Borderline
72    3                4             --       -1.87    31       10   Borderline
71    3                4             --       -1.93    31       10   Borderline
70    2                4             --       -2.00    30       7    Borderline
69    2                4             --       -2.07    29       7    Very Deficient
68    2                4             --       -2.13    29       7    Very Deficient
67    --               3             --       -2.20    28       --   Very Deficient
66    --               3             --       -2.27    27       --   Very Deficient
65    --               3             --       -2.33    27       --   Very Deficient
64    --               3             --       -2.40    26       --   Very Deficient
63    --               3             --       -2.47    25       --   Very Deficient
62    --               2             --       -2.53    25       --   Very Deficient
61    0.47             2             --       -2.60    24       --   Very Deficient
60    0.38             2             --       -2.67    23       --   Very Deficient
59    0.31             2             --       -2.73    23       --   Very Deficient
58    0.26             2             --       -2.80    22       --   Very Deficient
57    0.21             --            --       -2.87    21       --   Very Deficient
56    0.17             --            --       -2.93    21       --   Very Deficient
55    0.13             --            --       -3.00    20       --   Very Deficient
54    0.11             --            --       -3.07    19       --   Very Deficient
53    0.09             --            --       -3.13    19       --   Very Deficient
52    0.07             --            --       -3.20    18       --   Very Deficient
51    0.05             --            --       -3.27    17       --   Very Deficient
50    0.04             --            --       -3.33    17       --   Very Deficient
49    0.03             --            --       -3.40    16       --   Very Deficient
48    0.03             --            --       -3.47    15       --   Very Deficient
47    0.02             --            --       -3.53    15       --   Very Deficient
46    0.02             --            --       -3.60    14       --   Very Deficient
45    0.01             --            --       -3.67    13       --   Very Deficient

Note: IQ means deviation IQ, or standard score (SS) (M = 100; SD = 15); %ile rank is a Percentile Rank (P); scaled score means (M = 10; SD =
3); stanine means (M = 5; SD = 2); z-score means (M = 0; SD = 1); T score means (M = 50; SD = 10); NCE means normal-curve equivalent (M =
50; SD = 21.06). A double hyphen (--) marks a value that is not legible in the scanned source.

A professional counselor's interpretation of the scores in Table 5.4 when presenting
them to clients, teachers, parents, guardians, or other stakeholders might go like
this:

Juan's performance on the WISC-IV exceeded that of 77% of other children his
age. His true score probably falls in the percentile rank range of 53 to 92. This
performance is Average to Superior. His score on the WJ-III ACH Math
Calculation subtest exceeded the performance of 29% of other children his age.
His true score probably falls in the percentile rank range of 12 to 55. This
performance is Low Average to Average. Juan's performance on the WJ-III ACH
Applied Problems subtest, a measure of math problem-solving abilities,
exceeded that of only 6% of other children his age. His true score probably falls
in the percentile rank range of 1 to 19. This performance is Deficient to Low
Average. Finally, his score on the Developmental Test of Visual-Motor
Integration (VMI-4) exceeded the performance of 37% of other children his
age. His true score probably falls in the percentile rank range of 16 to 63. This
performance is Low Average to Average.

It is important to note that the interpretations offered above are statistical inter- 
pretations. Statistical interpretation gives meaning and context to quantitative scores. 
Another type of interpretation that is equally valuable is called qualitative or contex- 
tual interpretation. In this type of interpretation, the professional counselor describes 
what tasks the client can and cannot do, or provides rich content descriptions to help 
the reader understand the nature of client developmental and clinical issues. The 
quality of contextual interpretations is determined primarily by the level of expertise 
and the theoretical or practical orientations of the professional counselor. For exam- 
ple, professional counselors who are expert in describing the characteristics of per- 
sonality disorders and the behaviors observed in a client with such a condition may 
be able to provide a rich contextual description of the client's current circumstances
and how the personality disorder is expressed and affects the client. 

Often statistical and contextual interpretations are combined in evaluation re- 
ports. For example, when interpreting the results of a WAIS-III protocol, a more sta- 
tistically oriented interpretation may be appropriate, supplemented by contextual 
comments, as in the following example: 

Intellectually, Jaime currently performs in the Average to High Average range of
general cognitive ability (Full Scale percentile rank = 82; percentile rank range =
75-89), as measured on the Wechsler Adult Intelligence Scale-Third Edition
(WAIS-III). Her Verbal Comprehension skills were measured to lie in the Average
to High Average range (percentile rank = 82; percentile rank range = 70-90),
commensurate with her Perceptual Organizational skills, which also fell in the
Average to High Average range (percentile rank = 73; percentile rank range =
53-86). Because of these results, Jaime's Full Scale IQ is the best choice of anchor
scores to represent her educational and intellectual potential and to determine
strengths and weaknesses.

On the Verbal Comprehension subtests from the WAIS-III, Jaime displayed 
an intrapersonal strength on a task requiring social comprehension and problem 
solving (Comprehension subtest percentile rank = 99; Very Superior). No intra- 
personal weaknesses were noted as her profile of verbal cognitive performance 
was well balanced. She performed in an Average to High Average capacity on 
tasks requiring verbal abstract reasoning (Similarities subtest percentile rank = 
75), and general information (Information subtest percentile rank = 63). Her 
word knowledge and facility performance (Vocabulary subtest) exceeded 95% of 
age-mates, falling in the Superior range of performance. 

On the Perceptual-Organizational subtests of the WAIS-III, Jaime dis- 
played no significant strengths, but did display a significant intrapersonal 
weakness on a task requiring nonverbal spatial reasoning (Block Design sub- 
test percentile rank = 25; Low Average to Average). Nonverbal spatial reason- 
ing is usually associated with math problem solving and advanced mathemat- 
ical reasoning, an area that Jaime has claimed as a challenging academic subject 
since her elementary years. Jaime's performance on a task of logical reasoning 
(Matrix Reasoning subtest percentile rank = 95) fell into the High Average to 
Very Superior range, while her ability to sequence socially meaningful stimuli 
(Picture Arrangement percentile rank = 75) and to attend to visually detailed 
missing elements (Picture Completion percentile rank = 75) both revealed an 
Average to High Average capacity. 

Jaime's Working Memory Index score from the WAIS-III fell into the 
Average to High Average range (percentile rank = 73; percentile rank range = 
55-84), commensurate with current ability estimates. Her performance on the 
Letter-Number Sequencing subtest (percentile rank = 63; Average to High 
Average range) was slightly less developed than her performance on the Digit 
Span subtest, which fell in the High Average to Very Superior range (percentile 
rank = 95). Both areas were better developed than her Arithmetic subtest (per- 
centile rank = 37; Average), a traditionally poor area of achievement for Jaime. 
Overall, little distractibility in the auditory channel appears to exist. 

Jaime's Processing Speed Index score from the WAIS-III fell in the
Borderline to Average range (percentile rank = 18, percentile rank range =
8-42), very significantly below current ability estimates, given Jaime's Average
to High Average intellectual capabilities (a 32-point discrepancy). Jaime's
psychomotor speed and short-term visual memory (Coding subtest) and her speed
in processing visual information (Symbol Search subtest, which does not have
a memory component) both fell into the Borderline to Average range. These
results are important, because distractibility frequently shows up in a client's
cognitive profile as a short-term memory deficiency. As will be seen in the
WJ-III fluency testing that follows below, Jaime displays a processing speed
deficiency. In addition, these and previous assessment results documented an
intrapersonal weakness in short-term visual memory.

As a second example, a more contextual description can sometimes help those 
who will work with the client better understand the client's current situation: 

Because a question arose regarding whether Ben possessed significant problems
with inattention, clinical and behavioral assessments focused on the presence
of age- and ability-inappropriate levels of distractibility, a primary symptom
of an Attention-Deficit/Hyperactivity Disorder (AD/HD). Miss Wallace
(2nd-grade teacher), and Mrs. Davis (reading teacher), educators who have
instructed Ben and who are well acquainted with his academic and behavioral
performance, completed the Conners' Teacher Rating Scale-Revised, Long
Version (CTRS-R:L). Miss Wallace also completed the Achenbach System of
Empirically Based Assessment (ASEBA) Teacher's Report Form (TRF). Mr. and
Mrs. Smith completed the Conners' Parent Rating Scale-Revised, Long Version
(CPRS-R:L). Mr. and Mrs. Smith and Mrs. Davis reported substantial concerns
related to inattention and disorganization; Miss Wallace did not. All were
in agreement that Ben displayed the following behaviors associated with inat- 
tention to a significant degree: forgets things he has already learned and has 
difficulty engaging in tasks requiring sustained mental effort. In addition, Mr. 
and Mrs. Smith and Mrs. Davis agreed that Ben frequently fails to give close 
attention to details and makes careless mistakes, has difficulty organizing tasks 
and activities, and is easily distracted by extraneous stimuli. Mrs. Smith and 
Mrs. Davis also agreed that Benjamin frequently does not seem to listen to 
what is said and has difficulty sustaining attention on tasks. Finally, Mr. and 
Mrs. Smith agreed that Ben does not follow through on instructions, fails to 
finish assigned work, and loses things necessary for tasks and activities. Each 
of these behaviors is a criterion for diagnosis of AD/HD — Predominantly 
Inattentive Type, and Ben fulfills the diagnostic criteria for this condition. All 
other behavioral and personality characteristics were reported to be within nor- 
mal limits, although some concern over social relationships and development 
was expressed by Miss Wallace. 

Such descriptions not only provide contextual understanding, but can aid treat- 
ment planning and outcomes evaluation. 



Think About It 5.2 Why is it important to demonstrate a client's scores 
as a range instead of as an individual score? How would you explain this 
process to a client? 






CRITERION-REFERENCED INTERPRETATION 



As mentioned previously, norm-referenced scores compare an examinee's performance
to other individuals in the norm group who share similar characteristics.
Criterion-referenced scores, on the other hand, compare the examinee's scores against
an absolute standard (i.e., criterion) of performance. In other words, this form of
testing measures levels of mastery. As such, performance on criterion-referenced
testing is often helpful in making important instructional decisions regarding the
mastery of specified curriculum goals and objectives or diagnostic decisions when a
certain number of criteria or level of severity is required. Criterion-referenced
interpretation is often divided into two categories: single-skill scores and multiple-skill
scores.



Single-Skill Scores



Single-skill scores can be obtained for almost any target measured against an estab- 
lished criterion. However, most single-skill targets are related to academic, occupa- 
tional, or social domains. For example, an educator may score a math problem 
worked by a student. A vocational rehabilitation counselor may evaluate the feeding 
ability of an individual who recently experienced a stroke. An observer may note the 
number of adult instructions with which a referred child complies. Scoring can be 
dichotomous (e.g., pass-fail, right-wrong) or continuous (i.e., allowing partial credit 
for the item). In this case, each point on the continuum (e.g., never, seldom, often, 
always) would have to be carefully defined. In single-skill probes, raw scores are often 
transferred into a ratio. For example, an examinee may correctly complete 40 of 50 
items on a test. Therefore, the score would be represented as 40/50. 



Multiple-Skill Scores



Many activities are not composed of single-skill units but contain multiple skills.
For example, measures of oral reading involve decoding of words, fluency, knowledge
of grammatical rules, and, often, comprehension of material read. Additionally,
educators often obtain answers to several questions on a mathematics exam com- 
posed of varying calculations (e.g., addition, subtraction, multiplication, division) 
rather than an answer to one problem. Multiple-skill scores are often divided into 
three areas of reporting: accuracy, retention, and verbal labels for percentages (Salvia & 
Ysseldyke, 2004). 

Accuracy 

An accuracy percentage is obtained by dividing the number of correct responses
provided by the examinee by the total number of items and then multiplying by 100.
For example, a student who correctly responded to 9 out of 10 items on a test would
receive a percent correct score of 90 (9/10 x 100 = 90%). Although educators often
convert raw scores into this format to report student outcomes, remember that such
scores are not equivalent in the same way as standard scores. A score of 90% on a
mathematics test is not the same as a score of 90% on a spelling test, because the
subject content is completely different and the items are presented in different
formats as well. Note that the score of 90% does not allow for comparison of scores.
The 90% could be the highest or lowest score in a distribution, and without access
to the distribution of scores, further comparative analysis is hindered.

Retention 

Retention refers to the percentage of information previously learned that is remem- 
bered at a later date. It also has been referred to as recall, memory, or maintenance. 
Retention is calculated by dividing the number of items remembered later by the
total number of items initially learned and then multiplying by 100. For example, an 
examinee may have learned 50 new words and recalled 40 of them two weeks later. 
This examinee's retention would be 80% (40/50 x 100 = 80%). 
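
Accuracy and retention use the same arithmetic: a part divided by a whole, multiplied by 100. A small sketch with the two examples from the text follows; the helper name percent is an illustrative choice.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100 * part / whole


accuracy = percent(9, 10)    # 9 of 10 items answered correctly
retention = percent(40, 50)  # 40 of 50 learned words recalled two weeks later
print(accuracy, retention)   # 90.0 80.0
```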

Percentages expressed as verbal labels 

Sometimes percentages are expressed as labels. Two methods in which percentages 
are expressed as labels include level of performance and grades. Level of performance 
is often divided into two levels: mastery level and instructional level (Salvia & 
Ysseldyke, 2004). In many educational contexts, mastery is set at 90% or above, and 
nonmastery is set at any percentage below 90%. 

Instructional level is further divided into frustrational, instructional, and inde- 
pendent levels of performance. Frustrational-level performance is usually defined as 
less than 85% correct. Instructional-level performance is defined as 85-95% correct. 
Independent-level performance is defined as above 95% correct. 
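
The cutoffs above translate directly into simple threshold checks. The sketch below uses the percentages given in the text (90% for mastery; 85% and 95% for the instructional levels). The function names are hypothetical, and cutoffs may differ from one educational context to another.

```python
def mastery_label(percent_correct: float) -> str:
    """Mastery is 90% or above; anything below 90% is nonmastery."""
    return "mastery" if percent_correct >= 90 else "nonmastery"


def instructional_level(percent_correct: float) -> str:
    """Frustrational: below 85%; instructional: 85-95%; independent: above 95%."""
    if percent_correct < 85:
        return "frustrational"
    if percent_correct <= 95:
        return "instructional"
    return "independent"


for score in (80, 85, 90, 95, 98):
    print(score, mastery_label(score), instructional_level(score))
```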

Grades have also been used as verbal labels for percentages. For example, many 
college professors use a grading scale in which anyone scoring 90-100% correct
would receive a grade of an "A," anyone scoring 80-89% correct would receive a 
grade of a "B," and anyone scoring 59% or below would receive a grade of "F." 



SOURCES OF INFORMATION ABOUT TESTS 

Selection of an assessment tool is an important clinical decision and a vital part of 
the counseling process. The information gained from the assessment itself often 
serves as the foundation of counseling as it gives the professional counselor much 
necessary information that will aid in determining therapeutic goals and interven- 
tions and which will be a great asset in measuring progress and outcomes. 

Professional counselors must carefully choose instruments that are designed to
address the referral questions and which are appropriate to their levels of education
and training. However, the number of assessments available today can prove
overwhelming, and many professional counselors find themselves relying on less than
appropriate tools simply out of habit or lack of information. This kind of choice is not
necessary given the resources available to help make an informed decision regarding
assessment selection. Although there is no one source that contains every assessment
tool developed, there are a variety of sources that professional counselors should
search on a regular basis. To help in this process, several sources are listed below (and
in Table 5.6) that provide assistance in selecting and evaluating tests most suitable
and technically sound.

Table 5.6 Evaluation of sources of information about tests

Source type: Test manuals
Advantages: Usually contain much information about theoretical basis, item
development, reliability, validity, standardization, and norms. Are often the best
single source.
Disadvantages: Test authors vary in comprehensiveness and psychometric
sophistication. External empirical validation of results is not available for years
after publication of the manual.

Source type: Publisher catalogs
Advantages: Provide current information on tests, even on new tests not found
elsewhere. Give costs and ordering information.
Disadvantages: Information may be biased. Necessary basic information is often
lacking.

Source type: Test review volumes
Advantages: Offer critical reviews by experts, with evaluation of weaknesses and
strengths of each test.
Disadvantages: Information is often dated. Reviews often do not include a thorough
discussion of purposes of test.

Source type: Journals
Advantages: Give research on issues in testing. Often show application of test.
Contain validity and reliability studies.
Disadvantages: Information is often theoretical and technical and may be dated due
to publication backlog.

Source type: Textbooks
Advantages: Give in-depth information on certain tests. Provide an overview of tests
in general.
Disadvantages: Information may be biased, dated, oversimplified, or technical.

Source type: Electronic sources
Advantages: Give easy access to current information. Are easy to search by subject
matter. Provide links to other sources.
Disadvantages: Information may be biased or incongruent in presentation. Access to
information may be difficult for some.



Think About It 5.3 When deciding which test to administer to a client, 
why would it be important to thoroughly research the test using several of 
the resources described in Table 5.6? 



Published Resources 



One of the most basic and essential assessment resources is the Mental Measurements 
Yearbook (MMY). First published in 1938 by Oscar K. Buros, this series of yearbooks 
is currently published by the Buros Institute of Mental Measurements of the 
University of Nebraska-Lincoln. This series of yearbooks contains thorough
critiques of many commercially available instruments. Each MMY includes descriptive
information about each test, including the purpose of the instrument, for whom the 
instrument is appropriate, cost, and the publisher. Additionally, the yearbooks con- 
tain critical reviews of each instrument, written by knowledgeable professionals. 
These reviews contain the strengths and weaknesses of the instrument. 

Another resource published by the Buros Institute is Tests in Print, which is
especially useful for quickly identifying which instruments are most appropriate for a
particular content area or for a particular use. Once a test is located, the professional
counselor can then cross-reference it with the more thorough descriptions found in
the MMY. For information relevant to critiquing tests, see Table 5.7.



Table 5.7 What to include in a test critique

1. Exact name of the instrument (or technique)
2. Author (person, organization, or company)
3. Publisher
4. Copyright date(s)
   a. Date first published
   b. Date(s) of revision(s)
   c. Date of version being reviewed
5. Purpose and recommended use
6. Appropriate respondent characteristics (e.g., age, grade, reading level, mental
   abilities, physical characteristics)
7. Available forms
8. Current cost information
9. Content
   a. Categories assessed or measured
   b. Types of items used
   c. Type(s) of responses required
10. Administration procedures and requirements
11. Time factors and considerations
12. Administrator qualifications
13. Interpreter or user qualifications
14. Scoring options and procedures
15. Type(s) of scores derived or reported
16. Normative data
17. Validity information
18. Reliability information
19. Statistical information other than validity or reliability
20. Multicultural issues
21. Evaluation
   a. Limitations for use in counseling or student development
   b. Advantages for use in counseling or student development



PRO-ED



While the Buros Institute has several prominent resources, PRO-ED, Inc., has use- 
ful sources for locating and evaluating tests. One, Tests: A Comprehensive Reference 
for Assessments in Psychology, Education, and Business (PRO-ED, 2003), contains 
more than 3,000 published tests. Each test listed includes a brief description, a state- 
ment of its purpose, and information regarding cost, scoring, and the publisher. 
Tests, though not reviewed, are easily accessed through the classifications and cate- 
gories used to organize the resource. 

For reviews and evaluations of tests, PRO-ED provides Test Critiques, a series of
volumes containing test critiques written by measurement and assessment experts.
Each critique includes emphasis on information that will aid the professional
counselor using the test, such as guidelines for administration, scoring, and
interpretation. Especially helpful are the explanations of technical terms that will make
the information more understandable, even to those with little testing experience.



Publisher Catalogs



Some test publishers will send catalogs upon request; others are available online. 
Catalogs can be especially useful for locating new tests and recent editions of previ- 
ously published tests — information that sometimes cannot be found in the sources 
discussed above. These catalogs provide information regarding uses of the test, cost, 
administration time, and other brief descriptions. 



Professional Journals and Textbooks 



Other sources of information include professional journals and some textbooks.
Journal articles often contain test reviews and may discuss the nature and use of
particular tests. These articles can be most easily located through electronic databases.
The professional counselor will find many journals very helpful in finding current
research on commonly used assessments, including Measurement and
Evaluation in Counseling and Development, Educational and Psychological
Measurement, Psychological Reports, Psychological Assessment, and Assessment for
Effective Intervention. Recently, desk references of different types of tests (e.g.,
Achievement Test Desk Reference [Flanagan, Ortiz, Alfonso, & Mascolo, 2002],
Intelligence Test Desk Reference [McGrew & Flanagan, 1998]) have been published to
assist examiners in selecting appropriate instruments. In addition, some textbooks
contain appendices with lists of widely used testing instruments. However, texts such
as this one mainly supply a brief overview of available instruments.

Several resources exist to help the professional counselor identify and locate
appropriate assessments in an efficient manner. When searching for an instrument
that will test a specific content area, Tests or Tests in Print will provide quick
information that can then be further explored in the Mental Measurements Yearbook.
When additional information is needed to help in understanding the specific
mechanics of a test, Test Critiques may prove beneficial. Both of the latter resources
provide sufficient information to weed out tests that are inappropriate or which
have obvious weaknesses. Although all of the above publications provide
comprehensive coverage of available tests, catalogs, journals, and textbooks can also prove
useful.



Electronic Resources



Changes in technology have greatly improved access to possible assessment instru- 
ments. The Buros Institute of Mental Measurements provides test information 
through an electronic source, in addition to its printed version. Various search
engines are available that allow viewing of a large amount of information on tests and
testing. "Test Reviews Online" is a web-based service of the Buros Institute of Mental
Measurements, available at www.unl.edu/buros, and makes test reviews available to 
individual users exactly as they appear in the Ninth through Fifteenth Mental 
Measurements Yearbook series. In addition, monthly updates are provided from the 
institute's latest test review database. For a small fee, users may download reviews for 
over 2,000 tests that include specifics on test purpose, population, publication date, 
administration time, and descriptive test evaluations. 

Another service of the Buros Institute is Tests in Print. Tests in Print (TIP) can
be accessed through the above website and serves as a comprehensive bibliography to 
all known commercially available tests that are currently in print in the English lan- 
guage. Now in its sixth edition, TIP provides vital information to users, including 
test purpose, test publisher, in-print status, price, test acronym, intended test
population, administration times, publication date(s), and test author(s). TIP also guides
readers to critical, candid test reviews published in the Mental Measurements Yearbook 
series. 

The Educational Testing Service (ETS) offers an electronic source for its test col- 
lection as well. The ETS Test Collection includes an extensive library of 20,000 tests 
and other measurement devices from the early 1900s to the present. The collection 
is advertised as the largest in the world and was established to make information on 
standardized tests and research instruments available to researchers, graduate stu- 
dents, and teachers. The ETS database can be accessed at www.ets.org/testcoll. From 
there, one can search by topic for instruments with each result providing descriptive 
information. Orders can also be placed at this site. 

PRO-ED has a useful source for locating and evaluating tests at 
www.proedinc.com/store/index.php. Through PRO-ED's online catalog products, 
available assessments can be located by a topic, title, or author name search. This 
search will give results with brief descriptions of each test, including price of test, 
materials included in each testing kit, and an option to place an order. 

Finally, another valuable electronic source can be found at http://aace.ncat.edu. 
This website is the home page of the Association for Assessment in Counseling and 
Education, a division of the American Counseling Association. Through AACE's 
"resources" option, professional counselors can find invaluable links to the ERIC test 
locator, some test reviews, assessment journals, and key documents such as Ethics in 
Assessment, Standards for Qualifications of Test Users, and Rights and Responsibilities of 
Test Takers: Guidelines and Expectations. 



COMMON ERRORS 



Regardless of level of training or expertise, professional counselors are human and 
are therefore susceptible to committing errors during the testing process. In inter- 
preting assessment instruments, professional counselors sometimes may commit in- 
ference and attribution errors. Although the assessments provide basic information 
about the client, the professional counselor must then sort the information and for- 
mulate overall conclusions and implications. While much is known about how to 
develop and evaluate psychological tests, much less is known about how to use the
information generated. Familiarity with the common errors described below may help
professional counselors minimize them in test interpretation and decision making.

The tendency to seek confirmatory evidence (confirmatory bias) is one of the most 
common mistakes in test interpretation. Humans are prone to self-confirmation and 
often search for confirmatory information. In other words, one often believes what 
one wants to believe. Research supports this claim and shows that the human ten- 
dency is to search out and attend only to evidence that conforms to one's hypothesis. 
Though professional counselors have been trained to attend to all information in clin- 
ical decision making, they are just as prone to attend to narrow paths of evidence. 
Because of this tendency, professional counselors often conclude what they already 
suspect. This process of searching for confirmation can lead to inaccurate conclusions 
and may lead to an increased confidence in one's conclusions and abilities. Some evi- 
dence suggests that beginning counselors are particularly subject to confirmatory bias, 
thinking they understand the problem before they really do and, thus, working on 
the wrong problem. 

A second error commonly made is the tendency to see patterns where no patterns 
actually exist. Because humans strive for predictability in life, we are prone to attrib- 
ute order to ambiguous information. This tendency can have implications in test in- 
terpretation, as themes and patterns may be said to exist where none have actually 
emerged. 

Finally, the use of preconceived biases is a form of error commonly found in test 
interpretation. Primarily, there is a tendency to overpathologize clients. Professional 
counselors are prone to search for information indicative of pathology and then in- 
terpret this information in a way that indicates more pathology than may actually 
exist. This tendency is exaggerated when the client is from a lower social class, non- 
white, disabled, or female. 

Professional counselors must be aware of these common errors throughout the 
assessment process, as inaccurate decisions regarding clients can be easily made. The 
use of quality information provided in psychological assessments is not enough to 
remedy the errors involved in the interpretation process. Given these concerns, the 
following recommendations are provided: 

■ Do not confuse the ability to explain current data with the ability to predict fu- 
ture performance. 

■ Continue to assess skills over time instead of relying on one evaluation of the ex- 
aminee's performance. 

■ Collect data from multiple sources. Do not rely solely on self-report or observa- 
tion of one informant. 

■ Consider all other possibilities, and rule out alternative hypotheses. 

■ Choose the highest quality and most appropriate assessment instruments. 

■ Recognize personal biases, especially those pertaining to age, gender, class, and 
ethnicity. 

■ Be aware of the norms used during test construction, as well as the differences be- 
tween the client and the norm group used. 




As humans, professional counselors must continually strive to overcome any po- 
tential biases or attribution errors that may affect their decisions regarding client per- 
formance. While psychological tests improve the accuracy of decision making, care 
must still be taken in their interpretation and application. 



SUMMARY/CONCLUSION 



Testing involves administering questions to an individual or individuals in order to 
obtain a score. Assessment differs from testing in that it includes such processes as in- 
terviewing, records review, observations, rating scales, standardized testing, and 
many other provisions that create a larger process. When administering tests, take 
into account the test qualifications specified by the test's manual. Also, professional 
counselors should ensure that examinees are prepared and familiar with the proce- 
dures and process of testing before beginning testing. 

The testing environment is a very important aspect to consider when one wishes 
to obtain accurate results. One should always try to strictly follow time specifica- 
tions, directions, registration and identification procedures, and any other proce- 
dural guidelines laid out by the test's manual. If any deviation from the specified pro- 
cedures occurs during testing, it should be thoroughly documented by the examiner. 

Many factors can affect test scores. First, the examiner-examinee relationship 
should be one that is neutral. Reinforcement or negativity during testing can greatly 
affect scores. Professional counselors should always take into account individual dif- 
ferences when administering tests and interpreting scores. Furthermore, expectancy 
of the examiner can affect test scores. 

Scoring a test quantifies performance and aids in interpretation.
Several formats for scoring tests exist. Tests can be self-scored, scored by others, or 
scored by computers. While computer scoring is the most accurate form of scoring, 
computers are often incapable of making the judgments required for test interpreta- 
tion. 

Norm-referenced interpretation involves comparing the obtained score of the
examinee to the norm group. These scores can be expressed in developmental equiv- 
alents such as age equivalents, which compare the examinee to others of the same 
age, and grade equivalents, which compare the examinee to others of the same grade 
level. There are many problems with making comparisons like those made in devel- 
opmental equivalents, and interpretation should be done with care. In order to in- 
terpret developmental equivalents, the test interpreter compares the examinee's 
chronological and mental age to obtain a developmental quotient. 
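
The comparison described above can be sketched in a few lines of Python. The summary does not state a formula, so the ratio used here (developmental age divided by chronological age, times 100) is an assumption based on the conventional definition of a ratio quotient; the function name and the choice of months as the unit are illustrative only.

```python
def developmental_quotient(developmental_age_months, chronological_age_months):
    """Ratio quotient: (developmental age / chronological age) * 100.

    Assumption: both ages are expressed in the same unit (months here).
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * developmental_age_months / chronological_age_months


# Hypothetical examinee: 8 years 0 months old (96 months), performing like a
# typical child of 6 years 0 months (72 months).
print(developmental_quotient(72, 96))  # 75.0
```

A quotient of 100 would indicate performance typical of the examinee's chronological age. The cautions about developmental equivalents noted above apply equally to any quotient derived from them.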

Scores on the same test for several different examinees of different ages can be 
compared by using scores of relative standing. Common types of scores of relative 
standing are standard scores, which have a designated mean and standard deviation. 
T scores, z-scores, deviation IQs, normal-curve equivalents, and stanines are com- 
mon types of standard scores. 
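
Because every standard score is a linear transformation of the z-score, the types listed above can be illustrated with one short sketch. The means and standard deviations below are the conventional ones and are assumptions here (some deviation IQ scales, for instance, use a standard deviation of 16 rather than 15); the function names and the sample norm-group values are hypothetical.

```python
import math

# Conventional (mean, standard deviation) for each standard score type.
# Assumption: individual tests may define these differently.
SCALES = {
    "z": (0.0, 1.0),
    "T": (50.0, 10.0),
    "deviation_iq": (100.0, 15.0),
    "nce": (50.0, 21.06),
}


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    return (raw - norm_mean) / norm_sd


def convert(z, scale):
    """Re-express a z-score on another standard score scale."""
    mean, sd = SCALES[scale]
    return mean + sd * z


def stanine(z):
    """Stanine: mean 5, SD 2, rounded to a whole number and limited to 1..9."""
    return int(min(9, max(1, math.floor(2 * z + 5 + 0.5))))


# Hypothetical raw score of 65 in a norm group with mean 50 and SD 10.
z = z_score(65, 50, 10)
print(z)                           # 1.5
print(convert(z, "T"))             # 65.0
print(convert(z, "deviation_iq"))  # 122.5
print(stanine(z))                  # 8
```

Expressing every score through z first makes it clear why scores on these scales are interchangeable: each one reports the same relative standing in different units.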

Criterion-referenced scores compare the examinee's scores against an absolute
standard (i.e., criterion) of performance. Types of criterion-referenced scores include
single-skill scores, which assess a solitary academic, occupational, or social domain,
and multiple-skill scores, which measure any area that is composed of several skills.
Multiple-skill scores can be reported by expressing accuracy, retention, verbal labels
for percentages, and instructional-level scores.

There are many available sources of information on tests, test administration,
test scoring, and psychometric properties of tests. The chapter covers in detail the
many published sources of this information and electronic information.

Lastly, professional counselors should take into account sources of error in
testing and assessment. First, although professional counselors are trained to take all
sources of information into account, they often make mistakes. Such mistakes
include overpathologizing clients, seeking to confirm their hypotheses with more
evidence, and recognizing patterns that may not actually be present. The chapter
includes a series of steps and precautions to avoid this type and other types of error.



KEY TERMS



developmental equivalent
deviation IQ
normal-curve equivalent
percentage
percentile rank
raw score
scores of relative standing
standardization sample
standard score
stanine
test score
T score
z-score



CHAPTER 6



How Tests Are Constructed 

by Carl J. Sheperis, Carey Davis, and R. Anthony Doggett 



This chapter provides readers with preliminary information related to the con- 
struction and evaluation of psychological and educational tests, including: the 
purposes of tests; observables; item generation (multiple-choice, essay, true- 
false); technical analysis (item difficulty, item discrimination); and norms. The chap- 
ter also addresses the process of building quality tests that are aimed toward promot- 
ing valid score interpretation, and how to evaluate the use of a specific test for a 
specific purpose. Finally, the chapter reviews the fundamentals of test development, 
how to choose among already existing tests for a specific purpose, how to use the re- 
sults of standardized tests to help make decisions about individuals, and how to iden- 
tify flaws in assessment instruments and procedures. 

Many of you reading this book may be highlighting or underlining certain 
words or phrases to help yourself remember key information you might encounter 
on the next exam. As you study for that exam, you might also want to ask the in- 
structor some questions to help yourself prepare. First, you might ask the purpose of 
the test (e.g., the objectives of the test, the way it will be scored, and how the results 
will affect your final grade). Next, you might ask what content the test will cover 
(e.g., the chapters to be covered on the test and whether the questions will require 
memorization of facts or application of knowledge). Finally, you might ask what the
format of the test items will be (e.g., multiple-choice, short-answer, essay). 

When instructors are constructing a test, they, consciously or unconsciously, 
will be asking and answering similar questions: "What is the purpose of the test?" 
"How do I assess the content to be covered by the test?" and "How should I write 
the items on the test?" Similarly, identifying the purpose of a test, observables related 
to the test, item generation procedures, and test format are critical components of any 



189 



1 90 Chapter 6 



test construction process, whether the test is a simple one to be used in an elemen- 
tary school classroom, an examination for a graduate-level course, or a published 
psychological assessment instrument. However, appropriate test construction does 
not stop when items are developed. Development of a quality test requires appro- 
priate statistical analyses to determine item difficulty and item discrimination. Some 
tests, such as published psychological instruments, also use norms to help test users 
interpret test results. Each of the above concepts related to test development is dis- 
cussed throughout this chapter. The brief introduction to this material given in 
this chapter, however, will not provide adequate guidance to become a seasoned 
test developer; those interested in learning more about test construction should see 
Crocker and Algina (1986). 
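
As a preview of the statistical analyses mentioned above, the following Python sketch computes two common item statistics. It rests on assumptions not stated in this introduction: item difficulty is taken as the proportion of examinees who answer the item correctly, and item discrimination as the upper-lower index (the proportion correct in the highest-scoring group minus the proportion correct in the lowest-scoring group, with 27% groups as a common convention). The function names and the data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(item_responses, total_scores, fraction=0.27):
    """Upper-lower index: p(correct | top group) - p(correct | bottom group).

    Assumption: groups are the top and bottom `fraction` of examinees
    ranked by total test score.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Examinee indices ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_responses[i] for i in upper) / k
    p_lower = sum(item_responses[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data for ten examinees: one item's scores and total scores.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
totals = [18, 17, 16, 9, 15, 8, 14, 7, 6, 13]

print(item_difficulty(item))              # 0.6
print(item_discrimination(item, totals))  # 1.0
```

In this made-up data set the item is answered correctly by 60% of examinees and separates the strongest from the weakest examinees perfectly; real items rarely behave so cleanly.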



PURPOSE OF THE TEST 



The first step in test construction is to define the general purpose of the test. The in- 
structor probably defined the general purpose of your next test on your syllabus (e.g., 
the test may assess class members' knowledge of the information from Chapters 1 
through 6 of this textbook and be worth 40% of your final grade in the course). 
Although the general purpose of a published test must be more formally defined 
than that of your next classroom exam, the basic principles are the same. Test devel- 
opment addresses the population taking the instrument (i.e., the members of the 
class) and the content of the test (i.e., knowledge of Chapters 1 through 6). 

Although course-related tests provide a very basic example of test construction, 
for a standardized test, the content of the test and the theory on which the test is 
based may be considerably more complex. There are many questions related to test 
purpose that the instructor does not necessarily need to consider when writing a 
course-related test — questions that are, however, crucial in constructing many other 
types of standardized tests. Test developers must consider such issues as whether a 
test will be norm referenced or criterion referenced, what objectives will be meas- 
ured, how items and scores will be scaled, and what approach to test construction 
will be used. Cohen and Swerdlik (1999, pp. 216-218) suggested that test develop- 
ers need to consider at least the following 14 questions prior to developing a test: 

1. What is the test designed to measure?

2. What is the objective of the test?

3. Is there a need for this test?

4. Who will use this test?

5. Who will take this test?

6. What content will the test cover?

7. How will the test be administered?

8. What is the ideal format of the test?

9. Should more than one form of the test be developed?

10. What special training will be required of test users for administering or
interpreting the test?

11. What types of responses will be required by test takers?

12. Who benefits from the results of this test?

13. Is there any potential for harm from administration of this test?

14. How will meaning be attributed to scores on this test?

In addition, the question "How does the test address multicultural/diverse
populations?" must be asked.



Examinees



For several reasons, it is important to define who will be in the normative sample
when constructing a test. First, the age range of the test takers will be a factor in
determining the content and how that content will be assessed. Also, the reading
ability of the test takers will affect the way the items are written and whether the test
will be presented in written or oral form. Additionally, the cultural backgrounds of
examinees may influence the items that are included on the test and the way items
are presented. Finally, it is important to identify who needs to take the test and/or
who would want to take it (Cohen & Swerdlik, 1999).



Goals and Theory



The goals of any test are inherently based on a theory. For example, a typical class- 
room test is probably based on the theory that if the examinee is able to answer a 
certain percentage of questions correctly, the examinee is competent in knowledge of 
the course content. In this case, knowledge of course content is theoretically related 
to test performance. Standardized tests are often more complex, because test devel- 
opers writing an intelligence test would first have to choose a theory of intelligence 
on which to base the instrument. Likewise, test developers writing a personality test 
would have to define the aspects of personality the test would purport to measure. 
The theory on which a test is based links the content of the test to the constructs, 
characteristics, or attributes that the test is designed to measure. 



Norm Referenced or Criterion Referenced 



Once the theory that underlies the purpose of the test has been clarified, the next step 
in the test construction process is to decide whether a test should be norm referenced 
or criterion referenced. A norm-referenced test is one in which an individual's score is 
interpreted by comparing it with other individuals' scores (i.e., a normative sample); 
a criterion-referenced test is one in which an individual's score is interpreted in terms of 
a predetermined criterion of demonstrated skills (i.e., objectives) (Mehrens & 
Lehmann, 1991). A test developer's decision about whether a test should be norm ref- 
erenced or criterion referenced must be based on the purpose or goal of the test 
(Hopkins, 1996). For example, if a test is designed to assist employers in choosing 
from a large pool of potential employees, its goal should be to make comparisons 
among the candidates; therefore, a norm-referenced test would be appropriate. On 
the other hand, if the purpose of a test is to help a teacher determine whether individ- 
ual students have mastered certain instructional objectives in order to identify the ones 
who need additional tutoring in specific areas, a criterion-referenced test would be
beneficial because it would yield information about the areas in which the students
needed help instead of just comparing the students to one another (as a
norm-referenced test would do). There are times when a test may be both norm referenced and
criterion referenced. When you take your next test, your instructor will probably give
you a grade based on a predetermined criterion, such as the number of questions you
must answer correctly in order to pass the test. Such a grade would indicate that the
test is to be criterion referenced. However, if your instructor gives you information
about the class average on the test, enabling you to compare your score to the scores
of your classmates, the test could become not only a criterion-referenced test but also
a very simple norm-referenced test.



Objectives



Test developers who write criterion-referenced tests must carefully consider objec- 
tives when writing their tests. The terms objectives and goals may easily be confused, 
but in this discussion, the objectives refer specifically to instructional objectives meas- 
ured by criterion-referenced tests, whereas goals have a broader reference, applying to 
many types of tests. For example, when instructors write a class test (which is very 
likely to be an informal criterion-referenced test), they look at the objectives listed on 
the syllabus and write the test so that it measures those objectives; the goals of the 
test are much broader — primarily to determine whether students have mastered 
course content well enough to pass the course. When considering the objectives to 
be tested, test developers must take several factors into account. First, the specificity 
of the objectives will affect the way the test items are written (Hopkins, 1996). Also, 
Hopkins contended that tests that measure educational objectives must define these 
objectives in terms of "Bloom's taxonomy," which categorizes objectives into six 
hierarchical levels: knowledge, comprehension, application, analysis, synthesis, and 
evaluation. Objectives are important to consider in criterion-referenced tests, but not 
all tests measure objectives. For example, a personality test does not measure whether 
an individual has attained mastery of a certain personality type; instead, it measures 
a person's personality type. Many norm-referenced tests do not measure whether in- 
dividuals meet certain objectives. 



Scaling 



Another issue that test developers must consider is scaling, which is "the process by 
which a measuring device is designed and calibrated, and the way numbers (or other 
indices) — scale values — are assigned to different amounts of the trait, attribute, or 
characteristic being measured" (Cohen & Swerdlik, 1999, p. 219). In other words, 
scaling is basically attaching numbers to the construct that the test is theorized to 
measure. There are cases in which scaling is fairly simple. On your next test, each 
question will probably be assigned a point value, and your score will reflect the num- 
ber of questions you answer correctly. The example of your next test represents a 
summative scale, in which correct responses are added together (summed) to
calculate the final score.

The example of the scaling for your next test is fairly straightforward; however, 
scaling can be an extremely complicated process. Scales may be defined in several 
different ways. For example, scales may be defined by whether they are nominal, or- 
dinal, interval, or ratio. Scales may also be defined by whether they are rating scales 
or comparative scales or by whether they are unidimensional or multidimensional. For 
example, some tests use rating scales, which require examinees to rate test items (e.g.,
"On a scale of 1 to 10, with 1 being poor and 10 being excellent, rate the service you 
received from your waiter"). On some tests that use such rating scales, the ratings are 
summed for the final score; therefore, they are summative tests (Cohen & Swerdlik, 
1999). Rating scales may take many forms. In some instances, true-false tests may be 
considered rating scales (e.g., "I felt depressed this morning. Circle one: True/False"),
or rating scales may be written as a series of faces — such as a sad face, a medium face, 
and a happy face — that examinees should circle. A very popular type of rating scale 
is the Likert scale, which allows examinees to choose from a continuum of five re- 
sponses, usually with Agree or Approve on one end of the continuum and Disagree 
or Disapprove on the other end. Comparative scales are somewhat similar to rating 
scales. When comparative scales are used, an examinee might be given items to sort 
or rank in a certain order (i.e., from most to least appealing, or from worst to best). 
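
Scoring a summative rating scale of the kind described above can be sketched in Python as follows. The five response labels and their numeric values are assumptions chosen for illustration (published instruments word their anchors differently), and the optional reverse keying of negatively worded items is a common practice that this chapter does not discuss.

```python
# Hypothetical five-point Likert continuum and its numeric values.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def score_likert(responses, reverse_keyed=()):
    """Sum the numeric values of the chosen responses (a summative score).

    `responses` maps item names to response labels. Items listed in
    `reverse_keyed` are flipped (1 <-> 5, 2 <-> 4) before summing.
    """
    total = 0
    for item, label in responses.items():
        value = LIKERT[label]
        if item in reverse_keyed:
            value = 6 - value
        total += value
    return total


# Hypothetical three-item scale in which q3 is negatively worded.
answers = {"q1": "Agree", "q2": "Strongly agree", "q3": "Disagree"}
print(score_likert(answers))                        # 11
print(score_likert(answers, reverse_keyed={"q3"}))  # 13
```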

Another way of defining a scale is whether it is unidimensional or multidimen- 
sional. Unidimensional scales are those in which numbers are assigned only to one di- 
mension; multidimensional scales are those in which several different dimensions may 
underlie the examinee's responses (Cohen & Swerdlik, 1999). For example, if a re- 
sponse to a test item may be interpreted in many different ways, it is likely that the 
item is part of a multidimensional scale. All of the scales mentioned to this point 
yield ordinal scores. 

Two other types of scales are the Guttman scale and the Thurstone scale. The 
Guttman scale is an ordinal scaling method in which items are arranged to form a hi- 
erarchy, so that an examinee who agrees with or confirms one item on the hierarchy 
also agrees with or confirms the items lower than that item on the hierarchy but dis- 
agrees with or disconfirms the items higher than that item on the hierarchy. The 
Guttman scale is also called the deterministic or monotone model. Thorndike (2005, 
p. 393) gave the following example of a Guttman scale: 

1. Abortion should be available to any woman who wishes one. 

2. Abortion should be legal if a doctor recommends it. 

3. Abortions should be legal whenever the pregnancy is the result of rape or incest. 

4. Abortion should be legal whenever the health or well-being of the mother is en- 
dangered. 

5. Abortion should be legal only when the life of the mother is endangered. 

Such a graduated scale presumes that a respondent selecting response choice 1 
also agrees with the conditions listed in choices 2 through 5. Conversely, an individ- 
ual selecting choice 5 would be presumed to not agree with choices 1 through 4. 
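The hierarchy assumption behind a Guttman scale can be checked mechanically. The following is only a minimal sketch: the function names are hypothetical, and it assumes responses are recorded as agree/disagree values listed in the same order as the statements above, so that agreeing with one statement implies agreeing with every later statement.

```python
def is_guttman_consistent(agreements):
    """Return True if the pattern fits a perfect Guttman hierarchy.

    `agreements` is a list of booleans ordered like the statements above:
    agreeing with one statement implies agreeing with every later one.
    """
    seen_agreement = False
    for agrees in agreements:
        if agrees:
            seen_agreement = True
        elif seen_agreement:
            # A disagreement after an agreement breaks the hierarchy.
            return False
    return True


def guttman_score(agreements):
    """Count the statements endorsed; meaningful only for consistent patterns."""
    return sum(agreements)


consistent = [False, False, True, True, True]    # first agreement at statement 3
inconsistent = [True, False, True, True, True]   # agrees with 1 but not with 2

print(is_guttman_consistent(consistent), guttman_score(consistent))
print(is_guttman_consistent(inconsistent))
```

A respondent whose pattern fails this check has answered in a way the deterministic model does not anticipate, which is one reason perfect Guttman scales are hard to achieve in practice.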

The Thurstone scale is a scaling method that yields interval data (Cohen & 
Swerdlik, 1999). In this method, items are rated by a group of judges, and means 
and standard deviations of the judges' ratings are calculated for all of the items. 
Then, items on which most judges agreed (i.e., items with low standard deviations) are 
included in the test. Finally, the examinee rates the items, and the examinee's score 
is determined by the judges' ratings of the items the individual selects. The 
Thurstone scale is also called the probability or nonmonotone model or the equal- 
appearing interval model. The type of scale that is used in a test should be selected 
according to the variables being measured and the examinees for whom the test is 
intended. 
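The Thurstone procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the published method in full: the item labels, the judges' ratings, and the cutoff of 1.0 for the standard deviation are all hypothetical, and the examinee's score is taken here as the mean of the judges' mean ratings for the items the examinee endorses.

```python
from statistics import mean, stdev


def build_scale(judge_ratings, max_sd=1.0):
    """Keep items on which judges agree closely; map item -> scale value.

    `max_sd` is a hypothetical cutoff chosen for this illustration.
    """
    scale = {}
    for item, ratings in judge_ratings.items():
        if stdev(ratings) <= max_sd:
            scale[item] = mean(ratings)  # scale value = mean judge rating
    return scale


def score_examinee(endorsed_items, scale):
    """Average the scale values of the retained items the examinee endorsed."""
    values = [scale[item] for item in endorsed_items if item in scale]
    return mean(values) if values else None


# Hypothetical ratings from four judges.
judge_ratings = {
    "A": [2, 2, 3, 2],    # judges agree: low standard deviation
    "B": [1, 9, 5, 10],   # judges disagree: item is dropped
    "C": [8, 8, 9, 8],
}

scale = build_scale(judge_ratings)
print(scale)
print(score_examinee(["A", "C"], scale))
```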



Approaches to Test Construction 



After a test developer has defined the general purpose of the test, identified the ex- 
aminees who are to take the test, described the theory on which the test is based, de- 
cided whether the test will be norm referenced or criterion referenced, outlined the 
objectives that will be measured, and selected a scaling method, the developer must 
choose an approach to test construction. Approaches to test construction can be di- 
vided into three basic categories: the rational approach, the empirical approach, and 
the bootstrap approach (Janda, 1998). 

Test developers who choose the rational approach rely on reason and logic to 
create items instead of relying on collecting data for statistical analysis when con- 
structing items (Janda, 1998). The rational approach is also called the theoretical ap- 
proach because the test developers are theorizing that the items are related to the con- 
structs they are attempting to measure (Hansen, 1999). Your instructor will probably 
use the rational approach when constructing your next test. In contrast, test devel- 
opers who choose the empirical approach rely on data collection to identify items 
that relate to the construct they are attempting to measure. In this approach, items 
are developed randomly, and whether items are used is based on the data gathered 
when the items are administered to a pool of examinees participating in the test con- 
struction process (Janda, 1998). Two different methods used in the empirical ap- 
proach are the method of contrast groups (in which items are examined based on the 
different responses of two or more groups of people who are selected because of cer- 
tain characteristics that each group has in common) and the method of item cluster- 
ing (in which factor analysis is used to identify which items correlate with one an- 
other) (Lichtenberg, 1999). The bootstrap approach is a combination of the rational 
approach and the empirical approach in that items are written based on a theory (in- 
stead of randomly), and then empirical procedures are used to verify that the items 
actually measure the construct they are theorized to measure (Janda, 1998). Another 
name for the bootstrap approach is the sequential method (Lichtenberg, 1999). 
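As a rough illustration of the method of contrast groups, the sketch below keeps only those items whose endorsement rates differ noticeably between two groups. The function name, the data, and the 0.30 cutoff are assumptions made for this example; an actual test developer would rely on a formal statistical comparison.

```python
def contrast_group_items(group_a, group_b, min_difference=0.30):
    """Return indices of items whose endorsement rates differ between groups.

    Each group is a list of response rows; 1 = endorsed, 0 = not endorsed.
    `min_difference` is a hypothetical cutoff used only for illustration.
    """
    n_items = len(group_a[0])
    selected = []
    for i in range(n_items):
        rate_a = sum(row[i] for row in group_a) / len(group_a)
        rate_b = sum(row[i] for row in group_b) / len(group_b)
        if abs(rate_a - rate_b) >= min_difference:
            selected.append(i)
    return selected


# Hypothetical responses of two groups to three trial items.
group_a = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
group_b = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 0, 0]]

print(contrast_group_items(group_a, group_b))
```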



A Test Development Example 



The reader now has a basic understanding of many of the decisions that a test devel- 
oper must consider in order to thoroughly delineate the purpose of the test. General 
examples of the concepts have been provided, but a more specific example may give 
a clearer picture of this crucial step in the test construction process. The Black 
Adolescent Racial Identity Scale (BARIS) (see Figure 6.1) constructed by Sheperis 
(2001) serves as an example demonstrating the development of a test purpose. 






BARIS

Instructions: Each item may or may not be true for you. To the right of each item is a set of choices that
describes how you think about the item. Select one of the choices by circling the number below it:

    Strongly Agree    Agree    Disagree    Strongly Disagree
          4             3         2                1

Please answer every item, and make only one choice per item. There are no right or wrong answers.
If a question does not seem to apply to you, imagine a time that it might and answer the question
based on your thought.

Sample Question:                                                        Strongly  Agree     Disagree  Strongly
                                                                        Agree                         Disagree
A. I like pizza.                                                        4         3         2         1

Question:                                                               Strongly  Agree     Disagree  Strongly
                                                                        Agree                         Disagree
1. It is important to take part in Black activities.                    4         3         2         1
2. Whites get more chances in life.                                     4         3         2         1
3. It is good to be around Blacks and other races.                      4         3         2         1
4. Whites are more trustworthy than Blacks.                             4         3         2         1
5. It is easier to get along with Black people.                         4         3         2         1
6. People should be proud of their race.                                4         3         2         1
7. Teenagers should only date people from the same race.                4         3         2         1
8. People from all races have good things about them.                   4         3         2         1
9. It is good to get along with all kinds of people.                    4         3         2         1
10. Children should know what it means to be Black.                     4         3         2         1
11. White counselors are better than Black counselors.                  4         3         2         1
12. It is good to do things with people from all types of backgrounds.  4         3         2         1
13. It is OK to date somebody from another race.                        4         3         2         1
14. White friends are better than Black friends.                        4         3         2         1
15. People from all races should get along.                             4         3         2         1
16. It's OK for Whites and Blacks to mix.                               4         3         2         1
17. Black counselors understand kids better than White counselors.      4         3         2         1
18. It is better to have lighter skin.                                  4         3         2         1
19. Whites have nicer hair than Blacks.                                 4         3         2         1
20. It is important to belong to a Black church.                        4         3         2         1
21. It is good to learn about the race and background of others.        4         3         2         1
22. It is better to be more like Whites.                                4         3         2         1


Figure 6.1 The Black Adolescent Racial Identity Scale (BARIS) 



Sheperis (2001) created the BARIS "to measure racial identity development 
(RID) in Black adolescent males" (p. vii). This statement outlines the general purpose 
of the test, including the theoretical basis for the goals of the test and the examinees 
for whom the test is designed. Rather than simply creating a test to measure racial 
identity development, Sheperis constructed the test for the ultimate goal of using the 
information from the test to provide effective counseling programs for Black adolescent 
males who are involved in the juvenile justice system. The implicit theory that 
the test is based upon is twofold. First, the theory is that racial identity development 
occurs in measurable statuses, defined by Sheperis for the purposes of the test as 
assimilation, self-segregation, and universal acceptance. Additionally, the theory is that 
knowledge of the racial identity development of Black adolescent males would lead 
to more effective counseling programs. As noted previously, the examinees are 
identified as Black adolescent males. 

The next step that Sheperis (2001) had to consider when constructing the BARIS 
was whether the test would be criterion referenced or norm referenced. Because the 
purpose of the test is to compare characteristics of individuals (characteristics indicating 
individuals' status of racial identity development) within a specified group (Black 
adolescent males), a norm-referenced test was an appropriate choice for the BARIS. As 
such, Sheperis did not need to consider specific criteria or objectives that the test 
would measure. However, he did need to consider the way he would go about measuring 
the different statuses of racial identity development, but this is somewhat different 
from defining objectives and is discussed in the next section. 

The next question that Sheperis (2001) had to consider was the question of the 
scaling method he would use for the BARIS. He selected a 4-point scale. Individual 
items were designed to reflect the different statuses of racial identity, and response 
scores were summed to yield raw scores for each of the three statuses. Thus the scaling 
method was a summative rating scale. 

The final consideration that Sheperis (2001) had to take into account when 
defining the purpose of the BARIS was the approach to test construction that he 
would use. He used the bootstrap approach, or sequential model, which is a combination 
of the rational approach and the empirical approach. He wrote items based 
on the theory of racial identity development after careful study of other measures of 
racial identity development and then identified the items to include in the test 
through empirical methods. An overview of the BARIS is provided in Box 6.1. 



OBSERVABLES 


Now that the purpose of the next course exam is known (including more informa- 
tion than you ever expected to be related to the purpose of any test), you may won- 
der what content the test will cover. Of course, you know the goals of the test and 
the instructional objectives that need to be mastered, but to really prepare for the 
test, you need to know exactly how the instructor is going to go about measuring 
whether students have met the objectives — for example, whether the test questions 
will require application of knowledge through scenarios or simply straightforward 
answers directly from this textbook. 

The instructor's decision about how to assess the content to be covered by the 
course exam is a question of observables. Observables are the specific variables and 
behaviors that are observable aspects of the construct stemming from the implicit 
theory. In terms of the course exam, the implicit theory is that test performance is 
related to knowledge of course content. The answers given to the questions that the 
instructor chooses to ask on the test are the specific behaviors the instructor will 
observe to determine whether students have mastered the course content. 



Box 6.1 Overview of the BARIS 

The Black Adolescent Racial Identity Scale (BARIS) was developed in several 
phases. Initial items for the BARIS were generated through a review of existing 
racial identity development (RID) scales and with attention to the tri-status 
model of racial identity development. The initial version of the BARIS, which 
was subjected to expert review, contained 59 items related to three RID statuses: 
assimilation, self-segregation, and universal acceptance. In the initial 
phase of this study, 327 participants from Mississippi school districts completed 
the BARIS and a feedback form. A factor analysis was used to identify 
the initial factor structure of the initial BARIS version. Based on the respective 
factor loadings on the three BARIS factors (i.e., assimilation, self-segregation, 
and universal acceptance), 37 items were eliminated from the initial instrument, 
leaving the 22 items comprising the final version of the BARIS. 

In an attempt to establish the concurrent and divergent validity (discussed 
in Chapter 4) of the BARIS, a second phase of the study was conducted 
in which the BARIS was administered to 126 Black adolescent males 
from juvenile offender programs in Mississippi, Florida, and Pennsylvania. 
One of three additional RID instruments was administered to subgroups of 
25 participants along with the BARIS. The instruments included in this 
phase of the study were the Racial Identity Attitude Scale, the Multigroup 
Ethnic Identity Measure (MEIM), and the Adolescent Survey of Black Life. 

In order to establish a reliability estimate, Cronbach's alpha (discussed in 
Chapter 3) was computed for BARIS scores from the second phase of the 
study. Demographic information related to age, racial designation, socioeconomic 
status (SES), arrests, and involvement in the juvenile justice system 
was collected from participants in the second phase of the study. The results 
of this study showed statistically significant differences in scores based on 
demographic characteristics. With regard to concurrent validity, two statistically 
significant correlations emerged from the analysis. Evidence of divergent 
validity was demonstrated by the lack of statistically significant 
correlations between the BARIS Assimilation and Universal factor scores and 
all scales of the MEIM. 



Defining Observables 



Test developers should use several steps to specify observables. First, they must define 
the content and skills to be measured by the test. This step is similar to defining 
objectives for a criterion-referenced test; however, it applies to other types of tests as well. 
In a criterion-referenced test, the objectives may also serve as the content of the test. 



198 Chapter 6 



In other types of tests, the content or skills to be measured are more difficult to de- 
fine and are usually guided by the theory on which the test is based. Next, test de- 
velopers must describe traits or characteristics related to the content domain in behav- 
ioral terms. That is, they must decide what behaviors indicate that a person has 
certain traits or characteristics and describe the way in which they will measure those 
behaviors. For example, when constructing a course exam, the instructor will prob- 
ably identify the behavior of answering questions as an indicator that students have 
the trait of being knowledgeable of the course content; however, answering questions 
is only one example of a behavior that a test developer can choose to measure. A 
physical education instructor would probably not choose answering questions as the 
behavior to measure whether the students were physically fit. Instead, the instructor 
might choose and describe several physical tasks for the students to perform to indi- 
cate their level of physical fitness. Finally, the test developer may need to perform a 
job analysis, breaking the behavior chosen for observation into its smaller required 
tasks and skills. For example, the instructor should recognize the tasks students must 
complete to answer the questions on the next course exam (i.e., comprehending each 
question, recalling the information gained in class and from the textbook, synthesiz- 
ing that information to decide on a response, planning the response, and writing a 
response using correct grammar and readable handwriting). By breaking the job of 
answering the questions into its smaller parts, the instructor can better understand 
student responses and how they reflect knowledge of the course content. 



An Example of Observables 



Using the BARIS as an example, Sheperis (2001) defined the observables of the test 
through the following steps: First, he identified the content domain through 
consideration of the theory of racial identity development and a thorough review of other 
tests that have purported to measure racial identity development. The content areas 
he chose to measure were assimilation, self-segregation, and universal acceptance. 
Next, he defined the traits associated with the identified content areas in behavioral 
terms. In this step, Sheperis (2001) classified statements of beliefs about race into 
the different categories that were defined by the content areas. He identified 
examinee behaviors as agreeing or disagreeing with the belief statements through their 
responses on a Likert scale. Thus, responding to the test items became the observable 
behavior Sheperis used to measure examinees' status in racial identity development. 
Because of the nature of the BARIS, Sheperis did not conduct a job analysis of the 
test items but did conduct a factor analysis. 



ITEM GENERATION 


Now you know that the questions your instructor is going to ask you on your next 
test are essentially observables. So, if the test items themselves are really small observ- 
able behaviors that the instructor is choosing to determine whether students have 
adequate knowledge of the course content, it follows that the instructor will proba- 
bly give a great deal of attention to writing the items themselves. Likewise, students 
will have many questions about the test items when preparing to study for the test. 
Students will probably ask how many items will be on the test and what percentages 
of the test will cover the different content areas included on the test. Students may 
also ask what the item format will be. These are questions that all test developers 
must answer when generating test items. They must give special consideration to the 
number of items to devote to certain topics or areas and the format of the test items. 



Allocating Proportionate Numbers of Items 



As you know, answers to test items are samples of behavior. It is important to keep 
the word samples in mind. In most instances, it would be virtually impossible for a 
test to thoroughly measure all aspects of a content area or construct for the simple 
reason that it would be far too time consuming. Therefore, items must be chosen to 
provide a representative sample of the behaviors that are included in the content area 
or construct that the test purports to measure (Hopkins, 1996). Furthermore, it is 
crucial that the proportion of test items devoted to each topic or area covered by the 
test reflects the importance of each of the individual areas being measured. 
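One way to turn relative importance into item counts is a simple proportional allocation. The sketch below is an assumption-laden illustration: the content areas and weights are hypothetical, and leftover items are handed to the areas with the largest fractional remainders so that the counts always add up to the planned test length.

```python
import math


def allocate_items(total_items, weights):
    """Split `total_items` across content areas in proportion to `weights`."""
    weight_sum = sum(weights.values())
    exact = {area: total_items * w / weight_sum for area, w in weights.items()}
    counts = {area: math.floor(share) for area, share in exact.items()}

    # Give any leftover items to the areas with the largest remainders.
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(weights, key=lambda a: exact[a] - counts[a], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


# Hypothetical blueprint: relative importance of three content areas.
blueprint = {"scaling": 50, "item formats": 30, "item analysis": 20}
print(allocate_items(60, blueprint))
```

A table of this kind is often called a test blueprint; the weights themselves remain a judgment call for the test developer.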



Selecting an Item Format 



After test developers have decided what proportions of the test will be devoted to 
different topics or areas, they must select the format of the items. There are many 
item formats from which to choose, including the free-response format, the multi- 
ple-choice format, the true-false format, the Likert scale format, and many others. 
The format selected should depend on what the examiner wants to know and should provide a 
useful method for getting that information. If the test itself is well constructed, there 
is no technical advantage in using any one particular format for the items; however, 
test developers should choose an item format based on their own preferences, the 
setting in which the test will be used (Janda, 1998), and the type of information 
needed. Additionally, when choosing a format, test developers should be aware of 
the advantages and disadvantages associated with different item formats. For exam- 
ple, although in some instances multiple-choice formats may not be well suited to 
measure a broad cognitive range, multiple-choice tests are easy to score and quick to 
administer. Free-response formats may provide test administrators with more infor- 
mation about the examinees' thought processes, but tests using this format are more 
difficult to score and more expensive to administer (Martinez, 1999). 



Descriptions of Item Formats 



Item formats may be very simple, or they may be quite complex. The simplest for- 
mat is the dichotomous format, in which examinees are given two alternatives they 
must choose between in order to respond to each item. (Note: A true-false item is a 
dichotomous test item because the examinee must choose from two possible re- 
sponses — true or false.) Dichotomous formats are used not only for achievement 
tests but also for personality tests (Whiston, 2005). Some advantages of the dichoto- 
mous format are the ease with which tests in this format can be administered and 
scored and the fact that the examinees must use absolute judgment or decisiveness 
in choosing between the responses rather than being uncertain or vague. A major 
disadvantage of the dichotomous format when applied to an educational achieve- 
ment test is that examinees have a 50% chance of getting an item correct, and it may 
be difficult to determine whether examinees are merely guessing. 

Another relatively simple item format is the polytomous format. The polyto- 
mous format is much like the dichotomous format except that the examinee is given 
more than two response choices. (Multiple-choice items and matching items are 
items written in a polytomous format.) Advantages of tests that use the polytomous 
format include ease of administering and scoring. Also, compared with the dichoto- 
mous format, it is less likely that an examinee will get a correct answer by guessing 
on an item written in the polytomous format. The polytomous and dichotomous 
formats are used for all types of tests and are sometimes referred to collectively as the 
selected-response format (Cohen & Swerdlik, 1999). 

Both the dichotomous format and the polytomous format are item formats that 
an instructor may use on your next test because they are well suited to achievement 
tests. An item format that the instructor is not likely to use is the Likert format, which 
was described earlier in this chapter because it also represents a scaling method. As you 
remember, the Likert format requires examinees to indicate whether or not they agree 
with a statement or question by selecting from five choices that represent a contin- 
uum from Agree to Disagree. The Likert format is often used for personality, atti- 
tude, career, and aptitude tests (Whiston, 2005). 

Another item format available to test developers is the category format. This for- 
mat is very similar to the Likert format in that examinees are asked to rate items; 
however, examinees are given more choices for an item written in the category for- 
mat than they are given for an item written in the Likert format. For example, in- 
stead of having 5 choices representing the continuum, examinees may have 10 
choices (give or take a few). Giving examinees more choices along a continuum al- 
lows them to make finer distinctions in their ratings of the items (Whiston, 2005). 

Two other item formats that are sometimes used in personality tests are the 
checklist format and the Q-sort format. The checklist format requires examinees to 
read through a list of words or statements and check the ones that describe them- 
selves or their opinions, beliefs, or attitudes. Effectively, there are two possible re- 
sponses an examinee may choose for each item: checked (applies to examinee) or 
not-checked (does not apply) (Whiston, 2005). The Q-sort format allows examinees 
to describe themselves or others. Examinees are given statements and asked to sort 
them into a specified number of piles (e.g., nine) to indicate the degree to which they 
apply to the person they are describing. Examinees would place statements that did 
not apply in pile 1 and statements that definitely applied in pile 9. 

A final item format test developers may choose to use is the constructed-response 
format (also called the free-response format) (Janda, 1998), which requires examinees to 
construct their own responses instead of choosing from a selection of responses. There 
are three types of constructed-response items: the completion item, the short-answer 
question, and the essay question (Cohen & Swerdlik, 1999). The completion item re- 
quires an examinee to respond by supplying a word or phrase to complete a sentence. 
You may know completion items as fill-in-the-blank items. The short-answer question 
requires examinees to respond by writing a short answer to a question (probably no 
longer than a paragraph and possibly as short as a single word). The essay question also 
requires an examinee to write an answer to a question; however, in most cases, the an- 
swer should be longer than a paragraph (Cohen & Swerdlik, 1999). The constructed- 
response format is often used for items on tests like a course exam. The advantages of 
using this type of format include the possibility of assessing examinees' understanding 
of course content on a deeper level than the level that may be assessed by other item 
formats. Disadvantages include difficulty in scoring and the length of time examinees 
may take to answer short-answer and essay questions. 



Think About It 6.1 What type of test item format would be the most 
effective to measure your ability to understand the information in this chapter? 
What types of item formats do you prefer? What types do you dislike? 
Why? 



An Example of Item Generation 

When Sheperis (2001) was generating the items for the BARIS, he first had to deter- 
mine how many items to devote to each of the three statuses of racial identity devel- 
opment that the test was intended to measure (assimilation, self-segregation, and 
universal acceptance). He chose the proportion of items that would apply to each 
status. The number of items applying to each status is roughly equivalent, and any 
differences in proportion are accounted for in the scoring procedures. 
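Summative scoring by status can be sketched as follows. The scoring key below is entirely hypothetical (it is not the BARIS key, which is not given here), and reporting a mean alongside each raw sum is only one assumed way to adjust for statuses that have different numbers of items.

```python
# Hypothetical key mapping each status to the item numbers that measure it.
SCORING_KEY = {
    "status_a": [1, 4, 6],
    "status_b": [2, 5],
    "status_c": [3, 7, 8],
}


def score_subscales(responses, key):
    """Sum the 1-4 ratings for each status and also report the mean rating.

    `responses` maps item number -> rating. The mean puts statuses with
    different numbers of items on a comparable footing.
    """
    scores = {}
    for status, items in key.items():
        ratings = [responses[item] for item in items]
        scores[status] = {
            "raw": sum(ratings),
            "mean": sum(ratings) / len(ratings),
        }
    return scores


# Hypothetical examinee: item number -> circled rating.
responses = {1: 4, 2: 1, 3: 3, 4: 3, 5: 2, 6: 4, 7: 4, 8: 2}
print(score_subscales(responses, SCORING_KEY))
```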

The next decision Sheperis (2001) had to make was which item format he 
would use. Although the dichotomous format is often used in personality and atti- 
tude assessments, Sheperis chose the Likert format, which gave examinees more lat- 
itude to describe their beliefs than the dichotomous format would have. The di- 
chotomous format would have allowed examinees only to agree or disagree. 



TECHNICAL ANALYSES 



Many counseling students will take a comprehensive exam prior to graduation. 
Today many counseling programs use a standardized exam developed by the Center 
for Credentialing and Education (CCE; www.cce-global.org), called the Counselor 
Preparation Comprehensive Examination (CPCE). Part of the reason for adopting a 
standardized exam is the difficulty involved in developing appropriate items from se- 
mester to semester. It is much easier and less expensive for university counseling pro- 
gram faculty to use a published instrument than to develop a quality comprehensive 
exam on their own. Developing good items for a test requires the test author to eval- 
uate each item in a number of ways. This process of evaluation is typically referred 
to as item analysis and involves an examination of item difficulty and item discrim- 
ination. Item analysis involves a variety of statistical techniques, and the process can 
be quite complex. Only a cursory overview of the process is presented here. Readers 
interested in a more in-depth discussion of item analysis are referred to Anastasi & 
Urbina (1997). 



Item Difficulty 



When preparing for a "comprehensive exam," it is important to recognize that 
students probably won't answer all of the items correctly. These types of exams are usually 
criterion-referenced exams and are based on an examination of minimal competency in 
relation to a criterion rather than on competition among examinees. Some of the items 
will be difficult for most examinees to answer. So why not make the questions easier? 
Let's assume that all examinees pass the comprehensive exam with flying colors. 
This would seem to indicate that each student has met the minimum criterion for 
knowledge of practice in counseling. However, because the test items did not discriminate 
among examinees, it would be difficult, if not impossible, to make this assertion. 
Thus some students who did not possess adequate knowledge of the profession 
would be granted degrees. Because a main ethical principle is to "do no harm," 
creating a test that everyone could pass would be highly unethical. Conversely, if one 
created a comprehensive exam that no one could pass, then it would still fail to 
discriminate among students. Professors would also have a large number of disgruntled 
students to manage. Thus the task of item development is complex. 

Item difficulty is a central issue in the technical analysis of a test, especially 
measures of achievement or ability. Item difficulty is defined in terms of the proportion of 
examinees who answer an item correctly. Thus, if 50% of the participants answer a 
particular item correctly, that item has an item difficulty index of 0.50. Would this be a 
good item? Is it difficult enough? The essence of item difficulty analysis is to deter- 
mine the degree to which an examinee could correctly answer an item by chance 
alone. If the item with a 0.50 difficulty index is a true-false question, the examinee 
would have a 50% likelihood of getting the right answer by chance. Although as a 
student you might like these odds, the truth is that the item would not discriminate 
adequately between those who truly knew the answer and those who did not. 
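The item difficulty index is simple to compute once responses have been scored right or wrong. The sketch below assumes a hypothetical data layout: one row per examinee and one column per item, with 1 for a correct answer and 0 for an incorrect one.

```python
def item_difficulty(responses):
    """Return the proportion of examinees answering each item correctly."""
    n_examinees = len(responses)
    # zip(*responses) regroups the rows so that each tuple holds one item.
    return [sum(item) / n_examinees for item in zip(*responses)]


# Hypothetical scored responses: 4 examinees by 3 items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(item_difficulty(responses))
```

Note that a higher index means an easier item, because more examinees answered it correctly.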

So how does one set an appropriate difficulty level and make sure that 
each item meets it? The first step is to determine the percentage of correct 
responses related to chance. To illustrate, let's continue with the true-false item and 
the 50% rate due to chance. To establish the usefulness of this true-false item, we 
must seek a difficulty level that is higher than 50%. Based on best practices in 
the field, we usually set the difficulty level halfway between a difficulty level of 100% 
(i.e., everyone getting the item right) and the rate of chance (i.e., 50%). To calculate 
the optimum difficulty level for our sample item, we subtract the chance level (50%) 
from the 100% success level and then divide the result by 2. The last step is to add 
the result of our division to the chance rate, thus providing an optimum difficulty 
level. In this case, 

(1.00 - 0.50) / 2 = 0.50 / 2 = 0.25;  0.25 + 0.50 = 0.75 (optimal item difficulty level) 
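The halfway rule just described can be applied to items with any number of response choices. The sketch below is an illustration only; it assumes that the chance rate for a selected-response item is one divided by the number of choices, and the function name is hypothetical.

```python
def optimal_difficulty(n_choices):
    """Return the midpoint between the chance rate and 1.00 (everyone correct)."""
    chance = 1 / n_choices            # e.g., 0.50 for a true-false item
    return chance + (1 - chance) / 2  # halfway between chance and 1.00


print(optimal_difficulty(2))  # true-false item
print(optimal_difficulty(4))  # four-option multiple-choice item
```

Because guessing is less likely to succeed when there are more choices, the optimal difficulty level for a multiple-choice item comes out lower than the 0.75 obtained for a true-false item.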

Thus it would be expected that 75% of individuals attempting this item would 
answer it correctly. Considering the purpose of comprehensive exams (i.e., minimum 
competency), this might be an appropriate difficulty level. However, it is important 
to vary the difficulty level of items throughout the exam. Most people have taken a 
test in which the first item completely stumped them and the resulting performance 
suffered whether one knew the remaining answers or not. For this reason, a good 
approach to test construction is to place easier items at the beginning of a test and to 
increase item difficulty as the test progresses. This allows examinees a chance to build 
confidence in their performance and may reduce anxiety surrounding the test 
situation. In some cases, test authors may even provide items at the beginning of a test 
that have a 1.0 item difficulty index to increase the positive psychological state of 
examinees. However, it should be noted that items with difficulty indices that approach 1.0 
are typically discarded because of their inability to discriminate among respondents. The typical 
item difficulty index ranges between 0.30 and 0.70 for most tests in which responses 
are marked right or wrong. However, some test authors seeking greater scrutiny of 
test-taker knowledge may employ a sample of more difficult items. For example, 
some states are now employing a clinical exam for licensure as a professional 
counselor. This type of exam is usually related to practice knowledge as opposed to the 
theory knowledge inherent in the "comprehensive exam" example. Because the 
public welfare is at stake with regard to a licensure exam, it would make sense to have 
greater scrutiny of applicants through the use of more difficult items. 



Item Discrimination 


In theory, the purpose of an item discrimination index is to help assess the quality of 
a particular item. This task is achieved by examining the relationship between total 
test performance and performance on each individual item. By determining this re- 
lationship, we can decide if an item discriminates positively, discriminates negatively, 
or does not discriminate at all. A positively discriminating item is one that is answered 
correctly more often by those who perform well on the test. In contrast, a negatively 
discriminating item is one that is answered correctly more often by those who perform poorly on
the test. A nondiscriminating item fails to indicate a relationship between correct re- 
sponse and test performance. There are numerous statistically derived, computer- 
generated, item discrimination indices, and the reader is referred to an SPSS manual 
or statistics text for in-depth study. 
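
As one simple illustration of such an index, the sketch below correlates scores on a single right/wrong item with total test scores (a point-biserial correlation). The function name and the five-examinee data set are hypothetical, chosen only to show positive, negative, and absent discrimination. Operational item analyses are normally run in statistical software and may use other indices.

```python
from math import sqrt


def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and total test scores."""
    n = len(item_scores)
    if n == 0 or n != len(total_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    covariance = sum((i - mean_item) * (t - mean_total)
                     for i, t in zip(item_scores, total_scores))
    ss_item = sum((i - mean_item) ** 2 for i in item_scores)
    ss_total = sum((t - mean_total) ** 2 for t in total_scores)
    if ss_item == 0 or ss_total == 0:
        # Everyone answered the same way: the item cannot discriminate.
        return 0.0
    return covariance / sqrt(ss_item * ss_total)


# Hypothetical total scores for five examinees, highest to lowest.
totals = [9, 8, 7, 4, 3]
positive = item_total_correlation([1, 1, 1, 0, 0], totals)  # high scorers pass
negative = item_total_correlation([0, 0, 0, 1, 1], totals)  # low scorers pass
flat = item_total_correlation([1, 1, 1, 1, 1], totals)      # everyone passes
print(positive, negative, flat)
```

A value above zero corresponds to a positively discriminating item, a value below zero to a negatively discriminating item, and a value at or near zero to a nondiscriminating item.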

Some professional counselors may find the discussion of psychometric evalua- 
tion, such as item discrimination, tedious and may even wonder how these types of 
analyses will apply to work in the field. Although few students will likely pursue a ca- 
reer in test construction, it is important to be a qualified user of psychological in- 
struments in order to function in future work settings. Part of being a qualified user 
means understanding how to evaluate the usefulness of an instrument as well as un- 
derstanding the usefulness of items within the instrument. The item discrimination 
index functions as an indicator of the quality of an item. If one is attempting to in- 
terpret the results of a test by comparing an individual's responses to a norm group, 
the item discrimination index tells the degree of confidence one can have in making 
an interpretation based on a response to a particular item. 



Think About It 6.2 Consider such exams as the Scholastic Assessment Test
(SAT) and the Graduate Record Examination (GRE). Why would assessing item
difficulty and item discrimination for these tests be especially important?




Norms 



In order to make individual raw scores or individual scale scores meaningful, test au- 
thors often administer the instrument to a large comparison sample, or norm group. 
The examinee's raw score is usually transformed to a standard score (e.g., z-score, T 
score, percentile rank, deviation IQ, or stanine) and then compared to the perform- 
ance of other individuals with similar characteristics (e.g., age, grade, gender, ethnic- 
ity, etc.). This population of individuals is referred to as the standardization sample, 
normative sample, or the norm group. The comparison scores are called derived scores 
and are placed into two groups: developmental scores and scores of relative standing 
(Salvia & Ysseldyke, 2004). 
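
A minimal sketch of how a raw score might be transformed into several derived scores follows. It assumes the common conventions of a T score with a mean of 50 and a standard deviation of 10 and a deviation IQ with a mean of 100 and a standard deviation of 15; the percentile rank is read from a normal curve, which is also an assumption (test publishers usually derive percentiles from the actual norm group distribution). The function name, norm mean, and norm standard deviation are hypothetical.

```python
from statistics import NormalDist


def derived_scores(raw: float, norm_mean: float, norm_sd: float) -> dict:
    """Convert a raw score to z, T, deviation IQ, and a normal-curve percentile."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return {
        "z": z,
        "T": 50 + 10 * z,                 # assumed mean 50, SD 10
        "deviation_iq": 100 + 15 * z,     # assumed mean 100, SD 15
        "percentile_rank": 100 * NormalDist().cdf(z),  # assumes normality
    }


# Hypothetical norm group with a mean of 50 and a standard deviation of 10.
scores = derived_scores(raw=60, norm_mean=50, norm_sd=10)
print(scores)
```

In this hypothetical case the raw score falls one standard deviation above the norm group mean, so every derived score expresses that same relative standing on a different scale.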

Many tests use a procedure called stratified sampling, which seeks to sample the 
general population by replicating the percentage of participants according to demo- 
graphic characteristics. Some important demographic characteristics commonly used 
include sex (i.e., male, female); age (in years); grade (for achievement tests); race (e.g.,
White, African American, Asian American, Hispanic American, Native American); 
region of (U.S.) residence (e.g., south, west, northeast, north central); socioeconomic 
level (e.g., parent educational attainment, family income, parent occupational sta- 
tus); and area of residence (e.g., urban, suburban, rural). In America, the U.S. 
Census is consulted, and participants are sampled according to their occurrence in 
the general population (e.g., 50% male, 50% female). 
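
The proportional logic behind stratified sampling can be sketched as follows. The function name, the sample sizes, and the area-of-residence percentages are hypothetical and are not census figures; only the 50% male, 50% female split comes from the example above. Leftover participants caused by rounding are assigned to the groups with the largest fractional remainders, which is one reasonable choice among several.

```python
from math import floor


def proportional_quotas(sample_size: int, population_shares: dict) -> dict:
    """Allocate a norm sample across groups in proportion to population shares."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1.0")
    exact = {group: sample_size * share
             for group, share in population_shares.items()}
    quotas = {group: floor(value) for group, value in exact.items()}
    # Hand out any participants lost to rounding, largest remainder first.
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for group in by_remainder[:leftover]:
        quotas[group] += 1
    return quotas


sex_quotas = proportional_quotas(200, {"male": 0.5, "female": 0.5})
# Hypothetical shares, for illustration only.
area_quotas = proportional_quotas(101, {"urban": 0.5, "suburban": 0.3, "rural": 0.2})
print(sex_quotas, area_quotas)
```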

An additional consideration is the number of participants to include in a norm 
group. According to Salvia and Ysseldyke (2004), a general rule of thumb is 100 par- 
ticipants per age category for screening tests, and 200 participants per age category 
for diagnostic tests. Sampling is an absolutely critical consideration in test develop- 
ment, and particular attention should be paid to multicultural and diversity consid- 
erations. If a norm sample underrepresents key groups (e.g., racial, socioeconomic, 
sex), it becomes difficult to support the accuracy of interpretations for those indi- 
viduals examined using the test. 
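
The rule of thumb attributed to Salvia and Ysseldyke (2004) reduces to simple multiplication, sketched below. The function name and the six age categories are hypothetical; the result is a minimum count only and says nothing about whether the sample is representative.

```python
# Participants per age category, per the rule of thumb quoted in the text.
PER_AGE_CATEGORY = {"screening": 100, "diagnostic": 200}


def minimum_norm_group_size(test_type: str, age_categories: int) -> int:
    """Return the minimum norm group size suggested by the rule of thumb."""
    if test_type not in PER_AGE_CATEGORY:
        raise ValueError("test_type must be 'screening' or 'diagnostic'")
    if age_categories < 1:
        raise ValueError("age_categories must be at least 1")
    return PER_AGE_CATEGORY[test_type] * age_categories


# Hypothetical instrument normed across six age categories.
screening_n = minimum_norm_group_size("screening", 6)
diagnostic_n = minimum_norm_group_size("diagnostic", 6)
print(screening_n, diagnostic_n)
```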

As an example, the BARIS was normed on a group of Black adolescent males in 
the southern United States. Thus scores for an individual test taker can be compared 
to average scores of other Black adolescent males in the same geographic region. 
However, the development of norm-referenced scores is not as simple a task as is in- 
dicated by this example. The nature of this chapter does not allow for extensive dis- 
cussion of the development of norm-referenced scoring procedures. For further in- 
formation on this topic, readers are referred to the Standards for Educational and 
Psychological Testing (AERA, APA, & NCME, 1999). Readers should also refer to 
Chapter 5 in this text for more in-depth discussion on this topic. 



SUMMARY/CONCLUSION 



The development of psychological tests is an intricate process that often takes several 
years to complete effectively. In order to select quality tests, professional counselors 
should develop a basic understanding of the test construction process. In general, 
test construction occurs in distinct phases: 




1. Needs analysis. Because the development of quality tests is such a time-consuming
process, test authors often establish a need for a certain test before beginning
the construction process. Needs analysis can be conducted through formal
surveys or through an analysis of current instruments available (Drummond,
2004).

2. Test purpose. Once a need for a test is established, it is then important to de- 
velop clear, behavioral objectives for the development of the proposed instru- 
ment. One of the objectives should be related to the construct or content do- 
main to be measured (AERA et al., 1999). For example, the BARIS was 
designed to measure the racial or ethnic identity of Black adolescent males. 

3. Item format. Prior to beginning the development of specific items for an instru- 
ment, it is important to determine the appropriate format for meeting the 
stated test purpose. Item formats include multiple-choice, forced-choice, open- 
response, true- false, essay, or Likert scale (Janda, 1998). In order to provide re- 
spondents with a limited range of choices, a forced-choice response format was 
employed for the BARIS, with the choices being (a) Strongly Agree, (b) Agree, 
(c) Disagree, and (d) Strongly Disagree. 

4. Choosing an approach to test construction. Several approaches to test construction 
are available (e.g., rational approach, empirical approach, and bootstrap ap- 
proach). The bootstrap approach was used to develop the BARIS. The bootstrap 
approach is a combination of the rational approach and the empirical approach. 
The item pool for the BARIS was derived from racial identity development the- 
ories. Empirical methods of analyses were used to maintain or discard items 
from the initial pool. 

5. Item development. Writing effective test items is a difficult process. Test items 
should be reviewed by a panel of experts to ensure that the items cover the do- 
main being measured and to determine the degree to which the items match 
the purpose of the test. Previously existing theories and item pools should also
be explored to ensure the items included on a test represent the domain of con- 
tent being assessed. Items in the BARIS were reviewed by experts in the field of 
multicultural counseling. 

6. Pilot test. Prior to administering the instrument to a large sample, a pilot test 
should be conducted to determine item difficulty, discrimination, and compre- 
hension. Test authors often ask pilot test participants to complete feedback 
sheets that ask about the participants' (a) perception of the test, (b) particularly 
easy or difficult items, (c) confusing terms, (d) clarity of directions, and (e) gen- 
eral concerns. Test authors conduct in-depth item analysis studies to be sure 
items "behave" as expected. This process was completed in the initial pilot test 
for the BARIS. 

7. Item review. After the initial pilot test, it is important to review findings about 
item difficulty and discrimination in order to determine items that should be 
removed from the item pool. Test authors should also examine items for bias 
(i.e., cultural, gender, socio-economic, ability, and sexuality). According to the 
Standards for Educational and Psychological Testing (AERA et al., 1999, p. 82), 
"Test developers should strive to identify and eliminate language, symbols, 






words, phrases, and content that are generally regarded as offensive by mem- 
bers of racial, ethnic, gender, or other groups, except when judged to be neces- 
sary for adequate representation of the domain." 

8. Preparing the test for operational use. Once the pilot test has been completed and 
the remaining items are reviewed for bias, it is important to prepare the test for 
operational use. This means that the author should review the objectives and 
purpose of the test to ensure that the resulting instrument still meets the origi- 
nal intent of the author; scoring procedures should be independently verified; 
and the instrument should be reviewed by various committees. 

9. Establishing the psychometric properties of the test. One of the last steps in test de- 
velopment is to establish the technical properties. The test author must deter- 
mine an appropriate sample size for the statistical analyses to be performed on 
the instrument. Sample size can vary greatly depending on the analyses em- 
ployed. Once sample size is determined, the test author administers the instru- 
ment, scores it, and computes reliability and validity coefficients (Drummond, 
2004). This process can occur in several phases and several individual research 
endeavors. Finally, the author develops norms for the test. 

10. Ensuring the appropriateness of the norm or criterion group. Test authors provide
norms derived from appropriate sampling procedures (e.g., stratified or selective 
samples) that account for multicultural and diversity considerations. Test users 
must ensure that the test is used to make decisions only about clients for whom 
the test was designed and validated for use. 



Think About It 6.3 Why would it be important to carry out all of these 
steps when developing a test? What would happen if a step were skipped? 
Would the test still be an effective measure of the desired construct? Explain. 



KEY TERMS 



age range 

bootstrap approach 

content 

criterion referenced 

dichotomous format 

empirical approach 

items 

item analysis 

item difficulty 

item format 

norm group 

norm referenced 



objectives 

observables 

polytomous format 

population 

purpose 

rational approach 

reading ability 

scaling 

stratified sampling 

summative scale 

theory 




CHAPTER 



7 



Clinical Assessment 

by Bradley T. Erford, Carol Salisbury, Kathleen McNinch, 
Carl Sheperis, R. Anthony Doggett, and Ota Masanori 



Overall, professional counselors in clinical practice engage in clinical and per- 
sonality assessment more frequently than any other type of assessment. 
Knowing the characteristics and conditions of clients is important regard- 
less of counseling specialty. Clinical and personality assessment is defined and ex- 
plored in detail in this chapter, and numerous inventories commonly used by pro- 
fessional counselors are presented and reviewed. In addition, the basic process of 
clinical interviewing is introduced, both for general and for more specific purposes, 
such as when conducting a mental status exam. Personality assessment is viewed 
from both the psychoanalytic and the "big-five model" perspectives, thus allowing a 
basic introduction to projective and objective personality assessment. 



WHAT IS CLINICAL ASSESSMENT? 



To some, clinical assessment and personality assessment are one and the same. They 
are ways of understanding the dispositions, characteristics, strengths, and limita- 
tions of the internal world of a client and how that client interacts and functions 
within the client's external world. Some even view personality as a global, holistic, 
all-encompassing construct that subsumes all the other facets of life and especially 
the facets of assessment covered in this book. In other words, in the broadest sense 
of the word, intelligence, aptitude, achievement, career, normal and abnormal be- 
havior and emotions, personal adjustment, family, and everything else are sub- 
sumed under the category of personality. Unfortunately, while well intentioned, 






such a perspective or approach broadens the study of personality far beyond a manageable
degree. The perspective taken throughout this chapter is far more circumscribed.
Here, clinical assessment is defined as the measurement of clinical symptoms
and pathology in the human condition; in other words, assessment for the
purpose of clinical diagnosis. Personality assessment, on the other hand, is the 
measurement of client traits, needs, motivations, attitudes, or other facets that de- 
scribe how the client interacts with the external environment, others within that 
environment, and within the client's internal world. While some may view some 
of these intrapersonal or interpersonal interactions to be normal or abnormal, the 
purpose of personality assessment is more appropriately conceived as describing 
the personal functioning of an individual globally or within some context. 

While some may view this distinction as artificial, the implications are not. 
Professional counselors are often required to diagnose and treat clients with mental 
and emotional disorders. A client may present with symptoms of depression, anxi- 
ety, disruptive behavior, substance use, and so forth. To diagnose and treat the client 
in an ethical and professional manner, professional counselors will rely on tests and 
techniques that facilitate the diagnostic and treatment process, and determine the 
outcomes of treatment — three primary purposes of clinical assessment. While it may 
be helpful to understand the personality characteristics of a client, it is not always es- 
sential for effective treatment, particularly when using brief treatment approaches. 
When diagnosing and treating clients, professional counselors often use assessment 
procedures such as clinical interviewing, structured clinical tests — e.g., the Minnesota 
Multiphasic Personality Inventory — Second Edition (MMPI-2), the Millon Clinical 
Multiaxial Inventory — III (MCMI-III) — and a mental status exam to facilitate effi- 
cient and accurate diagnosis and treatment. 

When a client seeks counseling for self-growth or a personal or interpersonal 
problem not amenable to clinical diagnosis, clinical assessment is probably not war- 
ranted. However, personality tests can be helpful in deepening both the professional 
counselor's and the client's understanding of the client's personality and coping 
mechanisms when under normal and stressful circumstances. Developing such an 
understanding of thoughts, feelings, and behaviors provides a basis for clients to un- 
derstand why they think, feel, and behave the way that they do. To facilitate this un- 
derstanding, professional counselors often use assessment procedures such as devel- 
opmental interviewing, structured personality tests (e.g., the Myers-Briggs Type
Indicator [MBTI], the 16PF, the NEO-PI-R), or unstructured, projective tests
and techniques (e.g., House-Tree-Person, Incomplete Sentences, Thematic Apperception
Test). These instruments and the general topic of personality assessment are addressed
in more detail in Chapter 8. 

Importantly, many psychological instruments can provide helpful information 
to understand a client's clinical issues and personality functioning. So, while these
categories may seem mutually exclusive, tests and test items can be designed to pro- 
vide information about both. For simplicity's sake, the authors of this chapter have 
chosen to present these tests in the domain in which they are most commonly used 
in clinical practice. 






CAUTIONS WITHIN CLINICAL ASSESSMENT 



In Chapter 1, the general purposes of assessment were outlined. The three purposes 
most relevant to clinical assessment are diagnosis, treatment planning, and outcomes 
assessment. Cohen and Swerdlik (1999, p. 482) indicated three primary questions 
addressed by clinical assessment: (1) Does this person have a mental disorder, and if 
so, what is the diagnosis? (2) What is the person's current level of functioning? (3) 
What type of treatment shall this patient be offered? Erford (2006, p. 9) added an- 
other: How effective were the implemented interventions? 

Many professional counselors find it most efficient to use a combination of in- 
terviewing and structured test administration to quickly and accurately diagnose 
client concerns. That said, if all clients were totally self-aware, open, and forthright
in their responses, clinical assessment would be simple, and the text of this chapter 
could move immediately to the sections on interviewing and structured inventories. 
Unfortunately, clients present with varying levels of self-awareness, openness, and 
forthrightness, and professional counselors must take great care to ensure that the 
diagnostic and treatment decisions made about a client are based on accurate infor- 
mation. Thus, professional counselors must be well aware of important bias issues 
in both assessment and decision making (i.e., judgment). 

Bias in clinical interviewing has been studied for years. Darley and Fazio (1980) 
coined the term hypothesis confirmation bias to explain the observed phenomenon 
in which interviewers develop hypotheses to explain the concerns being presented 
by a client and then proceed to ask questions and elicit responses that confirm those 
hypotheses. While on the surface this may sound like good, sound practice, Darley 
and Fazio found that clinicians frequently confirmed incorrect hypotheses by inter- 
preting ambiguous information as supportive of the hypothesis and discounting ev- 
idence that did not support the hypothesis. Likewise, the term self-fulfilling 
prophecy (Dipboye, 1982) has been used to describe the client's propensity to 
change responses and behavior to conform to the expectations of the examiner. 
Often, the client will actually change thoughts, feelings, or actions to align with the 
perceived expectations of the interviewer. For example, assume a client with low anx- 
iety responds that he or she feels anxious from time to time to a mild degree. If the 
professional counselor pursues this issue with a line of questioning aimed at under- 
standing the degree of anxiety involved, especially in the context of situations the 
client may find otherwise troublesome (e.g., interpersonal or workplace relation- 
ships), then the client may perceive and "admit" the anxiety to be more problematic 
than first suspected. Thus, the client fulfills the perceived prophecy, even though it 
may not be true. With these possible threats to the validity of interview results, pro- 
fessional counselors in training may wonder why interviewing is so popular among 
clinicians. Again, bias provides the answers. Arvey and Campion (1982) suggested
three reasons: (1) Interviews provide a depth of information and perspective that is 
difficult to obtain using tests alone, (2) clinicians believe themselves to be unbiased, 
ostensibly because they are good, helpful people, and (3) clinicians believe they are 
objective and unbiased because they are highly trained and skilled. Note that the 
final two reasons involve beliefs on the part of the clinician. No matter how well 






intentioned, any belief can be biased. After all, that is why it is called a belief and 
not a truism, fact, or law. Every professional counselor must guard against interview 
response bias. No one is immune. 

Equally important, test results can also be biased and inaccurate. It is not hard 
to understand that results will be inaccurate if someone responds dishonestly to 
questions. But in actuality, many factors influence student and client responses to 
items or questions and their subsequent scores on tests. Sometimes these factors may 
be related to the test itself, while at other times to examiner or examinee variables. 
Some clients or students may present themselves dishonestly, or lack self-awareness 
to respond appropriately. Others may not trust the professional counselor for a vari- 
ety of reasons, some of which have more to do with the client than the counselor. 
Still others may respond inaccurately because of the way a question is phrased, or 
the type of response choices required. Regardless of the cause, the result is problem- 
atic. Inaccurate client responses lead to inaccurate scores, inferences, and interpreta- 
tions (i.e., errors). Table 7.1 provides brief descriptions of a number of factors influ- 
encing client responses and performances commonly encountered by professional 
counselors in clinical practice. A more in-depth discussion of these issues can be 
found in Erford (2006). 

In the context of this discussion of clinical response accuracy, further expansion 
of this list becomes necessary. In the early years of psychological testing (i.e., 
1920s-1930s), little concern was given to the accuracy of client responses to personality
or clinical questions. Many assumed that clients would respond honestly, and
while many clients did respond honestly, examiners quickly learned that not every- 
one did. While honesty is a good thing, the present-day field of assessment has 
evolved in such a way that many clients seek the services of professional counselors 
for help with issues of great importance: child custody, criminal actions, disability 
documentation, infidelity, and divorce, to name but a few. Likewise, client self- 
awareness and the relationship between client and counselor can significantly influ- 
ence the accuracy of client responses. Thus, to assume that all clients always respond 
accurately is naive and dangerous. A professional counselor's judgment frequently 
has personal, financial, and legal implications in such high-stakes decisions. 

During the 1940s and through present day, developers of clinical, personality, 
and behavioral inventories have expended a great deal of effort to construct validity 
scales that can help identify client response styles. Identification of these response 
modes can help professional counselors identify clients whose test protocols may be 
invalid or should be interpreted with caution. Many tests provide validity scales, and 
the names and functions of these scales vary widely. A good example of a present-day
clinical instrument with helpful validity scales is the 567-item, true-false
Minnesota Multiphasic Personality Inventory-Second Edition (MMPI-2). The MMPI-2
offers a number of helpful scales, including Cannot Say (?), VRIN, TRIN, F, L, K,
and S (Butcher et al., 2001).

Table 7.1 Factors that influence student and client test performance and item responses

Factor: Description

Motivation: Motivated clients provide accurate responses; unmotivated clients provide subpar
performance and inaccurate and/or dishonest responses. Client motivation is the most important
performance factor.

Anxiety: High and low levels of anxiety lead to low levels of performance. Moderate levels of
anxiety maximize performance. This is referred to as the Yerkes-Dodson law.

Coaching: Coaching is any procedure that gives a respondent an advantage. Coaching can involve
anything from a simple review of the domain of information being assessed to instructions on
giving specific responses to specific questions that will appear on a test. Suspicions of coaching
should be followed up on by the examiner.

Test sophistication: Test sophistication refers to procedural advantages enjoyed by some test
takers, but not others (e.g., experience filling in bubble response forms).

Acquiescence: The tendency to answer yes to yes/no questions and true to true/false questions
when an examinee is unsure of the correct answer.

Response format: Clients with reading problems, writing problems, poor vision, or disabilities
that make sitting difficult may become frustrated with a test requiring reading or constructed
written responses. Allowances should be made for audio-taped administration and oral response
procedures when possible.

Reactive effects: Clients may alter response styles and patterns in response to the interview or
evaluation process (e.g., a series of questions about depressive symptoms could lead clients to
perceive in themselves a greater degree of depression than previously considered).

Response bias: A client's response to a question influences responses to future questions (e.g.,
students who select "False" three times in a row may be more likely to select "False" on the next
item, even though they would have otherwise selected "True").

Physical or psychological condition: Clients sometimes present with visual or auditory acuity
problems or psychological processing deficiencies (e.g., central auditory processing disorder). In
addition, mental disorders can cause psychological conditions that detrimentally affect test
performance, such as moderate to severe depression or anxiety, or other disorders that exacerbate
mood or distractibility.

Social desirability: Some clients, consciously or unconsciously, may respond in a way that
portrays themselves in a more favorable manner (i.e., faking good) and appear less significantly
impaired than they really are. Others may portray themselves in a less favorable light (i.e., faking
bad) and appear more severely impaired than they really are.

Environmental variables: Some common individual-specific environmental effects include time of
day, testing room, lighting, seating arrangements/comfort, noise, and interruptions. Each could
affect a client's or student's motivation and performance, but the effects are so individualized that
scientific generalizations are normally lacking. Following standardized procedures and minimizing
environmental influences are primarily examiner responsibilities.

Cultural bias: Impressions, interpretations, and diagnoses can be influenced by the culture of the
examinee and examiner. The professional counselor strives for multicultural competence to
minimize biased conclusions.

Examiner-examinee variables: Some clients and professional counselors just seem to hit it off;
others don't. Race, sex, culture, attractiveness, personality, and other variables may influence a
client's performance, but scientific study indicates they seldom do.

Previous testing experiences: Positive or negative previous assessment experiences may lead to
higher expectations or lower self-confidence, thus influencing motivation and performance. Also,
some clients may remember content from a previous administration of a test and may have a
"memory" advantage on intelligence and achievement tests.

Source: The Counselor's Guide to Clinical, Personality, and Behavioral Assessment, by B. T. Erford
(Ed.), 2006, Boston: Lahaska Press/Houghton Mifflin.



While clients are encouraged to answer every one of the MMPI-2's 567 questions,
many do not. Because raw scores are summed and used to determine a client's
norm-referenced score, a client who does not complete a significant number of
questions may have deflated scores. This is because failing to answer a question is scored
in the nonkeyed (i.e., not clinically relevant) direction, as if to indicate that the client
does not have a problem. The Cannot Say (?) scale is simply a count of the items to
which no response was made. Generally if clients fail to respond to 30 or more items 
(about 5%), the protocol may be judged invalid; if 11-29 questions are not answered,
caution is warranted because some subscales may be invalid. Several helpful
scales are termed "content-free," because the content of the scale is not important in 
determining score validity. VRIN is the acronym for the Variable Response 
Inconsistency scale, which measures a client's pattern of inconsistent responding to 
pairs of items nearly identical in content. The VRIN raw score indicates the num- 
ber of inconsistent client responses. Inconsistent responding may mean the client is 
not paying attention, not taking the task seriously, or not comprehending the item
meanings. TRIN is the acronym for the True Response Inconsistency scale, which 
measures a client's pattern of inconsistent responding to pairs of items of opposite 
content. The TRIN raw score indicates the degree of client response inconsistency 
due to "yea-saying" (acquiescence) or "nay-saying" (nonacquiescence). T scores of 
80+ on the VRIN or TRIN scales indicate the protocol is invalid. 
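
The screening cutoffs quoted in this paragraph can be summarized in a short sketch. This is not the MMPI-2 scoring procedure; the function names, the use of `None` to mark an unanswered item, and the three labels are hypothetical, and only the numeric cutoffs (30 or more omissions, 11-29 omissions, T scores of 80 or higher on VRIN or TRIN) come from the text. The test manual remains the authority for interpretation.

```python
def count_omitted(responses) -> int:
    """Cannot Say (?) raw score: the number of items left unanswered (None)."""
    return sum(1 for response in responses if response is None)


def screen_protocol(omitted_items: int, vrin_t: float, trin_t: float) -> str:
    """Apply the cutoffs quoted in the text and return a screening label."""
    if omitted_items < 0:
        raise ValueError("omitted_items cannot be negative")
    # 30 or more omissions, or a VRIN or TRIN T score of 80+, flags the protocol.
    if omitted_items >= 30 or vrin_t >= 80 or trin_t >= 80:
        return "invalid"
    # 11-29 omissions: some subscales may be invalid.
    if omitted_items >= 11:
        return "caution"
    return "no flag"


# Hypothetical protocols, for illustration only.
print(count_omitted([True, None, False, None]))
print(screen_protocol(omitted_items=32, vrin_t=50, trin_t=50))
print(screen_protocol(omitted_items=15, vrin_t=55, trin_t=60))
print(screen_protocol(omitted_items=3, vrin_t=82, trin_t=50))
```

A label of "no flag" means only that these particular checks were passed; the content-specific validity scales still need to be reviewed.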

Other validity scales are content-specific. The Infrequency scale (F) is a measure 
alerting clinicians to unusual patterns of answers. These 60 items were selected be- 
cause they were infrequently endorsed by members of the original MMPI norm sam- 
ple. Clients with a high score on the F scale (T = 100+) generally are random responders
(i.e., paying no attention to items and just coloring in bubbles) or fixed
responders (i.e., mostly all true or mostly all false), or are "faking bad" by deliberately 
trying to portray themselves in a negative light. Of course, the professional coun- 
selor must rule out whether the client may also be accurately portraying severe 
pathology. The MMPI-2 also has F_B (Back F) and Fp (Infrequency-Psychopathology)
scales. The F_B scale is an infrequency-of-response scale for the latter part of the test
and, when compared to the total F score, helps determine whether clients changed 
their response approach during the administration (e.g., the client got bored and 
began to respond randomly, or to overreport symptoms). The Fp scale is interpreted 
in conjunction with VRIN and TRIN scales to determine whether a client may be 
responding randomly, "faking bad," or exaggerating pathological symptoms. 

The L scale was originally developed to assess the existence of a defensive mind- 
set by allowing clients to deny the existence of minor faults and flaws that most oth- 
ers readily admitted. While it may indicate deceit in test taking, the L scale is fre- 
quently used in conjunction with TRIN to determine the presence of "faking good" 
and nonacquiescence (i.e., nay-saying) when responding. The K scale was originally
developed to measure client response defensiveness so as to correct for this response 
style on the clinical scales. It was believed that if a clinician knew that a client was 
responding in such a way that would invalidate the protocol, corrections could be 
made to the clinical scales so as to still derive meaningful results from them. 
Interestingly, some researchers (McCrae & Costa, 1989) have shown that uncor- 
rected scale scores have higher validity than K-corrected scores. On other tests, re- 
searchers have also demonstrated this to be the case (Hsu, 1986; Kozma & Stones, 
1987; McCrae & Costa, 1983). The S scale (Superlative Self-Presentation) was
empirically derived by Butcher & Han (1995) by identifying items that were helpful in
discriminating between defensive and normal job applicants and norm sample
participants. Similar to the K scale, the S scale may also be helpful in determining
whether clients are presenting themselves in a socially desirable or nonacquiescent 
manner. 

Other clinical and personality tests have various validity scales under different 
names designed to assess client response patterns for varying purposes. And the use 
of validity scales is on the rise, no doubt due to clinician desire to more accurately 
identify invalid response protocols and better technology for developing such scales. 
Professional counselors are advised to seek specialized training and to read the man- 
uals of instruments using these scales in order to fully understand how this technol- 
ogy can be harnessed to enhance scale interpretation, and to understand the impli- 
cations of elevated scores. 



CLINICAL JUDGMENT VERSUS STATISTICAL MODELS 



Professional counselors must be wary of bias not only from clients and other informa- 
tion sources, but also from within themselves. Most professional counselors have great 
faith in their own clinical judgment; after all, professional counselors spend years in 
education and clinical preparation to practice their craft. They have successes and set- 
backs but are constantly improving and honing their skills to the point of competent 
practice. It is easy to assume that such a rigorous program of study and practice under 
supervision will remove bias and sharpen the professional counselor's clinical objectiv- 
ity. Unfortunately, such is not always the case. Regardless of how well educated, well 
trained, and well practiced one becomes, a professional counselor is only as perfect as 
the information obtained and interpreted and the decision-making model employed. 
Hopefully, the information presented in the preceding chapters has given readers an 
appreciation for the imperfection of the information they will encounter, the interpre- 
tive strategies they will employ, and the accuracy rates of various decision-making 
models. Errors will always be with us. However, there are ways that a professional 
counselor can increase the likelihood of more accurate decision making. 

Much has been written about the efficacy of decision-making models employing 
clinical judgment versus statistical models — and the evidence is that the statistical 
models are at least as accurate as, and usually superior to, clinical judgment (Dawes, 
1971; Dawes & Corrigan, 1974; Goldberg, 1970; Meehl, 1954, 1957, 1965). 
Intuitively, this makes sense, because statistical models are based on probabilities that 
can be empirically replicated, studied, and often improved on. Clinical judgment is 
individual-specific, so what makes sense to one professional counselor may not make
sense to another and may not be easily replicated by another. As with all things
related to measurement, reliability (i.e., replicability) sets the upper bound for
validity. So if clinicians cannot
replicate a decision model efficiently, the validity of results will be lowered. Such is the 
advantage of a statistical decision-making model; it is easily understood and replica- 
ble, therefore may produce more accurate decisions, although it may not always be 
presumed to do so. Betting against a statistical model is similar to betting against the 
house in a game of chance. Sometimes you will win, but the odds are always against 
you; skill and knowledge help sometimes, but not most of the time.
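
The idea that reliability caps validity can be made concrete with the classical test theory bound, under which the correlation between two measures cannot exceed the square root of the product of their reliabilities. The sketch below is illustrative only: the function name and the reliability values are assumptions chosen for the example and do not come from the studies cited above.

```python
import math

def max_validity(reliability_x: float, reliability_y: float = 1.0) -> float:
    """Upper bound on a validity coefficient under classical test theory.

    reliability_x: reliability of the decision procedure (hypothetical value).
    reliability_y: reliability of the criterion; 1.0 assumes a perfectly
                   reliable criterion, which is an idealization.
    """
    for r in (reliability_x, reliability_y):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability coefficients must lie between 0 and 1")
    return math.sqrt(reliability_x * reliability_y)

# Hypothetical comparison: judgments that replicate at .50 versus a rule that replicates at .90.
print(round(max_validity(0.50), 2))
print(round(max_validity(0.90), 2))
```

Under these assumed values, the less replicable procedure has the lower ceiling on validity, whatever its other merits.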






Statistical models in clinical decision making often rely on the use of cutoff 
scores that are empirically validated. Professional counselors are wise to consider the 
implications of "betting against" the statistical model. Experienced clinicians know 
the value of multiple sources of information from multiple respondents. When clin- 
ical judgment disagrees with the statistical model, the experienced clinician usually 
realizes that it is best to collect more information to arrive at a more reasoned deci- 
sion that one can endorse with greater confidence. 
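
The reconciliation strategy described above can be sketched as a simple rule: act when the cutoff-based decision and clinical judgment agree, and gather more information when they do not. This is a minimal sketch under stated assumptions. The function name, the convention that scores at or above the cutoff indicate concern, and the example numbers are all hypothetical and would need to match the instrument actually in use.

```python
def reconcile(score: float, cutoff: float, clinician_sees_concern: bool) -> str:
    """Compare a cutoff-based decision with a clinician's judgment.

    Assumes (hypothetically) that scores at or above the cutoff indicate concern.
    """
    model_sees_concern = score >= cutoff
    if model_sees_concern == clinician_sees_concern:
        # Both sources converge, so the decision can be endorsed with more confidence.
        return "concern indicated" if model_sees_concern else "no concern indicated"
    # The sources diverge: defer the decision and widen the evidence base.
    return "collect more information"

# Hypothetical case: the score falls above the cutoff, but the clinician sees no concern.
print(reconcile(score=68, cutoff=65, clinician_sees_concern=False))
```

The point of the sketch is that disagreement triggers more data collection from additional sources and respondents rather than an automatic override in either direction.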



Think About It 7.1 In your practice as a professional counselor, you will 
encounter situations in which a decision using your "statistical model" does 
not agree with your "clinical judgment." How will you reconcile this conflict 
to arrive at the best decision for your client? 



CLINICAL INTERVIEWING 



There are several essential components to an effective interview. First, establishing 
rapport is crucial. A professional counselor must relate a sense of mutual understand- 
ing, confidence, respect, and acceptance in order to facilitate effective rapport 
(Sattler, 2002). Establishing rapport is especially important in the initial interview to 
help clients feel comfortable enough to openly discuss their reasons for coming to 
counseling. Second, an interviewer needs to have effective facilitative skills. Effective 
interviews (a) identify client problems clearly; (b) obtain necessary information re- 
lated to the problems (e.g., antecedent, consequence); (c) assess client functioning, 
intellectual level, and psychosocial development; and (d) examine the effects of an 
intervention during and after the intervention. As Kratochwill, Sheridan, Carlson, 
and Lasecki (1999) posited, eliciting useful information largely depends on the in- 
terviewer's ability to strategically use questions and statements. 



Three Types of Interviews: 

Unstructured, Semi-Structured, and Structured



Depending on the purpose of the interview, a professional counselor should choose 
an appropriate interview level from among the following: (1) structured, (2) semi- 
structured, and (3) unstructured. The structured interview has established question 
formats and is often used to assess or diagnose disorders. Generally, structured inter- 
views are shown to yield more reliable results because they are able to be more accu- 
rately replicated by others and are less subject to a clinician's biases. It is unclear 
whether structured interviews yield more valid results (McReynolds, 1989). In a 
structured interview, every professional counselor asks the same set of questions in 
the same order, regardless of the examinee. Some structured interview formats are 
purposely broad in scope and function, others narrow. Erford provided an example
of a structured clinical interview of a narrow focus (see Erford, 2006, Appendix C:
Attention-Deficit/Hyperactivity Disorder [AD/HD] Brief Clinical Parent Interview
[ABCPI]). Examples of published broad-spectrum structured interviews are provided
in Table 7.2.



Table 7.2 Published structured interviews

CIDI-Core    Composite International Diagnostic Interview: Authorized Core Version 1.0
             (World Health Organization, 1993)
DIS          Diagnostic Interview Schedule (National Institute of Mental Health, 1990)
DICA-R       Diagnostic Interview for Children and Adolescents 8.0 (Reich, 1996)
DISC-IV      Diagnostic Interview Schedule for Children (Shaffer, 1996)
CAPA         Child and Adolescent Psychiatric Assessment Version 4.2, Child Version
             (Angold, Cox, Pendergast, Rutter, & Simonoff, 1996)
CAS          Child Assessment Schedule (Hodges, 1997)
K-SADS-IVR   Schedule for Affective Disorders & Schizophrenia for School-Age Children
             (Ambrosini & Dixon, 1996)
K-SADS-PL    Revised Schedule for Affective Disorders & Schizophrenia for School-Age
             Children: Present and Lifetime Version (Kaufman, Birmaher, Brent, Rao,
             & Ryan, 1996)
K-SADS-E5    Schedule for Affective Disorders & Schizophrenia for School-Age Children,
             Epidemiological Version 5 (Orvaschel, 1995)

The semi-structured interview may also have a specific question format that is 
used to assess specific mental health issues or psychological disorders. However, in 
contrast to the structured interview, a professional counselor can modify questions 
or change the order in which the questions are asked depending on a client's level of 
functioning (e.g., verbal or intellectual level) or other situational requirements 
(Sattler, 2002). Erford provided a good example of a semi-structured interview (see 
Erford, 2006; Appendix D: Semi-Structured Mental Status Examination Interview 
Protocol). Some other published semi-structured interviews are provided in Table 7.3. 

Finally, the unstructured interview has no standardized question format. An in- 
terviewer chooses questions depending on the client and situation. In order to con- 
duct an effective unstructured interview for clinical diagnostic purposes, a profes- 
sional counselor should have advanced assessment training and be able to elicit the 
client's concerns through appropriate questions. The skilled professional counselor
can use the unstructured interview as an effective tool to establish rapport and to
elicit concerns freely during an intake interview. Regardless of the type of interview
employed, it is crucial to establish rapport and to elicit necessary information
through effective verbal communication. Like any other facet of effective
counseling, facilitative skills are an essential component.



Table 7.3 Published semi-structured interviews

SCID-CV   Structured Clinical Interview for Axis I DSM-IV Disorders
          (First, Spitzer, et al., 1997)
SCID-II   Structured Clinical Interview for Axis II DSM-IV Disorders
          (First, Gibbon, et al., 1997)
PRISM     Psychiatric Research Interview for Substance and Mental Disorders
          (Hassin et al., 1996)
SCICA     Structured Clinical Interview for Children and Adolescents
          (McConaughy & Achenbach, 1994)



The Intake Interview



The purpose of the intake interview is to collect relevant information about a client's
history and background in order to quickly ascertain the effects past events may have 
on the client's current situation. Previous history often helps professional counselors 
to provide a context for current struggles, determine the longevity of symptoms, and 
tailor treatment interventions to the client's specific context. For example, a client 
presenting with a five-year history of substantial symptoms of anxiety likely will re- 
quire a different diagnostic and treatment approach than someone who has devel- 
oped substantial symptoms only during the past month. 

The major advantage of a structured intake interview is that it can be completed 
by a client prior to the first session. Then the professional counselor can peruse the 
client's responses and follow up with any details or questions concerning original 
client responses. This saves a great deal of time. Of course, the professional coun- 
selor should verify client responses and expand on them as necessary, because clients 
sometimes misunderstand the intent of a given question, or are hesitant to provide 
full disclosure; that is, some clients understandably reveal more in a person-to-per- 
son interview than on a piece of paper. Erford (2006) developed a comprehensive 
eight-page structured Client History and Background intake form that professional 
counselors will find useful. Erford (p. 8) also specified the eight key areas that make 
up a comprehensive intake interview: 

1. Demographic information: name, age, sex, marital status, race or ethnicity,
religion, socioeconomic status, occupation, and languages spoken.

2. Referral reasons: symptoms or complaints, including whether the complaint is 
likely to end up as a legal issue. 

3. Current situation: severity of the referral complaints; resiliency factors, such as
client strengths and important support figures. This area also includes changes in 
functioning as a result of the referral concern. 

4. Previous assessments and counseling experiences: what led to initiation of previous 
services, what interventions were attempted, and any outcomes of such
interventions. It is also important to determine previously offered diagnoses and medica-
tions taken to address mental and emotional issues. 

5. Birth and developmental history: circumstances of birth and delivery, timing of 
early developmental milestones, or difficulties encountered during development. 

6. Family history: composition of family of origin and current family; any educa- 
tional, medical, or psychological difficulties family members may display or have 
displayed in the past. 

7. Medical history: major injuries, surgeries, conditions or illnesses, and medica- 
tions currently taken. This area also includes the client's current medical 
status. 



8. Educational and work background: highest education completed, learning
difficulties encountered, special services received, work history, and current work
setting and satisfaction.



Think About It 7.2 Describe the importance of a thorough intake interview.
How could your ability to establish rapport and use facilitative skills influence
the intake interview, the initial session, and future counseling sessions?



Mental Status Exam



A special application of clinical interviewing that professional counselors should be- 
come proficient in is called the mental status exam (MSE). The MSE is to mental 
health practitioners what the general physical examination is to medical practition- 
ers. The MSE is a quick screening of a client's intellectual, emotional, and neurolog- 
ical functioning. In general, the MSE is a brief summary narrative of client general 
mental function and is usually conducted during the first interview. MSEs are fre- 
quently required by third-party payers (i.e., insurance companies), and the level of 
detail required varies substantially. Erford (2006), in a detailed discussion that in- 
cluded a sample Semi-Structured MSE Interview Protocol, reported that a comprehen- 
sive MSE should assess the following six areas: 

1. Appearance, attitude, and behavior: manner of dress, cleanliness, appearance, 
demographic information, occupation, physical characteristics, health, size, hear- 
ing, vision, eye contact, attitude toward examiner, attitude toward interview, 
motor functioning, behavior exhibited. 

2. Cognitive capabilities: knowledge of name, location, time, day, date; long- and 
short-term memory; serial 7s; spelling a word backwards; math problem solving; 
digit span; sentence memory; level of consciousness; concentration; capacity for 
abstract reasoning; demonstration of reading, math and writing tasks; cognitive 
functioning. 

3. Speech and language: description of speech capability; description of language ca- 
pability; repetition of phrases; read a short passage; write a short passage. 

4. Thought content and process: description of thought processes; description of 
thought content; fears or phobias. 

5. Emotional status: presenting mood, intensity, duration, fluctuations; description 
of affect, intensity, range, variability; modulation and appropriateness of affect; 
personality characteristics; emotional, physical, or behavioral problems. 

6. Insight and judgment: description of insight and judgment; responses to judg- 
ment questions; decision making regarding presenting problem, past and future 
events; defense mechanisms. 

Erford (2006, pp. 172-173) provided an example mental status exam: 

Matthew was appropriately dressed in jeans and a T-shirt. He appeared clean, 
well-groomed, and relaxed. He is a 15-year-old, English-speaking, White,
9th-grade male with normal physical features and no sign of handicaps, scars,
or other signs of self-mutilation. He is approximately 5' 8", 150 pounds, and 
his hearing and vision are normal. Matthew maintained appropriate eye con- 
tact and was cooperative and open throughout the evaluation. His motor func- 
tioning was basically normal, although he did frequently "bounce his knee" 
and adjust his posture indicating signs of overactivity. He demonstrated poor 
fine-motor coordination during writing tasks and finger-touching activities. 
He did not display aggressive, irritable, anxious, or otherwise abnormal behav- 
ior throughout the evaluation. 

Cognitively, Matthew was oriented x 5 and was able to answer basic infor- 
mation questions, including the current and former president, capital of 
Maryland, serial 7s, and simple math problems. His short-term memory and de- 
layed recall was appropriate for three objects, as was his dichotic and verbal re- 
tention. His consciousness was normal. Dysgraphia was evident and should be 
ruled out through diagnostic evaluation. He was somewhat distractible in the 
one-to-one situation, but his cognitive functioning was otherwise normal. 

Matthew's speech and language capabilities were normal in all regards. His 
thought processes were clear, appropriate, and logical, and his thought content 
was normal, devoid of phobic, obsessive, or psychotic process. Matthew's
mood was observed to be friendly, pleasant, and calm, with normal intensity and 
little fluctuation. His affect was appropriate as he was able to modulate an ap- 
propriate affective range and intensity, even when discussing emotional content. 
He admitted being oppositional and appeared ambiverted. Matthew did not re- 
port significant emotional, physical, or behavioral problems. 

Finally, Matthew's insight and judgment appeared normal, appropriate, and 
realistic. He was able to clearly describe his decision-making processes and an- 
swer questions requiring judgment. Matthew acknowledged the problems re- 
ported by parents and teachers, willingly consented to this evaluation, and was 
willing to "do whatever it takes" to address the issues. 

The mental status exam can be administered through an unstructured,
semi-structured, or structured interview format and, of course, relies heavily on ob- 
servation of attitudes, behaviors, and appearance. Use of an unstructured format re- 
quires a great deal of experience with the content and format of the mental status 
exam and basically involves asking pertinent questions from the categories specified 
above. As with any unstructured interview, the questions will vary from client to 
client and occur in no particular order, maximizing the clinician's flexibility and 
adaptability to the conditions and client responses. 

An example of a comprehensive semi-structured presentation of a mental status 
examination has been mentioned earlier and can be found in Erford (2006). An ex- 
ample of a quicker, far less comprehensive mental status exam in popular use is the 
Mini-Mental State Examination (MMSE). The MMSE is a brief, structured inter- 
view used to assess only the cognitive mental state (Folstein, Folstein, McHugh, & 
Fanjiang, 2001). The MMSE has 11 categories and takes 5 to 10 minutes to
administer. An examiner asks questions or gives instructions, and an examinee responds
one by one. For example, an examinee needs to (a) answer questions regarding time
and place; (b) repeat, memorize, or recall some words; (c) briefly calculate simple
math problems; (d) manipulate a piece of paper according to directions; and (e) copy 
a design. Summing each score (0 or 1) yields a total score, whose maximum is 30. 
Though the authors of the MMSE recommend using a total score of 26 as a cutoff 
score, a frequently used cutoff score is 23. A total score of 23 or below indicates the 
likelihood of cognitive impairment and the necessity of further evaluation (Folstein 
et al., 2001). The MMSE has been shown to produce reliable and valid scores when 
screening for cognitive impairment. An example of a structured mental status exam 
in common use is the Standardized Mini-Mental State Examination (SMMSE) (Molloy,
Alemayehu, & Roberts, 1991) (see Figure 7.1). Essentially, Molloy et al. took the
MMSE and structured its administration to increase the administrative efficiency 
and enhance the interrater and internal consistency reliability of scores. 
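
The cutoff logic described above can be sketched in a few lines. This is a minimal illustration, not a clinical tool: the function name and the returned wording are assumptions, and the "at or below the cutoff" convention is taken from the statement about a score of 23. The cutoff is left as a parameter because the text reports more than one value in use.

```python
MAX_TOTAL = 30  # maximum total score stated in the text

def screen_mmse(total: int, cutoff: int = 23) -> str:
    """Interpret a total score against a cutoff (at or below means flag)."""
    if not 0 <= total <= MAX_TOTAL:
        raise ValueError(f"total must be between 0 and {MAX_TOTAL}")
    if total <= cutoff:
        # A flagged screen signals the need for further evaluation, not a diagnosis.
        return "possible cognitive impairment; further evaluation indicated"
    return "screen does not indicate cognitive impairment"

print(screen_mmse(22))
print(screen_mmse(27))
```

Because a screening result only signals the need for further evaluation, the output is worded as a prompt for follow-up rather than a conclusion.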



Strengths and Limitations of Interviewing 



A clinical or behavioral interview allows the professional counselor great latitude in 
how to collect important information from clients and other stakeholders (e.g., par- 
ents, teachers, spouses). A lot of important information can be collected quickly and 
efficiently. However, it is good practice to validate this information and client per- 
ceptions against other information sources. Aside from the important demographic 
and historical information derived from an interview, the important point of con- 
ducting the interview is to generate and validate hypotheses, arrive at an understand- 
ing or diagnosis of the client's presenting concerns, and develop a plan of treatment
or intervention to help ameliorate the client's concerns. The interview allows for in- 
depth analysis of issues, flexibility in how the information is garnered, and instanta- 
neous clarification of ambiguous information. The interview also provides the pro- 
fessional counselor with valuable insight into what has been tried previously to 
ameliorate the client's condition, how motivated the client is to enact proposed treat- 
ment strategies, and resources that the client can draw upon to effect necessary 
changes (Erford, 2006). 

But interviewing is not without limitations. Interview responses frequently pos- 
sess lower levels of reliability and validity than more standardized inventories, al- 
though structured interviews frequently rival their counterpart inventories. 
Unstructured interviews are particularly problematic in this regard because of very 
low interrater reliability. Professional counselors using unstructured clinical inter- 
views frequently derive very different information from the interview and arrive at 
very different conclusions. More specifically, clinician bias often determines which 
questions are asked, what client responses are clarified and explored in depth, and 
what diagnosis or conclusion is arrived at. 
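
Interrater agreement of the kind discussed above is often summarized with a chance-corrected index. The sketch below computes Cohen's kappa, one common index that the text itself does not name, for two counselors who each assign one diagnosis per client. The diagnostic labels and ratings are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters on the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical conclusions from two counselors after unstructured interviews.
counselor_1 = ["MDD", "GAD", "MDD", "ODD", "GAD", "MDD"]
counselor_2 = ["MDD", "MDD", "GAD", "ODD", "GAD", "ODD"]
print(round(cohens_kappa(counselor_1, counselor_2), 2))  # 0.25
```

In this invented example the counselors agree on half the clients, yet kappa is only 0.25 once chance agreement is removed, which illustrates how raw agreement can overstate reliability.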

The clinical or behavioral interview can be an important aspect of assessing 
client problems and needs. Professional counselors must use caution when interpret- 
ing interview data, just as when interpreting the results of objective tests or projec- 
tive measures. The key to competent assessment and diagnosis is using multiple 
measures from multiple respondents, resulting in convergence of information. When 
unsure, it is always advisable to collect more information. A client deserves no less. 




Figure 7.1 Standardized Mini-Mental State Examination (SMMSE) 



I am going to ask you some questions and give you some problems to solve. Please try to answer as best as you can. 



Max Score 



1. (Allow 10 seconds for each reply) 

a) What year is this? (accept exact answer only) 1 

b) What season is this? (during last week of the old season or first week of a new season, accept 1 
either season) 

c) What month of the year is this? (on the first day of new month, or last day of the previous month, 1 
accept either) 

d) What is today's date? (accept previous or next date, e.g., on the 7th accept the 6th or 8th) 1 

e) What day of the week is this? (accept exact answer only) 1 

2. (Allow 10 seconds for each reply) 

a) What country are we in? (accept exact answer only) 1 

b) What province/state/county are we in? (accept exact answer only) 1 

c) What city/ town are we in? (accept exact answer only) 1 

d) (In clinic) What is the name of this hospital/building? (accept exact name of hospital or institution only) 1 
(In home) What is the street address of this house? (accept street name and house number or 

equivalent in rural areas) 

e) (In clinic) What floor of the building are we on? (accept exact answer only) 1 
(In home) What room are we in? 

3. I am going to name 3 objects. After I have said all three objects, I want you to repeat them. 3 
Remember what they are because I am going to ask you to name them again in a few minutes. 

(say them slowly at approximately 1 second intervals) 

Ball Car Man 

For repeated use: 

Bell Jar Fan 

Bill Tar Can 

Bull War Pan 

Please repeat the 3 items for me. (score 1 point for each correct reply on the first attempt) Allow 20 seconds 

for reply; if subject did not repeat all 3, repeat until they are learned or up to a maximum of 5 times 

4. Spell the word WORLD. (You may help the subject to spell world correctly.) Say: now spell it 5
backwards please. Allow 30 seconds to spell backwards. (If the subject cannot spell world even

with assistance, score 0.)

5. Now what were the 3 objects that I asked you to remember? 3 
Ball Car Man 

Score 1 point for each correct response regardless of order, allow 10 seconds. 

6. Show wristwatch. Ask: what is this called? Score 1 point for correct response. Accept "wristwatch" or 1
"watch". Do not accept "clock", "time", etc. (allow 10 seconds).

7. Show pencil. Ask: what is this called? Score 1 point for correct response, accept pencil only; 1
score 0 for pen.

8. I'd like you to repeat a phrase after me: "no ifs, ands, or buts." (allow 10 seconds for response. 1
Score 1 point for a correct repetition. Must be exact, e.g., "no ifs or buts" scores 0)




9. Read the words on this page and then do what it says: Hand subject the laminated sheet with CLOSE 1 

YOUR EYES on it. 

CLOSE YOUR EYES. 
If subject just reads and does not then close eyes — you may repeat: read the words on this page and then 
do what it says to a maximum of 3 times. Allow 10 seconds, score 1 point only if subject closes eyes. 
Subject does not have to read aloud. 

10. Ask if the subject is right or left handed. Alternate right/left hand in statement, e.g., if the subject is 3 
right handed, say Take this paper in your left hand . . . Take a piece of paper — hold it up in front of 

subject and say the following: 

"Take this paper in your right/left hand, fold the paper in half once with both hands, and put the 
paper down on the floor." 

Takes paper in correct hand 

Folds it in half 

Puts it on the floor 

Allow 30 seconds. Score 1 point for each instruction correctly executed. 

11. Hand subject a pencil and paper. Write any complete sentence on that piece of paper. 1 
Allow 30 seconds. Score 1 point. The sentence should make sense. Ignore spelling errors. 

12. Place design, pencil, eraser and paper in front of the subject. Say: copy this design please. Allow 1 
multiple tries until patient is finished and hands it back. Score 1 point for correctly copied diagram. 

The subject must have drawn a 4-sided figure between two 5-sided figures. Maximum time — 1 minute. 

Total Test Score 30 

Source: From D. W. Molloy, E. Alemayehu, and R. Roberts, "Reliability of a Standardized Mini-Mental State Examination compared with the 
traditional mini-mental examination." American Journal of Psychiatry, January 1991; 148, 102-105. Copyright © 1991 American Psychiatric
Association. 
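
The item maximums listed in Figure 7.1 can be tallied to confirm the stated total and to check a completed protocol for out-of-range entries. The sketch below is a bookkeeping aid under stated assumptions: the dictionary layout, the function name, and the sample protocol are hypothetical, and only the per-item maximums are taken from the figure.

```python
# Maximum score per SMMSE item, as listed in Figure 7.1.
SMMSE_MAX = {1: 5, 2: 5, 3: 3, 4: 5, 5: 3, 6: 1, 7: 1, 8: 1, 9: 1, 10: 3, 11: 1, 12: 1}

def total_score(item_scores: dict) -> int:
    """Sum item scores after checking that every item is present and in range."""
    if set(item_scores) != set(SMMSE_MAX):
        raise ValueError("a score is required for each of items 1 through 12")
    for item, score in item_scores.items():
        if not 0 <= score <= SMMSE_MAX[item]:
            raise ValueError(f"item {item} must be between 0 and {SMMSE_MAX[item]}")
    return sum(item_scores.values())

# Hypothetical protocol: full credit except two points lost on item 4 and one on item 5.
protocol = dict(SMMSE_MAX)
protocol[4] = 3
protocol[5] = 2

print(sum(SMMSE_MAX.values()))  # 30
print(total_score(protocol))    # 27
```

Validating each entry against its item maximum catches transcription slips, such as recording 4 points on a 3-point item, before the total is interpreted.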



COUNSELING, DIAGNOSIS, AND THE DSM-IV-TR 

The roots and tradition of counseling lie in vocational guidance and human devel- 
opment (Herr, 1998). However, recent societal and mental health practices have 
given rise to a mental health role for professional counselors regardless of work set- 
ting. Mental health counselors, substance abuse counselors, marriage and family 
counselors, geriatric counselors, and community counselors provide mental health 
counseling in clinics, agencies, and private practice in numerous states around the 
country — and in numerous countries around the world. Even professional school 
counselors and career counselors, two professions that have maintained the closest 
ties to counseling's developmental roots and that seldom view clinical diagnosis as a 
part of their job functions, provide treatment to clients or students who have been 
(or could be) diagnosed with mental or emotional disorders. 

Mental and emotional disorders are becoming more prevalent in society, partic- 
ularly among children and adolescents, and professional counselors must be knowl- 
edgeable about diagnosis and clinical assessment in order to gain respect and parity
in the mental health community. A review of the extant literature finds numerous ex-
amples of increased need for clinical diagnostic and treatment services, a need that 
contemporary professional counselors are helping to meet. In any given year, serious 
mental illness can be diagnosed in about 5-7% of an adult population (New 
Freedom Commission on Mental Health, 2003). Diagnosable mental and emotional 
disorders significant enough to warrant treatment can be found in 15-22% of 
school-aged students (SAMHSA, 1998), but only about one in five of these impaired 
students actually gets help. Clients with serious mental health concerns seeking help 
at university counseling centers are increasing (Pledge, Lapan, Heppner, Kivlighan & 
Roehlke, 1998). Substance abuse, poverty, and community and domestic violence 
are on the rise (Dryfoos, 1994; Lockhart & Keys, 1998). Various estimates of de- 
pression among adolescents include 3 to 6 million students (American Psychiatric 
Association, 1994) or nearly 18% (Essau, Condradt, & Peterman, 2000). On a re- 
lated note, 10,000 to 20,000 adolescents attempt suicide, while more than 2,000 
adolescents commit suicide annually (Brown, 1996). This makes suicide the second 
leading cause of death among adolescents. Diagnosis of childhood disorders requires 
a great deal of improvement as certain common disorders (e.g., AD/HD) appear to
be overdiagnosed in childhood (McClure, Kubiszyn, & Kaslow, 2002), quite a feat 
given that community prevalence estimates indicate that perhaps 50% of children 
and adolescents referred to mental health clinics can be diagnosed with behavior dis- 
orders, including Conduct Disorder and AD/HD (Erk, 1995). 
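
To see what the school-age figures above imply for a single building, the arithmetic can be worked for a hypothetical enrollment. The enrollment of 1,000 and the function name are assumptions made for illustration. The 15-22% prevalence range and the one-in-five help rate are the figures quoted in the paragraph.

```python
def unserved_range(enrollment: int, low: float = 0.15, high: float = 0.22,
                   helped: float = 0.20) -> tuple:
    """Estimated range of students with a diagnosable disorder who get no help."""
    return tuple(round(enrollment * rate * (1 - helped)) for rate in (low, high))

# Hypothetical school of 1,000 students.
low_estimate, high_estimate = unserved_range(1000)
print(low_estimate, high_estimate)  # 120 176
```

Under these assumptions, roughly 120 to 176 students in such a school would have a treatable condition and receive no services.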

While the above statistics paint a picture of a tremendous societal need for clin- 
ical services, they also underscore the necessity of high-level training in diagnosis and 
treatment of mental and emotional disorders. Nearly all clinical decisions, whether 
diagnostic or treatment related, are predicated on informal or formal assessment pro- 
cedures. Thus the more one consciously integrates assessment procedures and
outcomes research into one's practice, the more objective and informed one's practice
becomes. The mental health role of the professional counselor is here to stay; diag- 
nosis and use of the DSM is becoming a necessary part of training for all clinicians 
(Seligman, 1998), just as the International Classification of Diseases — Tenth Revision 
(ICD-10) is used in the health professions. 

The usefulness of diagnostic systems is widely debated (see Murphy and 
Davidshofer, 2001). The fact of the matter is that insurance companies and employ- 
ers are requiring competence in diagnosis as a condition for payment or employ- 
ment, and state licensing agencies are increasingly requiring coursework and train- 
ing in clinical diagnosis to obtain licensure (Hohensil, 1993; 1996). In the mental 
health arena, the diagnostic resource most commonly used by psychiatrists, psychol- 
ogists, social workers, and professional counselors is the Diagnostic and Statistical 
Manual of Mental Disorders — Fourth Edition — Text Revision (DSM-IV-TR) (APA, 
2000). In fact, a recent survey found that 91% of mental health counselors used the 
DSM (Mead, Hohensil, & Singh, 1997). 

The DSM-IV-TR provides specific criteria through which reliable diagnoses can 
be made. It also provides a nomenclature, or common language, through which men- 
tal health professionals can communicate with each other to describe (not label) a 
client's condition. Such diagnostic language has the purpose of succinctly
communicating categorical mental conditions so that common symptoms may be indicated and
commonly agreed-upon treatments may ensue. Such a categorical reference is
necessary to help organize the diagnostic and treatment outcome literature. For example,
to move a field forward, it is essential for all clinicians, educators, and researchers to 
know exactly what is meant by the term Major Depressive Disorder so that all
resources aimed at understanding the identification, treatment alternatives, and
treatment outcomes of this disorder can be focused most efficiently. The DSM-IV-TR
provides this common language. Even if some professional counselors (e.g., professional
school counselors and career counselors) do not make diagnoses in their work settings, 
understanding what, for example, Major Depressive Disorder entails is essential for 
proper assessment, referral, and facilitation or coordination of treatment. For exam- 
ple, would a professional school counselor interviewing the mother of a 7-year-old 
who complains of her son's problems with disobedience, defiance, and negativity be 
serving the best interest of the student or family if he or she were unfamiliar with the 
term Oppositional Defiant Disorder (ODD)? An awareness of the diagnostic criteria
for ODD would streamline the assessment process and allow for efficient referral or 
treatment. A working knowledge of the DSM-IV-TR makes any professional
counselor more efficient and valuable. While there is no substitute for a careful perusal of
the DSM-IV-TR, the remainder of this chapter briefly reviews the multiaxial
assessment system of the DSM-IV-TR, major diagnostic categories, and several instruments
that are particularly helpful in the clinical assessment process. 



Using the DSM-IV-TR: Multiaxial Diagnosis



The DSM-IV-TR (APA, 2000) is the latest in a series of diagnostic resource guides. 
The DSM-IV-TR is a text revision of the DSM-IV (APA, 1994), with editorial 
changes primarily to the information supplied in the text, rather than to the diagnos- 
tic criteria sets for the specified disorders. The DSM-IV-TR describes nearly 300 di- 
agnostic categories that enable mental health professionals to diagnose, treat, re- 
search, and efficiently discuss mental and emotional disorders. 

The diagnostic process calls for a multiaxial classification system to describe the 
condition of the client. Five axes, or different facets, are included: 

■ Axis I — Clinical disorders and other conditions that may be a focus of clinical 
attention 

■ Axis II — Personality disorders and mental retardation 

■ Axis III — General medical conditions 

■ Axis IV — Psychosocial and environmental problems 

■ Axis V — Global assessment of functioning 

The systematic multiaxial approach provides a shorthand notation of a compre- 
hensive process, conveying a tremendous amount of information about the current 
mental status of a client, including mental disorders, concurrent medical issues, and 
adaptive functioning. APA (2000, p. xxxi) defines a mental disorder as a 

clinically significant behavioral or psychological syndrome or pattern that occurs
in an individual and that is associated with present distress (e.g., a painful
symptom) or disability (i.e., impairment in one or more areas of functioning) or with
a significantly increased risk of suffering death, pain, disability, or an important
loss of freedom.

Axes I and II include the mental disorders that make up the classification sys- 
tem. Axis II includes personality disorders and mental retardation, while Axis I is 
used to document the existence of all other mental disorders. The behavioral effects 
of physical and medical disorders are listed on Axis III. The listing of occupational, 
familial, financial, legal, and other social and emotional effects is noted on Axis IV. 
And the professional counselor's assessment of how well the client is, or has been, 
adapting to the stresses of everyday life is recorded on Axis V. 

The DSM-IV-TR provides comprehensive information about mental disorders 
by describing essential diagnostic features, associated features and disorders, specific 
age and gender features, prevalence, course of the disorder, familial pattern, and dif- 
ferential diagnosis. Most importantly, the diagnostic code and criteria for each dis- 
order are provided. These criteria enhance the reliability and validity of the diagnos- 
tic system by providing specific descriptions of symptoms and conditions relevant to 
diagnosis. The criteria are meant to be so specific that, regardless of the clinician as- 
sessing the client, a similar diagnostic outcome should emerge. As examples, Table 
7.4 contains the diagnostic criteria for Posttraumatic Stress Disorder (PTSD) (APA, 
2000, pp. 467-468) and Table 7.5 for Attention-Deficit Hyperactivity Disorder — 
Combined Type (AD/HD) (APA, 2000, p. 92; symptom criteria only). 

Note how the specificity of the criteria allows for clinicians to reliably determine 
whether the disorder applies to a given client. This allows numerous clinicians as- 
sessing the same client to arrive at a consistent determination as to whether a client 
meets the specified diagnostic criteria. Accurate diagnosis occurs to a large extent be- 
cause professional counselors ask specific questions about client symptoms as neces- 
sary. It is better to ask a specific question or seek information of a specific nature and 
receive a negative reply than to not ask and therefore not know whether a client pres- 
ents with a given disorder. Clinical diagnosis is a process in which it is generally good 
advice and good practice to leave no stone unturned.

It is essential that professional counselors adhere closely to the diagnostic crite- 
ria provided in the DSM-IV-TR, as short- and long-term damage to clients can re- 
sult from misdiagnosis. In the short term, misdiagnosis can cause a client to receive 
an inappropriate treatment and accrue unnecessary expense and wasted time. In the 
long term, an incorrect diagnosis can follow a client, as insurance companies and 
healthcare professionals may make future decisions about treatment based on faulty 
past information. These entities also may not always keep such private information 
confidential. 

The remainder of this chapter provides an orientation to diagnosis and classifi- 
cation using the multiaxial framework. Professional counselors wanting additional 
training and practice with clinical diagnosis are encouraged to take graduate
coursework in which the DSM-IV-TR diagnostic system is prominently featured and
supervised training is provided. In addition, other text resources are available, including
the DSM-IV Casebook (Spitzer, Gibbon, Skodol, Williams, & First, 1994) and the
DSM-IV Guide (Frances, First, & Pincus, 1995).






Table 7.4 Diagnostic criteria for Posttraumatic Stress Disorder (PTSD) 

A. The person has been exposed to a traumatic event in which both of the following were 
present: 

(1) the person experienced, witnessed, or was confronted with an event or events that 
involved actual or threatened death or serious injury, or a threat to the physical integrity 
of self or others 

(2) the person's response involved intense fear, helplessness, or horror. Note: In children,
this may be expressed instead by disorganized or agitated behavior

B. The traumatic event is persistently reexperienced in one (or more) of the following ways: 

(1) recurrent and intrusive distressing recollections of the event, including images, 
thoughts, or perceptions. Note: In young children, repetitive play may occur in which 
themes or aspects of the trauma are expressed 

(2) recurrent distressing dreams of the event. Note: In children, there may be frightening 
dreams without recognizable content 

(3) acting or feeling as if the traumatic event were recurring (includes a sense of reliving the 
experience, illusions, hallucinations, and dissociative flashback episodes, including those 
that occur on awakening or when intoxicated). Note: In young children, trauma-specific
reenactment may occur 

(4) intense psychological distress at exposure to internal or external cues that symbolize or 
resemble an aspect of the traumatic event 

(5) physiological reactivity on exposure to internal or external cues that symbolize or 
resemble an aspect of the traumatic event 

C. Persistent avoidance of stimuli associated with the trauma and numbing of general respon- 
siveness (not present before the trauma), as indicated by three (or more) of the following: 

(1) efforts to avoid thoughts, feelings, or conversations associated with the trauma 

(2) efforts to avoid activities, places, or people that arouse recollections of the trauma 

(3) inability to recall an important aspect of the trauma 

(4) markedly diminished interest or participation in significant activities 

(5) feeling of detachment or estrangement from others 

(6) restricted range of affect (e.g., unable to have loving feelings)

(7) sense of foreshortened future (e.g., does not expect to have career, marriage, children, or 
a normal lifespan) 

D. Persistent symptoms of increased arousal (not present before the trauma), as indicated by 
two (or more) of the following: 

(1) difficulty falling or staying asleep 

(2) irritability or outbursts of anger 

(3) difficulty concentrating 

(4) hypervigilance 

(5) exaggerated startle response 

E. Duration of the disturbance (symptoms in Criteria B, C, and D) is more than 1 month. 

F. The disturbance causes clinically significant distress or impairment in social, occupational, 
or other important areas of functioning. 

Specify if: 

Acute: if duration of symptoms is less than 3 months 

Chronic: if duration of symptoms is 3 months or more

Specify if:

With Delayed Onset: if onset of symptoms is at least 6 months after the stressor 

Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders, (4th ed., 
text rev.), American Psychiatric Association. Copyright 2000, Washington, DC: Author. 
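
The counting rules in Table 7.4 can be sketched in code to show how the thresholds combine. The sketch below is an illustration only, not a diagnostic tool: the class and function names are hypothetical, and it assumes the clinician has already judged which individual items are present and simply records the counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PtsdObservations:
    """Clinician-recorded findings for Table 7.4 (hypothetical structure)."""
    exposed_to_traumatic_event: bool         # Criterion A(1)
    intense_fear_helplessness_horror: bool   # Criterion A(2)
    reexperiencing_count: int                # Criterion B items present (0-5)
    avoidance_numbing_count: int             # Criterion C items present (0-7)
    arousal_count: int                       # Criterion D items present (0-5)
    duration_months: float                   # Criterion E
    distress_or_impairment: bool             # Criterion F
    onset_months_after_stressor: float       # used for the onset specifier


def meets_table_7_4_thresholds(obs: PtsdObservations) -> bool:
    """Return True only when every criterion threshold in Table 7.4 is reached."""
    return (
        obs.exposed_to_traumatic_event
        and obs.intense_fear_helplessness_horror   # A: both parts
        and obs.reexperiencing_count >= 1          # B: one or more
        and obs.avoidance_numbing_count >= 3       # C: three or more
        and obs.arousal_count >= 2                 # D: two or more
        and obs.duration_months > 1                # E: more than 1 month
        and obs.distress_or_impairment             # F
    )


def course_specifiers(obs: PtsdObservations) -> List[str]:
    """Apply the 'Specify if' rules at the bottom of Table 7.4."""
    labels = ["Acute" if obs.duration_months < 3 else "Chronic"]
    if obs.onset_months_after_stressor >= 6:
        labels.append("With Delayed Onset")
    return labels
```

A single unmet threshold (for example, only two Criterion C items) is enough for the check to fail, which mirrors why specific questioning about each criterion matters.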



Table 7.5 Diagnostic criteria for Attention-Deficit Hyperactivity Disorder,
Combined Type (inattentive and hyperactive-impulsive symptoms only)



A. Either (1) or (2): 

(1) six (or more) of the following symptoms of inattention have persisted for at least 6 
months to a degree that is maladaptive and inconsistent with developmental level: 

Inattention 

(a) often fails to give close attention to details or makes careless mistakes in 
schoolwork, work, or other activities 

(b) often has difficulties sustaining attention in tasks and play activities 

(c) often does not seem to listen when spoken to directly 

(d) often does not follow through on instructions and fails to finish schoolwork, 
chores, or duties in the workplace (not due to oppositional behavior or failure to 
understand instructions) 

(e) often has difficulty organizing tasks and activities 

(f) often avoids, dislikes, or is reluctant to engage in tasks that require sustained mental 
effort (such as schoolwork or homework) 

(g) often loses things necessary for tasks or activities (e.g., toys, school assignments, 
pencils, books, or tools) 

(h) is often easily distracted by extraneous stimuli

(i) is often forgetful in daily activities

(2) six (or more) of the following symptoms of hyperactivity-impulsivity have persisted for 
at least 6 months to a degree that is maladaptive and inconsistent with developmental 
level: 

Hyperactivity 

(a) often fidgets with hands or feet or squirms in seat 

(b) often leaves seat in classroom or in other situations in which remaining seated is 
expected 

(c) often runs about or climbs excessively in situations in which it is inappropriate (in 
adolescents or adults, may be limited to subjective feelings of restlessness) 

(d) often has difficulty playing or engaging in leisure time activities quietly 

(e) is often "on the go" or acts as if "driven by a motor" 

(f) often talks excessively 

Impulsivity 

(g) often blurts out answers before questions have been completed

(h) often has difficulty awaiting turn

(i) often interrupts or intrudes on others (e.g., butts into conversations or games) 

Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders, (4th ed., 
text rev.), American Psychiatric Association. Copyright 2000, Washington, DC: Author.
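
The symptom thresholds in Table 7.5 reduce to two simple counts. The following sketch, with hypothetical names, shows the arithmetic: each symptom set has nine items, six or more must be present, and the symptoms must have persisted for at least 6 months. It covers the symptom criteria only, as the table does, and it assumes that the Combined Type label in the table title implies both sets reach threshold.

```python
SYMPTOMS_PER_SET = 9   # items (a) through (i) in each list of Table 7.5
THRESHOLD = 6          # "six (or more)"
MIN_MONTHS = 6         # "persisted for at least 6 months"


def symptom_sets_met(inattention_count: int,
                     hyperactivity_impulsivity_count: int,
                     months_persisting: float) -> dict:
    """Report which symptom sets in Table 7.5 reach the six-symptom threshold."""
    for count in (inattention_count, hyperactivity_impulsivity_count):
        if not 0 <= count <= SYMPTOMS_PER_SET:
            raise ValueError("each count must be between 0 and 9")
    persisted = months_persisting >= MIN_MONTHS
    return {
        "inattention": persisted and inattention_count >= THRESHOLD,
        "hyperactivity_impulsivity": (
            persisted and hyperactivity_impulsivity_count >= THRESHOLD
        ),
    }


# Hypothetical ratings: seven inattention items, six hyperactivity-impulsivity
# items, present for eight months.
result = symptom_sets_met(7, 6, 8)
both_sets_met = all(result.values())  # assumed reading of "Combined Type"
```

Whether the behaviors are "maladaptive and inconsistent with developmental level" remains a clinical judgment that no count can replace.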



Axis I Disorders-Clinical Disorders and Other Conditions 
That May Be a Focus of Clinical Attention 



Axis I disorders include all of the disorders from the DSM-IV-TR except for mental
retardation and personality disorders (see Table 7.6). It is essential to understand
from the outset that a minority of clients actually enter the clinical arena with only
a single well-defined problem. It is common for a client to obtain multiple diagnoses
on Axis I and/or Axis II, referred to as comorbidity. Clark, Watson, and Reynolds
(1995) found that 60-80% of clients present with comorbidity, while only about
20-40% present with a single diagnosis. This reality makes diagnosis of the
typical client somewhat complicated. Therefore, professional counselors must start by
looking at the big picture of all characteristics and symptoms, then refine the
questioning to arrive at more specific categorical decisions. This diagnostic
decision-making process is explained in more detail at the end of this chapter. Sometimes a client
may not meet all criteria for a given disorder, so each Axis I disorder allows for the
designation "Not Otherwise Specified" (NOS) to be used; however, this designation
should be used with caution because it may lead to misdiagnosis and inappropriate
treatment if misused.

Table 7.6 DSM-IV-TR Axis I clinical disorders and other conditions
that may be a focus of clinical attention

1. Disorders usually first diagnosed in infancy, childhood, or adolescence
2. Delirium, dementia, and amnestic and other cognitive disorders
3. Mental disorders due to a general medical condition
4. Substance-related disorders
5. Schizophrenia and other psychotic disorders
6. Mood disorders
7. Anxiety disorders
8. Somatoform disorders
9. Factitious disorders
10. Dissociative disorders
11. Sexual and gender identity disorders
12. Eating disorders
13. Sleep disorders
14. Impulse-control disorders not elsewhere classified
15. Adjustment disorders
16. Other conditions that may be a focus of clinical attention

Report all applicable disorders on Axis I, specifying the primary diagnosis by 
listing it first and designating that it was the difficulty that prompted the office visit 
(in an outpatient setting, state "reason for visit") or inpatient stay (state "principal
diagnosis"). Finally, severity and course specifiers may follow the disorder to denote its
current status. These specifiers include Mild, Moderate, Severe, In
Partial Remission, In Full Remission, and Prior History (APA, 2000). Each of these
is explained in detail in the DSM-IV-TR. For example, Severe is described as "many symptoms in excess
of those required to make the diagnosis, or several symptoms that are particularly se- 
vere, are present, or the symptoms result in marked impairment in social or occupa- 
tional functioning" (p. 2). 

Numerous other conditions are included in the DSM-IV-TR that present with
clinical relevance deserving of attention, but are not considered mental disorders.
Many of these more developmental conditions are referred to as "V-Codes" and all
are coded on Axis I (except Borderline Intellectual Functioning). Fortunately, most
of the conditions have titles that are self-explanatory, so rather than expanding on
each, we present all of these conditions in Table 7.7.

Table 7.7 Other conditions that may be the focus of clinical attention

Psychological factors affecting medical conditions
  Mental disorders
  Psychological symptoms
  Personality traits or coping style
  Maladaptive health behaviors
  Stress-related physiological response

Medication-induced movement disorders
  Neuroleptic-induced
    Parkinsonism
    Malignant syndrome
    Acute dystonia
    Acute akathisia
    Tardive dyskinesia
  Medication-induced postural tremor

Other medication-induced disorder
  Adverse effects of medication NOS

Relational problems
  Relational problem related to a mental disorder or general medical condition
  Parent-child relational problem
  Partner relational problem
  Sibling relational problem

Problems related to abuse or neglect
  Physical abuse of child
  Sexual abuse of child
  Neglect of child
  Physical abuse of adult
  Sexual abuse of adult

Additional conditions that may be a focus of clinical attention
  Noncompliance with treatment
  Malingering
  Adult antisocial behavior
  Child or adolescent antisocial behavior
  Borderline intellectual functioning
  Age-related cognitive decline
  Bereavement
  Academic problem
  Occupational problem
  Identity problem, religious or spiritual problem
  Acculturation problem
  Phase-of-life problem






Axis II Disorders-Personality Disorders and Mental Retardation 



Axis II disorders are inflexible and enduring conditions that cause significant impair- 
ment in social, occupational, academic, or other adaptive functioning. While most 
clients will seek or be referred for treatment because of more acute problems or men- 
tal disorders on Axis I, Axis II disorders may also be present, though not necessarily 
responsible for prompting the referral. Personality disorders also often exacerbate 
Axis I conditions. Importantly, clients presenting with Axis II disorders are fre- 
quently less capable of accurate symptom self-report. This, coupled with generally 
less precise diagnostic criteria, makes diagnosis of personality disorders a challenging 
endeavor (Fong, 1995). Axis II disorders include mental retardation and personality 
disorders. 

Personality disorders have been categorized according to the following clusters: 

■ Cluster A: Paranoid Personality Disorder, Schizoid Personality Disorder, 
Schizotypal Personality Disorder 

■ Cluster B: Antisocial Personality Disorder, Borderline Personality Disorder, 
Histrionic Personality Disorder, Narcissistic Personality Disorder 

■ Cluster C: Avoidant Personality Disorder, Dependent Personality Disorder, 
Obsessive-Compulsive Personality Disorder 

Such a clustering scheme does not preclude an individual from having
co-occurring personality disorders across two or more clusters. In addition, the
DSM-IV-TR allows diagnosis of Personality Disorder NOS for individuals who display
characteristics of one or more personality disorders but do not fulfill all specific
criteria in a given classification.



Axis III-Current Medical Conditions



Axis III is utilized for the report of current general medical conditions of potential
relevance to a client's current mental disorders or conditions and treatment (APA,
2000). If a mental disorder is judged to be a direct physiological result of a general
medical condition, then Mental Disorder Due to a General Medical Condition (e.g.,
Personality Change Due to a General Medical Condition) should be listed on Axis I,
with the general medical condition noted on both Axis I and Axis III. For example, if a
client presents with depressive symptoms that are believed to result from the client's
hypothyroidism, the Axis I diagnosis should be Mood Disorder Due to
Hypothyroidism, With Depressive Features, and Hypothyroidism should again be
included on Axis III. Axis III also allows description of medical conditions that are
not the direct cause of a mental disorder, but which must be considered when planning
a client's treatment. The general medical conditions used on Axis III are those not
included in the chapter on Mental Disorders in the International Classification of
Diseases (ICD-9-CM) and are important to include in a multiaxial diagnosis because
these conditions may affect a managed care organization's decision to continue
treatment. If no Axis III diagnosis is evident, clinicians should provide the
designation "None." If the Axis III diagnosis will be made pending further evaluation,
clinicians should provide the designation "Deferred."

Table 7.8 Categories of psychosocial and environmental problems

Problems with primary support
Problems related to the social environment
Educational problems
Occupational problems
Housing problems
Economic problems
Problems with access to healthcare services
Problems related to interaction with the legal system or crime
Other psychosocial and environmental problems



Axis IV-Psychosocial and Environmental Problems 



Axis IV is used to report environmental and psychosocial problems that may be in- 
fluencing diagnosis, treatment planning, and eventual prognosis of a client's mental 
disorder(s). Examples include the death or loss of a family member, close friend, or 
job; estrangement, separation, or divorce; academic problems; poverty, homelessness, 
or inadequate healthcare. For convenience, Table 7.8 lists the common categorical 
designations included in the DSM-IV-TR (APA, 2000). While these problems are 
typically listed on Axis IV, if these problems constitute the reason the client is seek- 
ing treatment, it is appropriate to list them on Axis I while specifying "Other 
Conditions That May Be the Focus of Clinical Attention." 



Axis V-Global Assessment of Functioning (GAF)



Axis V allows the clinician to provide an assessment of the client's overall level of
functioning, using what APA (2000) refers to as the Global Assessment of Functioning
(GAF). This assessment reflects one's professional judgment and is useful in treatment
planning and outcome assessment. The GAF indicates a client's current level of
functioning unless otherwise noted; at times, the clinician may want to indicate the client's
highest level of overall functioning during the past three months or even the previous
year. The GAF should not reflect the client's physical or environmental
problems or limitations, only the client's functioning in the social, occupational, or
psychological areas. Reported as "GAF = ###" on Axis V, the GAF scale ranges from
0 to 100 and is subdivided into ten 10-point ranges. The higher the GAF, the
higher the client's level of functioning. Table 7.9 contains the GAF scale descriptors
(APA, 2000). Each is explained in greater detail in the DSM-IV-TR. For example, a GAF
between 41 and 50 indicates "Serious symptoms (e.g., suicidal ideation, severe
obsessional rituals, frequent shoplifting) or any serious impairment in social,
occupational, or school functioning (e.g., no friends, unable to keep a job)," while a
GAF between 51 and 60 indicates "Moderate symptoms (e.g., flat affect and
circumstantial speech, occasional panic attacks) or moderate difficulty in social,
occupational, or school functioning (e.g., few friends, conflicts with peers or
co-workers)" (p. 34).

Table 7.9 Global Assessment of Functioning (GAF) designations

91-100 Superior functioning
81-90 Absent or minimal symptoms
71-80 Transient and expectable reactions
61-70 Mild symptoms
51-60 Moderate symptoms
41-50 Serious symptoms
31-40 Some impairments in reality testing or communication
21-30 Delusions, hallucinations, or serious impairment in judgment
11-20 Some danger to self or others
1-10 Persistent danger to self or others
0 Inadequate information

Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders, (4th ed.,
text rev.), p. 34. American Psychiatric Association. Copyright 2000, Washington, DC: Author.
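
Because the GAF scale is a set of fixed 10-point bands, the lookup from a score to its Table 7.9 designation can be sketched in a few lines. The function name below is hypothetical, and the sketch assumes that a score of 0 is reserved for inadequate information, with 1 to 100 covering the ten bands.

```python
# (low, high, designation) for each band listed in Table 7.9
GAF_BANDS = [
    (91, 100, "Superior functioning"),
    (81, 90, "Absent or minimal symptoms"),
    (71, 80, "Transient and expectable reactions"),
    (61, 70, "Mild symptoms"),
    (51, 60, "Moderate symptoms"),
    (41, 50, "Serious symptoms"),
    (31, 40, "Some impairments in reality testing or communication"),
    (21, 30, "Delusions, hallucinations, or serious impairment in judgment"),
    (11, 20, "Some danger to self or others"),
    (1, 10, "Persistent danger to self or others"),
]


def gaf_designation(score: int) -> str:
    """Map a GAF score to the short designation used in Table 7.9."""
    if score == 0:
        return "Inadequate information"
    for low, high, label in GAF_BANDS:
        if low <= score <= high:
            return label
    raise ValueError("GAF scores run from 0 to 100")
```

For instance, `gaf_designation(45)` falls in the 41-50 band and `gaf_designation(55)` in the 51-60 band, matching the two ranges quoted in the text. The lookup only labels a score; choosing the score is still a matter of professional judgment.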

For instances in which a clinician might wish to separately assess individual 
components of functioning, rather than an overall level, APA (2000) provides a 
Social and Occupational Functioning Assessment Scale (SOFAS), a Global 
Assessment of Relational Functioning (GARF), and a Defensive Functioning Scale 
(DFS). 



Diagnostic Decision Making Using the DSM-IV-TR 



The five axes reviewed above can be combined to construct a systematic and com- 
prehensive DSM-IV-TR multiaxial assessment system (APA, 2000) that describes a 
client's mental disorder(s), medical condition(s), environmental and psychosocial 
factors, and overall level of functioning. The multiaxial system is designed to pro- 
vide organized, substantive communication about complex diagnostic situations. 
Professional counselors are encouraged to provide the complete five-axial diagnosis 
for every client in order to effectively communicate the diagnosis to other profes- 
sionals and plan an effective treatment regimen (Fong, 1995). 

Multiaxial diagnosis is a complicated process, and mastery requires substantial 
education, training, and practice under supervision. While master clinicians can 
sometimes reach reliable and accurate diagnostic decisions based on clinical experi- 
ence, many clinicians find it helpful to use a structured decision-making process. 
Figure 7.2 presents a structured process clinicians may find helpful in guiding diag- 
nostic decision making. This flow chart guides the clinician through a process in 
which very general questions can lead to deeper examination using the decision trees 
provided in the DSM-IV-TR. For example, consider the case of an adult undergoing
a stressful divorce and employed by a company undergoing downsizing who presents
with symptoms of depression. These symptoms have been occurring for about four 
weeks and have led to intense feelings of hopelessness, weight loss, and insomnia. 
On the flow chart, this case would be tracked through the Axis I Disorders category 
and pursued with the DSM-IV-TR decision tree for differential diagnosis of Mood 
Disorders, eventually resulting in a probable diagnosis of Major Depressive Disorder, 
Single Episode (assuming this was the first time depressive symptoms were displayed 
to this degree). If the client has no enduring personality disorders or complicating 
medical conditions, this client's symptoms may result in the following multiaxial 
diagnosis: 

■ Axis I 296.22 Major Depressive Disorder, Single Episode, Moderate Without Psychotic Features

■ Axis II None 

■ Axis III None 

■ Axis IV Disruption of family by divorce, threat of job loss 

■ Axis V GAF = 55 (current) 

Note that in the example above, the initial question involved whether a client's 
symptoms constituted a possible mental disorder. If the answer had been no, the 
process would have stopped right there because the DSM-IV-TR is helpful only in di- 
agnosing mental disorders, and no diagnosis would have been warranted. Also, note 
that the depressive symptoms were relatively recent and acute, not enduring, persist- 
ent, and inflexible. Thus it was judged that a Personality Disorder (or Mental 
Retardation) was not evident and that the condition was likely a mental disorder lo- 
cated on Axis I. If a Personality Disorder was indicated, exploration of these disor- 
ders would commence, followed by a return to consideration of Axis I disorders to 
address the more acute symptoms. When pursuing an Axis I diagnosis, the clinician 
needs to address each query in the remainder of the flow chart (and subsequent de- 
cision trees, if necessary) to ensure a comprehensive diagnosis. As mentioned above, 
most clients present with more than one condition, so the experienced clinician ap- 
proaches each client's diagnosis with an eye toward "leaving no stone unturned." 
While this complicates the diagnostic process, a comprehensive diagnosis generally 
improves the prospects for treatment, because multimodal treatment strategies can 
be undertaken to address all areas of concern. Note that the DSM-IV-TR has numer- 
ous other mental disorders that are not accounted for by the flow chart until the final 
"catchall" box on the decision tree. The burden for comprehensive diagnostic work 
always relies on the competence and experience of the clinician. For further discus- 
sions and applications of multiaxial diagnosis, the reader is referred to the DSM-IV 
Casebook (Spitzer et al., 1994). 
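
For counselors who keep electronic case notes, the five-axis diagnosis for the divorce case above could be recorded in a small data structure. This is a minimal sketch under stated assumptions: the class, its field names, and the report layout are hypothetical and not part of the DSM-IV-TR. It follows two conventions described in this chapter: the first Axis I entry is treated as the primary diagnosis, and an empty axis is reported as "None."

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MultiaxialDiagnosis:
    """One client's five-axis summary (hypothetical record layout)."""
    axis_i: List[str]                  # primary diagnosis listed first
    axis_v_gaf: int                    # Global Assessment of Functioning score
    axis_ii: List[str] = field(default_factory=list)
    axis_iii: List[str] = field(default_factory=list)
    axis_iv: List[str] = field(default_factory=list)
    gaf_timeframe: str = "current"

    def report(self) -> str:
        """Render the five axes as text, writing 'None' for an empty axis."""
        def show(entries: List[str]) -> str:
            return "; ".join(entries) if entries else "None"

        return "\n".join([
            f"Axis I: {show(self.axis_i)}",
            f"Axis II: {show(self.axis_ii)}",
            f"Axis III: {show(self.axis_iii)}",
            f"Axis IV: {show(self.axis_iv)}",
            f"Axis V: GAF = {self.axis_v_gaf} ({self.gaf_timeframe})",
        ])


# The worked example from the text
case = MultiaxialDiagnosis(
    axis_i=["296.22 Major Depressive Disorder, Single Episode, Moderate"],
    axis_iv=["Disruption of family by divorce", "threat of job loss"],
    axis_v_gaf=55,
)
summary = case.report()
```

A "Deferred" designation could be entered as an ordinary list entry on the relevant axis; the structure records the clinician's conclusions and does nothing to generate them.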

Finally, cultural considerations must always be monitored throughout the diag- 
nostic and treatment processes and are ultimately the responsibility of the clinician. 
The DSM-IV-TR provides discussions of relevant cultural considerations for most of
the disorders, and Appendix I of the DSM-IV-TR contains a glossary of
culture-bound syndromes, including descriptions and relevance to psychopathology.






[Figure 7.2 Flow chart for diagnostic decision making (figure not reproduced)]



Think About It 7.3 Think of a client or associate who is experiencing a 
mental or emotional problem. What is the problem, its severity, and its envi- 
ronmental influences and consequences? How can you explain the difficul- 
ties from a developmental perspective? What approach(es) could you use to 
help? Next, using the DSM-IV-TR, attempt to understand the individual's 
issue using the multiaxial system. What treatment approach(es) could be 
used? Finally, what similarities and differences did you note between the de- 
velopmental and clinical approaches employed? 



USING CLINICAL INVENTORIES AND TESTS IN COUNSELING

Information Sources for Clinical and Personality Assessment



Piedmont (2006) suggested that information on clients be gathered through four 
different and complementary sources: life outcomes, observer ratings, self-report
ratings, and test data (LOST). Each information source has strengths and limitations,
but accessing information from each source frequently provides a synergistic effect 
that offers a balanced and confirmatory approach to a comprehensive evaluation. 
Life outcomes data include the factual information about a client that can often be 
collected during an intake interview: "Has the client ever been married?" "How 
many children?" "Has the client ever received counseling services in the past?" "If so, 
for what and with what result?" Each question reveals certain factual information the 
professional counselor needs to understand the client's life history, properly diagnose 
or understand current complaints, and develop an effective treatment plan. 
Generally, life outcome data are factual, unambiguous, and objective, although con- 
firmation of client report is always advisable. These data can be obtained from school 
or medical records, legal or civic records, directly from the client during a written or 
oral intake interview, or through direct assessment during a structured or semi-struc- 
tured clinical interview. Comprehensive attempts at structuring the collection of per- 
sonal histories include the Personal History Checklist (Schinka, 1989) and Mental 
Status Checklist (Schinka, 1988). 

Observer ratings involve the report of observations of clients by significant and 
informed people in their lives. Parents and teachers are often in a good position to 
rate and evaluate the behaviors of children. Likewise, spouses and some friends or 
peers may be able to provide helpful insight and observations on an adult client. 
Importantly, a rating scale is an attempt to objectify someone's subjective perceptions. 
As such, caution over the veracity and honesty of the ratings must be taken into ac- 
count. The key is to capture the perceptions of several different sources of information 
so that a clinician can perform cross-validation and determine the robustness or con- 
vergence of various informant perceptions. For example, if a client is referred for de- 
pression, reports she is depressed, and rates herself as depressed, she may very well be 
depressed. But if her parents and teachers do not rate her as being depressed, it is likely 
that something more complex is occurring. If, on the other hand, her parents and 
teachers confirm her depression, the case becomes clearer. Such is the value of 
other-report observer ratings. While observer ratings can be just as biased as 
self-report ratings, the bias is of a different type and therefore usually adds more 
clarity than confusion. Indeed, McCrae & Costa (1987) and Piedmont (1994) reported convergence of 
perspectives of observers to be robust and helpful in confirming personality traits. 
Many clinical and personality tests have observer report versions, and more of these 
include validity scales to help clinicians determine the veracity of results. 

Self-report ratings are most commonly used in clinical and personality assess- 
ment because professional counselors nearly always have direct access to the client, 
and client perspectives are essential to effective treatment planning, even in cases 
when they are less than cooperative. Self-report instruments are frequently referred 
to as objective tests, even though, like observer ratings, professional counselors are 
best advised to view them as attempts to objectify the subjective perceptions of 
clients. Some clients present themselves in a biased manner, and clinicians must be 
wary of the impact such bias may have on test results. Many self-report scales include 
validity scales to help clinicians determine likely inaccuracies of self-perception and 
outright dishonesty in a client's self-presentation. In spite of the potential limitation 
of bias, self-report rating scales have two major strengths. They allow: (1) compari- 
son of a client's self-ratings to a norm sample (i.e., are norm referenced), and (2) di- 
rect assessment of client thoughts, feelings, and behaviors, which are all facets of a 
client's mental state and personality functioning (Piedmont, 2006). Many of the 
tests reviewed throughout the remainder of this chapter are self-report inventories. 

Test data involve the use of instruments to directly assess client functioning. 
Importantly, such instrumentation measures information that clients either do not 
know they are producing, or are unaware of how the information will be interpreted. 
Physiological measures fall into this category (e.g., galvanic skin response, electro- 
cardiogram). In clinical and personality assessment, projective tests are examples of 
collecting test data. In general, projective tests present a client with ambiguous stim- 
uli, such as inkblots, incomplete sentences, or pictures about which a client tells a 
story. The client, unaware of the purpose of the activity or the meaning of responses, 
projects thoughts and feelings onto the stimuli. Clinicians then interpret these re- 
sponses to understand the client's underlying needs, drives, motivations, thoughts, 
and emotions. Test data have the advantage of being difficult to "fake," thus reduc- 
ing the opportunity to bias the results. While projective test data are certainly used 
by some clinicians for diagnostic purposes, the psychometric properties of most pro- 
jective tests do not support their use for this purpose. On the other hand, rich de- 
scription and understanding of client personality can often be derived from projec- 
tive techniques by skilled professional counselors. Thus an expanded discussion of 
projective assessment and commonly used projective tests will be provided at the end 
of this chapter within the context of personality assessment. 



How Clinical and Personality Test Content Is Developed 



Clinical and personality inventories are generally multidimensional tests composed of 
several to numerous scales. Each of these scales is supposed to provide a helpful addi- 
tion to the overall test, usually measuring some unique or important facet of the over- 
all construct being measured. Four primary methods are used to construct clinical and 
personality inventories: content validation, theory, empirical-criterion keying, and 
factor analysis. Content validation relies on the logical process of deductive reasoning 
to determine the items that are assigned to a given scale. Each item under considera- 
tion may be included on the scale if the test developer determines (through logical 
analysis) that it contributes to the measurement of the concept under study (e.g., 
Major Depression, Schizophrenia, Generalized Anxiety Disorder). Scales such as the 
Woodworth Personal Data Sheet and the Edwards Personal Preference Schedule were 
constructed using the content validation method. 

Theories are sometimes used to develop test items and scales. The theory guides 
item development and categorical assignments to potential subscales. An example of 
a popular test designed using an underlying theory is the Myers-Briggs Type Indicator 
(MBTI), which is based on Jung's theory. To be fair, many other inventories also use 
content validation of a theory at an early phase of test development but subsequently 
use one of the next two procedures to complete the instrument design (see the dis- 
cussion of bootstrapping included in Chapter 6). 

Empirical-criterion keying is a procedure in which selected items are adminis- 
tered to both nonclinical samples (individuals without the diagnosis) and clinical 
samples (individuals with the diagnosis). While this process can sometimes use 
complex analyses, simply put, the items that identify the clinical group and not the 
nonclinical group are selected to comprise that particular clinical scale. The MMPI-2, 
MMPI-A, and California Psychological Inventory are among the better-known tests 
using the empirical-criterion keying method. For example, the MMPI-2 Depression 
clinical scale (D) comprises 57 items, many of which are obviously related to 
depression (i.e., have face validity) and some that leave examiners wondering how 
the item could possibly be related to depression. What is the "rational" or "logical" 
connection? The connection is that the individuals with depression comprising the 
clinical sample endorsed the item significantly more frequently than the nonclinical 
sample of individuals without depression. Thus the "logic" is that there is something 
about the item that makes it relate to responses of individuals with depression, even 
though the link may not be obvious or rationally determined. 
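
The item-selection logic of empirical-criterion keying can be illustrated with a minimal Python sketch. It assumes hypothetical counts of "true" responses in a clinical and a nonclinical sample and compares endorsement rates with a one-sided two-proportion z-test. The item names, sample sizes, counts, and the 0.01 cutoff are all illustrative assumptions, not the procedure used to build any published scale.

```python
import math

def endorsement_z(clin_yes, clin_n, non_yes, non_n):
    """z statistic comparing clinical and nonclinical endorsement rates (pooled test)."""
    p_clin = clin_yes / clin_n
    p_non = non_yes / non_n
    pooled = (clin_yes + non_yes) / (clin_n + non_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / clin_n + 1 / non_n))
    return (p_clin - p_non) / se if se > 0 else 0.0

def one_sided_p(z):
    """Upper-tail probability under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def key_items(counts, clin_n, non_n, alpha=0.01):
    """Keep items endorsed significantly more often by the clinical sample."""
    selected = []
    for item, (clin_yes, non_yes) in counts.items():
        z = endorsement_z(clin_yes, clin_n, non_yes, non_n)
        if one_sided_p(z) < alpha:
            selected.append(item)
    return selected

# Hypothetical counts of "true" responses: (clinical sample, nonclinical sample)
counts = {
    "item_a": (70, 20),  # content obviously related to the diagnosis
    "item_b": (55, 30),  # no obvious link, yet endorsement rates differ
    "item_c": (41, 40),  # endorsed about equally by both groups
}
keyed = key_items(counts, clin_n=100, non_n=100)
print(keyed)
```

Note that the selection rule never inspects item content: an item with no apparent connection to the diagnosis is retained whenever the two groups answer it differently, which is the point made in the paragraph above.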

Factor analysis has risen in prominence as a procedure for scale construction 
over the past half century due to the advent of high-speed computers. As described 
in Chapter 6, factor analysis is an item-sorting technique based on item intercorre- 
lations, and the subsequent correlation between each item and derived dimensions 
or components, called factors. The factors are subsequently named and may or may 
not be "pure measures" of any given clinical diagnosis or personality trait. Each fac- 
tor is a statistical entity that has been empirically derived and which can be studied 
and refined through further research and test development. The 16PF and NEO-PI- 
R are examples of empirically derived tests constructed through the use of factor 
analysis. Factor analysis has contributed to an explosion of clinical and personality 
inventories. Of course, the primary criticism of the use of factor analysis is that it de- 
rives statistical models of item relationships, rather than theoretical models of item 
relationships. That is, many test developers put too much faith in factor analysis and 
actually use it to design the test, rather than constructing the test using a theoretical 
model and using factor analysis to explore the dimensions underlying the test and 
confirming the original design. 
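
As a rough illustration of the starting point for factor analysis, the Python sketch below computes only the item intercorrelation matrix from a small set of hypothetical item responses. It does not extract or rotate factors; that step would normally be done with statistical software. The response data and function names are assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def intercorrelations(responses):
    """responses: one row per respondent, one column per item."""
    items = list(zip(*responses))  # transpose so each tuple holds one item's scores
    return [[pearson_r(a, b) for b in items] for a in items]

# Hypothetical ratings (1-5) from six respondents on four items
responses = [
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [2, 1, 4, 5],
    [1, 2, 5, 4],
    [4, 4, 2, 2],
    [2, 2, 5, 4],
]
matrix = intercorrelations(responses)
for row in matrix:
    print(["%.2f" % r for r in row])
```

In this made-up data, items 1 and 2 correlate positively with each other and negatively with items 3 and 4, the kind of clustering that a factor analysis would summarize as separate dimensions.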




As mentioned earlier, some clinical and personality inventories use one or more 
of these four design methodologies. Regardless of the test development procedure, 
numerous studies must be undertaken to explore the reliability and validity of test 
scores across various samples and for various purposes before the test is ready for 
widespread use in clinical decision making. 



SOME COMMONLY USED CLINICAL 
ASSESSMENT INVENTORIES 



Professional counselors in clinical practice may rely heavily on objective clinical in- 
ventories when exploring a client's presenting problem, diagnosing client symptoms, 
developing a treatment plan, and determining the effectiveness of therapeutic 
interventions. Numerous clinical inventories have been developed, and this section 
presents a basic review of more than 15 of those most commonly used by professional 
counselors in clinical practice. As with any of the tests reviewed throughout this 
book, more in-depth information on administration, scoring, interpretation, and 
technical characteristics can be found in the test manual, Mental Measurements 
Yearbook reviews, and the extant literature. 



Minnesota Multiphasic Personality Inventory-Second Edition 
(MMPI-2) 



The Minnesota Multiphasic Personality Inventory-Second Edition (MMPI-2; Butcher 
et al., 1989) is a 567-item, true-false, self-report inventory designed to assess some 
of the major patterns of personality in adults ages 18-90 years. Items measure 6 
validity indicators, 10 clinical scales (see Table 7.10), and numerous supplementary, 
clinical component, content scales, and clinical subscales (see Table 7.11). Some 
advocate for the use of Clinical scale patterns to provide quick insight into client 
diagnosis and personality, rather than relying on interpretation of individual scales. 
Patterns are represented by reporting the Clinical Scale numeric designation for the 
two or three highest scale scores. For example, if the client's highest score is on scale 
2 (Depression), and the client's second highest score is on scale 7 (Psychasthenia), 
the pattern would be "27." Numerous books written about the MMPI and MMPI- 
2 provide interpretive suggestions applicable to pattern analysis. 
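
The pattern notation described above can be sketched in a few lines of Python. The scale numbers follow the Clinical scale designations in Table 7.10 (with Si designated 0); the T scores are hypothetical, and the function name is an illustrative assumption rather than part of any scoring software.

```python
# Clinical scale abbreviations mapped to their numeric designations
SCALE_NUMBERS = {
    "Hs": "1", "D": "2", "Hy": "3", "Pd": "4", "Mf": "5",
    "Pa": "6", "Pt": "7", "Sc": "8", "Ma": "9", "Si": "0",
}

def code_type(t_scores, points=2):
    """Return the numeric pattern for the `points` highest Clinical scale scores.

    Ties keep the order in which scales appear in `t_scores`, because
    Python's sort is stable.
    """
    ranked = sorted(t_scores, key=lambda scale: t_scores[scale], reverse=True)
    return "".join(SCALE_NUMBERS[scale] for scale in ranked[:points])

# Hypothetical profile: highest on Depression (2), second highest on Psychasthenia (7)
profile = {
    "Hs": 55, "D": 78, "Hy": 60, "Pd": 58, "Mf": 50,
    "Pa": 62, "Pt": 74, "Sc": 65, "Ma": 48, "Si": 59,
}
print(code_type(profile))            # two-point pattern
print(code_type(profile, points=3))  # three-point pattern
```

For this hypothetical profile the two-point pattern is "27", matching the example in the text.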

The restandardization sample (n = 2,600) consisted of paid volunteer adults 
(1,138 men and 1,462 women) recruited from seven states, a federal Indian reserva- 
tion, and four military bases via random mailings and advertisements. Biographical 
data and information about recent stressful life events were also collected (Nichols, 
1992). Hispanic and Asian American subgroups were underrepresented in the 
normative sample, whereas Native Americans were overrepresented (Butcher et al., 2001). 

The MMPI-2 takes about 60 to 90 minutes to complete and can be scored by 
hand in 30 to 60 minutes, or in about 5 minutes by computer. Sample items 
include "Spirits sometimes speak to me," "I am as happy as others seem to be," and 
"I dread the thought of a hurricane." Convenient score profiles are available to plot 
and transform raw scores into T scores. Test-retest coefficients based on 82 males 
and 111 females with a median interval of seven days ranged from 0.54 (females 
on the Sc scale) to 0.93 (males on the Si scale) on the Clinical scales, 0.77 (males 
on the BIZ scale) to 0.91 (males and females on the SOD scale) on the Content 
scales, and 0.63 (males on the MAC-R scale) to 0.91 (males and females on the A 
scale) on Supplementary scales. Internal consistency estimates ranged from 0.56 
to 0.87 (except for the Pa scale, which yielded coefficients 0.34 for males and 0.39 
for females) on the Clinical scales and 0.68 (females on the TPA scale) to 0.86 
(males and females on the CYN and DEP scales, respectively) on the Content scales 
(Butcher et al., 2001). 
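
A test-retest coefficient of the kind reported above is typically the correlation between scores from two administrations of the same scale to the same people. The Python sketch below shows that calculation as a Pearson correlation; the six pairs of scores are hypothetical and are not MMPI-2 data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between first and second administration scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scale scores for six clients tested twice, one week apart
first_testing = [52, 61, 47, 70, 58, 66]
second_testing = [55, 59, 50, 68, 60, 63]

r = pearson_r(first_testing, second_testing)
print(round(r, 2))
```

A coefficient near 1.0 indicates that clients kept nearly the same rank order across the two testings; lower values, such as some of those reported above, indicate less stable scores.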

In general, caution is warranted when using the MMPI-2 for diagnostic purposes. 
Low scale reliabilities (<0.90) make the MMPI-2 more helpful as a test for 
understanding individual pathology and exploring intrapersonal hypotheses than for 
making diagnoses. The MMPI-2 is a Level C instrument and requires proficiency in 
reading English at the 8th-grade level. The clinician should note the inclusion of 
several helpful validity scales. The L scale identifies individuals presenting themselves 
in a favorable light, the K scale is a measure of defensiveness, and the F scale is 
designed to detect clients who randomly respond, cannot understand the items, or are 
attempting to fake bad (Erford, 2006). The VRIN and TRIN (validity scales) help 
determine if a subject responded in an inconsistent or contradictory way. Although 
the MMPI-A (Adolescent version) is designed for adolescents ages 14-18 years, the 
MMPI-2 is more appropriate for 18-year-olds living independently from their 
parents (Butcher et al., 2001). Clinicians should also note that Hispanics, Asian 
Americans, and older women were underrepresented in the restandardization of the 
MMPI-2. Likewise, clients who fit within the lowest educational and occupational 
levels might not be appropriate candidates for the MMPI-2 because of their 
underrepresentation within the normative restandardization sample (Nichols, 1992). The 
MMPI-2 is available on audiocassette and computer-adapted software and in 
Spanish, French, and the Hmong languages. 



Table 7.10 MMPI-2 Clinical scale descriptions 

Clinical scale designations        Description 

1  Hs  Hypochondriasis             Excessive health concerns, somatic complaints, narcissism, 
                                   self-centeredness 
2  D   Depression                  Depression, brooding, discouragement, pessimism, hopelessness 
3  Hy  Hysteria                    Sensory or physical complaints of no organic cause, immaturity, 
                                   physical complaints, denial of aggression, need for affection 
4  Pd  Psychopathic deviation      Antisocial/Asocial behavior, impulsivity, immaturity, lack of 
                                   concern over social and moral standards of conduct 
5  Mf  Masculinity/Femininity      Masculine and feminine interests 
6  Pa  Paranoia                    Paranoia, suspicion, hostility, psychotic behavior, cynicism, 
                                   excessive moral virtue 
7  Pt  Psychasthenia               Anxiety, obsessions, compulsions, exaggerated fears, difficulty 
                                   concentrating, physical complaints 
8  Sc  Schizophrenia               Withdrawal, social/emotional alienation, thought disturbance, 
                                   bizarre sensory experiences, lack of ego mastery 
9  Ma  Hypomania                   High energy, elated mood, low frustration tolerance, denial of 
                                   social anxiety 
0  Si  Social introversion         Introversion, shyness, neurotic maladjustment, self-depreciation 

Source: Manual for Administration, Scoring, and Interpretation of the Minnesota Multiphasic 
Personality Inventory-Third Edition by Butcher et al. (2001). Minneapolis: University of Minnesota Press. 



Table 7.11 Scales and subscales derived from MMPI-2 items 

Validity scales 
    ?        Cannot Say (reported as a raw score only, not plotted) 
    VRIN     Variable response inconsistency 
    TRIN     True response inconsistency 
    F        Infrequency 
    FB       Back F 
    Fp       Infrequency-Psychopathology 
    L        Lie 
    K        Correction 
    S        Superlative self-presentation 

Superlative self-presentation subscales 
    S1       Beliefs in human goodness 
    S2       Serenity 
    S3       Contentment with life 
    S4       Patience/Denial of irritability 
    S5       Denial of moral flaws 

Clinical scales 
    1 Hs     Hypochondriasis 
    2 D      Depression 
    3 Hy     Hysteria 
    4 Pd     Psychopathic deviate 
    5 Mf     Masculinity-Femininity 
    6 Pa     Paranoia 
    7 Pt     Psychasthenia 
    8 Sc     Schizophrenia 
    9 Ma     Hypomania 
    0 Si     Social introversion 

RC (Restructured clinical) Scales 
    RCd  dem   Demoralization 
    RC1  som   Somatic complaints 
    RC2  lpe   Low positive emotions 
    RC3  cyn   Cynicism 
    RC4  asb   Antisocial behavior 
    RC6  per   Ideas of persecution 
    RC7  dne   Dysfunctional negative emotions 
    RC8  abx   Aberrant experiences 
    RC9  hpm   Hypomanic activation 

Clinical subscales 

Harris-Lingoes subscales 
    D1       Subjective depression 
    D2       Psychomotor retardation 
    D3       Physical malfunctioning 
    D4       Mental dullness 
    D5       Brooding 
    Hy1      Denial of social anxiety 
    Hy2      Need for affection 
    Hy3      Lassitude-Malaise 
    Hy4      Somatic complaints 
    Hy5      Inhibition of aggression 
    Pd1      Familial discord 
    Pd2      Authority problems 
    Pd3      Social imperturbability 
    Pd4      Social alienation 
    Pd5      Self-alienation 
    Pa1      Persecutory ideas 
    Pa2      Poignancy 
    Pa3      Naivete 
    Sc1      Social alienation 
    Sc2      Emotional alienation 
    Sc3      Lack of ego mastery-cognitive 
    Sc4      Lack of ego mastery-conative 
    Sc5      Lack of ego mastery-defective inhibition 
    Sc6      Bizarre sensory experiences 
    Ma1      Amorality 
    Ma2      Psychomotor acceleration 
    Ma3      Imperturbability 
    Ma4      Ego inflation 

Social introversion subscales 
    Si1      Shyness/Self-consciousness 
    Si2      Social avoidance 
    Si3      Alienation - self and others 

Content scales 
    ANX      Anxiety 
    FRS      Fears 
    OBS      Obsessiveness 
    DEP      Depression 
    HEA      Health concerns 
    BIZ      Bizarre mentation 
    ANG      Anger 
    CYN      Cynicism 
    ASP      Antisocial practices 
    TPA      Type A 
    LSE      Low self-esteem 
    SOD      Social discomfort 
    FAM      Family problems 
    WRK      Work interference 
    TRT      Negative treatment indicators 

Content component scales 

Fears subscales 
    FRS1     Generalized fearfulness 
    FRS2     Multiple fears 

Depression subscales 
    DEP1     Lack of drive 
    DEP2     Dysphoria 
    DEP3     Self-depreciation 
    DEP4     Suicidal ideation 

Health concerns subscales 
    HEA1     Gastrointestinal symptoms 
    HEA2     Neurological symptoms 
    HEA3     General health concerns 

Bizarre mentation subscales 
    BIZ1     Psychotic symptomatology 
    BIZ2     Schizotypal characteristics 

Anger subscales 
    ANG1     Explosive behavior 
    ANG2     Irritability 

Cynicism subscales 
    CYN1     Misanthropic beliefs 
    CYN2     Interpersonal suspiciousness 

Antisocial practices subscales 
    ASP1     Antisocial attitudes 
    ASP2     Antisocial behavior 

Type A subscales 
    TPA1     Impatience 
    TPA2     Competitive drive 

Low self-esteem subscales 
    LSE1     Self-doubt 
    LSE2     Submissiveness 

Social discomfort 
    SOD1     Introversion 
    SOD2     Shyness 

Family problems 
    FAM1     Family discord 
    FAM2     Familial alienation 

Negative treatment indicators 
    TRT1     Low motivation 
    TRT2     Inability to disclose 

Supplementary scales 

Personality psychopathology five scales (PSY-5) 
    AGGR     Aggressiveness 
    PSYC     Psychoticism 
    DISC     Disconstraint 
    NEGE     Negative emotionality/Neuroticism 
    INTR     Introversion/Low positive emotionality 

Broad personality characteristics 
    A        Anxiety 
    R        Repression 
    Es       Ego strength 
    Do       Dominance 
    Re       Social responsibility 

Generalized emotional distress 
    Mt       College maladjustment 
    PK       Post-Traumatic Stress Disorder-Keane 
    MDS      Marital distress 

Behavioral dyscontrol 
    Ho       Hostility 
    O-H      Overcontrolled hostility 
    MAC-R    MacAndrew-revised 
    AAS      Addiction admission 
    APS      Addiction potential 

Gender Role 
    GM       Gender role-masculine 
    GF       Gender role-feminine 

Special Indices 
    Welsh Code 
    F-K Dissimulation Index 
    Percent True and Percent False 
    Average Profile Elevation 
    Megargee Offender Classification System 
    P-A-I-N Classification 






Minnesota Multiphasic Personality Inventory-Adolescent 
(MMPI-A) 



The Minnesota Multiphasic Personality Inventory-Adolescent (MMPI-A; Butcher et 
al., 1992) is a 478-item, true-false, self-report inventory designed for use with 
adolescents ages 14-18 years to assess some of the major patterns of personality and 
emotional disorders. The derived scales are very similar to the MMPI-2 scales listed 
in Table 7.10. Items measure 6 Validity Scales, 10 Clinical Scales, 15 Content Scales, 
6 Supplementary Scales, and about 30 Harris-Lingoes scales. Table 7.12 provides a 
sample computerized interpretive report from the Pearson software package. As with 
any test, it is essential that any statements from computerized sources be validated 
with other clinical information. The normative sample (n = 1,620) was very diverse, 
although it may have oversampled a more educated population. It consisted of male 
(n = 805) and female (n = 815) adolescents ages 14-18 years living in eight U.S. 
states; one state's sample was from an American Indian reservation. There was also a 
large adolescent clinical population (n = 703). Most of these subjects were paid to 
complete the test (Butcher et al., 1992). This inventory requires a 6th-grade English 
reading level. 

Raw scores are converted to Uniform T percentile-comparable scores for inter- 
pretation through use of convenient profile forms. Different scoring keys are used 
according to gender. The MMPI-A may take up to three hours to complete and can 
be scored by hand or computer. It is a Level C instrument. Sample items include 
"I'm afraid to go home," "Others do not really love me," and "I feel uneasy out- 
doors." Test-retest reliability results range from 0.65 to 0.84 for the Clinical scales 
(Butcher et al., 1992). Strong internal consistency coefficients were reported for 4 of 
the 15 basic and clinical scales (r = 0.80+); 7 of 15 were between r = 0.60 and 0.80. 
Two response set indicators (VRIN and TRIN) are validity scales that show a 
respondent's patterns of responding in an inconsistent or contradictory manner (Butcher et 
al., 1992). The MMPI-A is one of the only adolescent clinical inventories to compre- 
hensively incorporate a number of validity scales to evaluate client response sets 
(Archer & Krishnamurthy, 2002). Unfortunately, fewer MMPI-A items demonstrate 
the same discriminative value in differentiating clients from normal and clinical sam- 
ples than the adult version of the test (Archer & Handel, 2001). 

Bright 12- and 13-year-olds can also be tested, as well as 18-year-olds who have 
completed high school (Lanyon, 1995). As a Level C instrument, examiners are re- 
quired to undergo training and supervision prior to administration, scoring, and in- 
terpretation of this test (Butcher et al., 1992). The MMPI-A has a number of unique 
features appropriate for its intended use with adolescents, yet several of the scale la- 
bels seem outdated and/or offensive (i.e., Masculine-Feminine, Hypomania, 
Hysteria, and Psychopathic Deviate) (Claiborn, 1995). "Clinicians should recognize 
that not all adolescents have the necessary skills to complete the MMPI-A" if their 
reading comprehension skills are inadequate or if their cultural background and life 
experiences are out of the range of the test (Butcher et al., 1992, p. 27). (Special 
learning problems and English as a second language may prohibit the prerequisite 
reading comprehension, including idioms or other cultural meanings.) It may be 
prudent to break the testing up into smaller sessions because some adolescents may 






Table 7.12 MMPI-A The Minnesota Report: Adolescent System Interpretive Report 
by Butcher & Williams for Rachel, female, age 15, outpatient mental 
health center 

Validity Considerations 

She had a tendency to inconsistently respond True without adequate attention to item meaning. 
Although her TRIN score is not elevated enough to invalidate her MMPI-A, caution is suggested 
in interpreting and using the resulting profiles (see Figure 7.3). 

Symptomatic Behavior 

This adolescent is immature, impulsive, and hedonistic, and she frequently rebels against 
authority. She may be hostile, aggressive, and frustrated. She seems unable to learn from 
punishing experiences and repeatedly gets into the same type of trouble. Many young people 
with this clinical profile develop severe acting-out problems and have legal, family, or school 
difficulties. This individual's nonconforming and impulsive lifestyle probably includes alcohol or 
drug problems. 

Many externalizing behavior problems are likely. Her friends are frequently in trouble. 
They may cheat others and lie to avoid problems. They show little remorse for their misbehavior. 
If their difficulties pile up, they may run away. 

The highest clinical scale (see Figure 7.4) in her MMPI-A clinical profile, Pd, occurs with 
very high frequency in adolescent alcohol/drug or psychiatric treatment units. Over 24% of girls 
in treatment settings have this well-defined peak score (i.e., with the Pd scale at least 5 points 
higher than the next scale). The Pd scale is among the least frequently occurring peak elevations 
in the normative girls' sample (about 3%). 

Her MMPI-A Content scales profile (see Figure 7.5) reveals important areas to consider in 
her evaluation. She endorsed a number of very negative opinions about herself. She reported 
feeling unattractive, lacking self-confidence, feeling useless, having little ability and several faults, 
and not being able to do anything well. She may be easily dominated by others. 

She reported numerous problems in school, both academic and behavioral. She has limited 
expectations of success in school and is not very interested or invested in succeeding. She 
reported several symptoms of anxiety, including tension, worries, and difficulties sleeping. 
Symptoms of depression were reported. 

Interpersonal Relations 

She may appear charming and tends to make a good first impression, but she is selfish, 
hedonistic, and untrustworthy in interpersonal relations. She seems interested only in her own 
pleasure and is insensitive to the needs of others. She seems unable to experience guilt over 
causing others trouble. 

Because she is unable to form stable, warm relationships, her current relationships are likely 
to be quite strained. In addition, she is likely to be openly hostile and resentful at times. 

Some interpersonal issues are suggested by her MMPI-A Content scales profile. Family 
problems are quite significant in this person's life. She reports numerous problems with her 
parents and other family members. She describes her family in terms of discord, jealousy, fault 
finding, anger, serious disagreements, lack of love and understanding, and very limited 
communication. She looks forward to the day when she can leave home for good, and she does 
not feel that she can count on her family in times of trouble. Her parents and she often disagree 
about her friends. She indicates that her parents treat her like a child and frequently punish her 
without cause. Her family problems probably have a negative effect on her behavior in school. 
She feels uncomfortable emotional distance from others. She may believe that other people do 
not like, understand, or care about her. She reports having no one, including parents or friends, 
to rely on. 

Figure 7.3 MMPI-A validity pattern 
(Profile plot; the T-score axis runs from 30 to 110. Plotted values are listed below.) 

Scale    Raw Score    T Score    Response % 
VRIN     6            58         100 
TRIN     13           73         100 
F1       12           79         100 
F2       10           62         100 
F        22           70         100 
L        4            59         100 
K        13           53         100 

Cannot Say (Raw): [value missing in source] 
Percent True: 54 
Percent False: 46 

Figure 7.4 MMPI-A Basic and Supplementary Scales profile 
(Profile plot; the T-score axis runs from 30 to 110. Plotted values are listed below.) 

Scale    Raw Score    T Score    Response % 
Hs       6            44         100 
D        25           57         100 
Hy       26           55         100 
Pd       38           84         100 
Mf       25           59         100 
Pa       20           68         100 
Pt       29           59         100 
Sc       40           68         100 
Ma       28           64         100 
Si       27           50         100 
MAC-R    23           58         100 
ACK      1            39         100 
PRO      24           67         100 
IMM      27           74         100 
A        28           64         100 
R        13           49         100 

Figure 7.5 MMPI-A Content Scales profile 
(Profile plot; the T-score axis runs from 30 to 110. Plotted values are listed below.) 

Scale    Raw Score    T Score    Response % 
ANX      15           65         100 
OBS      12           64         100 
DEP      18           68         100 
HEA      5            44         100 
ALN      12           69         100 
BIZ      4            50         100 
ANG      9            49         100 
CYN      14           50         100 
CON      13           63         100 
LSE      14           77         100 
LAS      10           66         100 
SOD      4            43         100 
FAM      23           73         100 
SCH      11           67         100 
TRT      15           64         100 

Behavioral Stability 

The relative elevation of the highest scale (Pd) in her clinical profile shows very high profile 
definition. Her peak scores are likely to remain very prominent in her profile pattern if she is 
retested at a later date. Her clinical profile tends to be associated with long-standing behavior 
problems. 

Diagnostic Considerations 

A diagnosis of one of the disruptive behavior disorders is highly likely given her elevations on Pd 
and A-con. 

Given her elevation on the School Problems scale, her diagnostic evaluation could include 
assessment of possible academic skills deficits and behavior problems. Academic 
underachievement, a general lack of interest in any school activities, and low expectations of 
success are likely to play a role in her problems. Her endorsement of a significant number of 
depressive symptoms should be considered when arriving at a diagnosis. 

She appears to be having difficulties that may involve the use of alcohol or other drugs. 
Adolescents with high scores on the PRO scale are usually involved with a peer group that uses 
alcohol or other drugs. This individual's involvement in an alcohol- or drug-using lifestyle 
should be further evaluated. Her use of alcohol or other drugs may be contributing to problems 
at home or in school. However, she has not acknowledged through her item responses that she 
has problems with alcohol or other drugs. 






Treatment Considerations 

Her serious conduct disturbance should figure prominently in any treatment planning. Her 
Clinical scales profile suggests that she is a poor candidate for traditional, insight-oriented 
psychotherapy. A behavioral strategy is suggested. Clearly stated contingencies that are 
consistently followed are important for shaping more appropriate behaviors. Punishment 
techniques seem to have more limited success than positive rewards for appropriate behaviors. 
Treatment in a more controlled setting may need to be considered if there is no improvement in 
her behavior. 

Her very high potential for developing alcohol or drug problems requires attention in 
therapy if important life changes are to be made. However, her relatively low awareness of or 
reluctance to acknowledge problems in this area might impede treatment efforts. 

She should be evaluated for the presence of suicidal thoughts and any possible suicidal 
behaviors. If she is at risk, appropriate precautions should be taken. 

Her family situation, which is full of conflict, should be considered in her treatment 
planning. Family therapy may be helpful if her parents or guardians are willing and able to work 
on conflict resolution. However, if family therapy is not feasible, it may be profitable during the 
course of her treatment to explore her considerable anger at and disappointment in her family. 
Alternate sources of emotional support from adults (e.g., foster parent, teacher, other relative, 
friend's parent, or neighbor) could be explored and facilitated in the absence of caring parents. 

There are some symptom areas suggested by the Content scales profile that the therapist 
may wish to consider in initial treatment sessions. Her endorsement of internalizing symptoms 
of anxiety and depression could be explored further. 

She endorsed some items that indicate possible difficulties in establishing a therapeutic 
relationship. She may be reluctant to self-disclose, she may be distrustful of helping professionals 
and others, and she may believe that her problems cannot be solved. She may be unwilling to 
assume responsibility for behavior change or to plan for her future. 

This adolescent's emotional distance and discomfort in interpersonal situations must be 
considered in developing a treatment plan. She may have difficulty self-disclosing, especially in 
groups. She may not appreciate receiving feedback from others about her behavior or problems. 

Note: This MMPI-A interpretation can serve as a useful source of hypotheses about adolescent 
clients. This report is based on objectively derived scale indexes and scale interpretations that 
have been developed with diverse groups of clients from adolescent treatment settings. The 
personality descriptions, inferences, and recommendations contained herein need to be verified 
by other sources of clinical information because individual clients may not fully match the 
prototype. The information in this report should most appropriately be used by a trained, 
qualified test interpreter. The information contained in this report should be considered 
confidential. 



Source: MMPI-A, Minnesota Multiphasic Personality Inventory — Adolescent and The Minnesota Report 
trademarks of the Regents of the University of Minnesota. Distributed exclusively by NCS Pearson, Inc., 
Minneapolis, MN. Copyright 1992 license from the Regents of the University of Minnesota. All rights 
reserved. Reprinted by permission of the University of Minnesota. 



be too easily distracted or unable to complete the test in one sitting (Butcher et al., 
1992). The MMPI-A is a good tool that can help to measure psychopathology in 
adolescents (Archer & Krishnamurthy, 2002; Claiborn, 1995) and is very useful in 
planning, directing, and evaluating treatment (Lanyon, 1995). 






Millon Clinical Multiaxial Inventory-III (MCMI-III) 



The Millon Clinical Multiaxial Inventory-III (MCMI-III) (Millon, Davis, & 
Millon, 1997) is a 175-item, true-false, self-report inventory designed to provide 
diagnostic and treatment information to clinicians in the areas of personality disorders 
and clinical syndromes. Scale items measure 11 Clinical Personality Patterns 
(Schizoid, Avoidant, Depressive, Dependent, Histrionic, Narcissistic, Antisocial, 
Aggressive [Sadistic], Compulsive, Passive-Aggressive [Negativistic], Self-Defeating); 
3 Severe Personality Pathologies (Schizotypal, Borderline, Paranoid); 7 Clinical 
Syndromes (Anxiety, Somatoform, Bipolar: Manic, Dysthymia, Alcohol 
Dependence, Drug Dependence, Post-Traumatic Stress Disorder); 3 Severe Clinical 
Syndromes (Thought Disorder, Major Depression, Delusional Disorder); and 4 
Modifying Indices (Disclosure, Desirability, Debasement, Validity). These scales are 
grouped to reflect distinctions between acute clinical disorders pertinent to the 
DSM-IV Axis I and the enduring personality characteristics found on DSM-IV Axis 
II (Millon et al., 1997). The total normative population (n = 998) consisted of male 
and female volunteer adults ages 18-88 years from 26 states and Canada 
(development sample n = 600 and cross-validation sample n = 398). 

Except for Scale V (Validity) raw scores, raw scores are converted to Base Rate 
(BR) scores for interpretation. Different BR transformation tables are used for males 
and females and provide cutoff points on the continuums for the 24 clinical scales 
(BR 0 = raw score 0, BR 60 = median raw score, BR 115 = highest raw score). A BR 
score of 75 or higher is an indication of psychopathology (Millon et al., 1997; 
Erford, 2006). The MCMI-III usually requires about 20 to 30 minutes to complete 
and can be scored by hand and interpreted in about 20 to 40 minutes. It can also be 
sent to the publisher by mail, or scored by onsite computer software in about 5 
minutes (Erford, 2006). Sample items include "I've become very anxious lately," "I often 
feel tired," and "I often make people angry." Internal consistency reliabilities range 
from 0.66 for the Compulsive scale to 0.90 for the Major Depression scale. Twenty 
of the 24 scales have reliabilities of 0.80 or higher. Test-retest reliability results range 
from 0.82 to 0.96 for a 5- to 14-day interval (Millon et al., 1997). The median 
stability coefficient is 0.91, which indicates high stability for use of the test over short 
periods. Criterion-related validity correlations are moderate in magnitude (Erford, 
2006). 
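
The BR cutoff described above can be illustrated with a short sketch. It assumes only the single threshold stated in the text (a BR score of 75 or higher); the function name and the dictionary layout are hypothetical, and the sample values are the Severe Personality Pathology and Severe Clinical Syndromes BR scores from the sample profile in Figure 7.6. It is not a substitute for the publisher's scoring software or for clinical judgment.

```python
# Minimal sketch: flag scales whose Base Rate (BR) score meets the cutoff.
# The cutoff of 75 comes from the text; everything else here is illustrative.
BR_CUTOFF = 75


def flag_elevated(br_scores, cutoff=BR_CUTOFF):
    """Return (scale, BR) pairs at or above the cutoff, highest BR first."""
    elevated = [(name, br) for name, br in br_scores.items() if br >= cutoff]
    return sorted(elevated, key=lambda pair: pair[1], reverse=True)


# BR scores taken from the sample profile (Figure 7.6).
sample_profile = {
    "Schizotypal": 64,
    "Borderline": 95,
    "Paranoid": 70,
    "Thought Disorder": 66,
    "Major Depression": 99,
    "Delusional Disorder": 66,
}

for scale, br in flag_elevated(sample_profile):
    print(f"{scale}: BR {br}")
```

In this sketch only Major Depression and Borderline are flagged, which is consistent with the diagnoses suggested in the sample report. As the text notes, any such flag is a hypothesis to be checked against other clinical information.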

The MCMI-III is designed for adults 18 years and older who are seeking, or are 
in, mental health treatment. Since the MCMI-III is a Level C instrument, examiners 
are required to have "a graduate degree in psychology or a related field, or appropriate 
licensure, a course in testing theory, coursework in personality theory, or 
abnormal psychology, and appropriate experience under supervision" (Erford, 2006, p. 
41). The MCMI-III's theoretical conceptualization and prototypes are familiar to 
many clinicians because they are often covered in graduate coursework and clinical 
literature. "Because it also offers scales measuring clinical syndromes (Axis I of the 
DSM-IV), the diagnostician does not have to resort to a different instrument in order 
to assess those areas of functioning" (Choca, 2001, p. 766). Clinicians can also make 
adjustments to the cutoff scores that place a client along a continuum of pathology 
based on estimates of the prevalence rate within a particular setting or local area 
(Widiger, 2001). Weaknesses of the MCMI-III include the complex hand scoring 
process, overrepresentation of Whites and people who differ in levels of educational 
experience, and underrepresentation of most minority groups (Erford, 2006). Use 
with various cultures (e.g., Korean) must be undertaken with caution (Erford, 2006; 
Gunsalus & Kelly, 2001). Table 7.13 provides a computerized interpretive report for 
the protocol of a 44-year-old, divorced, White female outpatient. As always, 
information from a computerized report must be validated by other clinical information. 

Table 7.13 MCMI-III sample computerized interpretive report of a female, age 44, 
White, divorced outpatient never hospitalized (Millon) 

Capsule summary 

MCMI-III reports are normed on patients who were in the early phases of assessment or 
psychotherapy for emotional discomfort or social difficulties. Respondents who do not fit this 
normative population or who have inappropriately taken the MCMI-III for nonclinical purposes 
may have distorted reports. The MCMI-III report cannot be considered definitive. It should be 
evaluated in conjunction with additional clinical data. The report should be evaluated by a 
mental health clinician trained in the use of psychological tests. The report should not be shown 
to patients or their relatives. 

Interpretive considerations 

The client is a 44-year-old divorced White female. She is currently being seen as an outpatient, 
and she did not identify specific problems and difficulties of an Axis I nature in the demographic 
portion of this test. 

This patient's response style may indicate a tendency to magnify illness, an inclination to 
complain, or feelings of extreme vulnerability associated with a current episode of acute turmoil. 
The patient's scale scores may be somewhat exaggerated, and the interpretations should be read 
with this in mind. 

Profile severity 

On the basis of the test data, it may be assumed that the patient is experiencing a severe mental 
disorder; further professional observation and inpatient care may be appropriate. The text of the 
following interpretive report may need to be modulated upward given this probable level of 
severity. 

Possible diagnoses 

She appears to fit the following Axis II classifications best: Negativistic (Passive-Aggressive) 
Personality Disorder, and Borderline Personality Disorder, with Dependent Personality Traits, 
and Depressive Personality Traits. 

Axis I clinical syndromes are suggested by the client's MCMI-III profile in the areas of 
Major Depression (recurrent, severe, without psychotic features), Generalized Anxiety Disorder, 
and Psychoactive Substance Abuse NOS (see Figure 7.6). 

Therapeutic considerations 

Inconsistent and pessimistic, this patient may expect to be mishandled, if not harmed, even by 
well-intentioned therapists. Sensitive to messages of disapproval and lack of interest, she may 
complain excessively and be irritable and erratic in her relations with therapists. Straightforward 
and consistent communication may moderate her dependent/negativistic attitude. Focused, brief 
treatment approaches are likely to overcome her initial oppositional outlook. 




Category                        Scale   Raw    BR   Diagnostic scale

Modifying Indices                       163    93   Disclosure
                                          4    20   Desirability
                                         28    91   Debasement

Clinical Personality Patterns   1        13    64   Schizoid
                                2A       20    86   Avoidant
                                2B       20    87   Depressive
                                3        22    88   Dependent
                                4         7    16   Histrionic
                                5        12    46   Narcissistic
                                6A       14    66   Antisocial
                                6B       14    56   Sadistic
                                7         8    16   Compulsive
                                8A       24    58   Negativistic
                                8B       13    71   Masochistic

Severe Personality Pathology    S        16    64   Schizotypal
                                C        23    95   Borderline
                                P        15    70   Paranoid

Clinical Syndromes                       17    95   Anxiety Disorder
                                         13    76   Somatoform Disorder
                                         11    63   Bipolar: Manic Disorder
                                         17    76   Dysthymic Disorder
                                          8    61   Alcohol Dependence
                                         14    82   Drug Dependence
                                         18    76   Post-traumatic Stress

Severe Clinical Syndromes       SS       17    66   Thought Disorder
                                CC       21    99   Major Depression
                                PP        7    66   Delusional Disorder

The profile of BR scores is plotted against reference points at BR 60, 75, 85, and 115.

Figure 7.6 MCMI-III profile for female, age 44 

Source: Copyright 1994 DICANDRIEN, Inc. All rights reserved. Reprinted by permission of Pearson Assessments, NCS Pearson, Inc. 
MCMI-III and Millon are trademarks of DICANDRIEN, Inc. 




Response tendencies 

This patient's response style may indicate a broad tendency to magnify the level of experienced 
illness or a characterological inclination to complain or to be self-pitying. On the other hand, 
the response style may convey feelings of extreme vulnerability that are associated with a current 
episode of acute turmoil. Whatever the impetus for the response style, the patient's scale scores, 
particularly those on Axis I, may be somewhat exaggerated, and the interpretation of this profile 
should be made with this consideration in mind. 

The BR scores reported for this individual have been modified to account for the high self- 
revealing inclinations indicated by the high raw score on Scale X (Disclosure) and the psychic 
tension and dejection indicated by the elevations on Scale A (Anxiety) and Scale D (Dysthymia). 

Axis II: Personality patterns 

The following paragraphs refer to those enduring and pervasive personality traits that underlie 
this woman's emotional, cognitive, and interpersonal difficulties. Rather than focus on the 
largely transitory symptoms that make up Axis I clinical syndromes, this section concentrates on 
her more habitual and maladaptive methods of relating, behaving, thinking, and feeling. 

There is reason to believe that at least a moderate level of pathology characterizes the overall 
personality organization of this woman. Defective psychic structures suggest a failure to develop 
adequate internal cohesion and a less than satisfactory hierarchy of coping strategies. This 
woman's foundation for effective intrapsychic regulation and socially acceptable interpersonal 
conduct appears deficient or incompetent. She is subjected to the flux of her own enigmatic 
attitudes and contradictory behavior, and her sense of psychic coherence is often precarious. She 
has probably had a checkered history of disappointments in her personal and family 
relationships. Deficits in her social attainments may also be notable as well as a tendency to 
precipitate self-defeating vicious circles. Earlier aspirations may have resulted in frustrating 
setbacks, and efforts to achieve a consistent niche in life may have failed. Although she is usually 
able to function on a satisfactory basis, she may experience periods of marked emotional, 
cognitive, or behavioral dysfunction. 

The MCMI-III profile of this woman suggests her marked dependency needs, deep and 
variable moods, and impulsive, angry outbursts. She may anxiously seek reassurance from others 
and is especially vulnerable to fear of separation from those who provide support, despite her 
frequent attempts to undo their efforts to be helpful. Dependency fears may compel her to be 
alternately overly compliant, profoundly gloomy, and irrationally argumentative and negativistic. 
Almost seeking to court undeserved blame and criticism, she may appear to find circumstances 
to anchor her feeling that she deserves to suffer. 

She strives at times to be submissive and cooperative, but her behavior has become 
increasingly unpredictable, irritable, and pessimistic. She often seeks to induce guilt in others for 
failing her, as she sees it. Repeatedly struggling to express attitudes contrary to her feelings, she 
may exhibit conflicting emotions simultaneously toward others and herself; most notable are 
love, rage, and guilt. Also notable may be her confusion over her self-image, her highly variable 
energy levels, easy fatigability, and her irregular sleep-wake cycle. 

She is particularly sensitive to external pressure and demands, and she may vacillate 
between being socially agreeable, sullen, self-pitying, irritably aggressive, and contrite. She may 
make irrational and bitter complaints about the lack of care expressed by others and about being 
treated unfairly. This behavior keeps others on edge, never knowing if she will react to them in a 
cooperative or a sulky manner. Although she may make efforts to be obliging and submissive to 
others, she has learned to anticipate disillusioning relationships, and she often creates the 
expected disappointment by constantly questioning and doubting the genuine interest and 
support shown by others. Self-destructive acts and suicidal gestures may be employed to gain 
attention. These irritable testing maneuvers may exasperate and alienate those on whom she 
depends. When threatened by separation and disapproval, she may express guilt, remorse, and 
self-condemnation in the hope of regaining support, reassurance, and sympathy. 

Axis I: Clinical syndromes 

The features and dynamics of the following Axis I clinical syndromes appear worthy of 
description and analysis. They may arise in response to external precipitants but are likely to 
reflect and accentuate several of the more enduring and pervasive aspects of this woman's basic 
personality makeup. 

Testy and demanding, this woman evinces an agitated, major depression that can be noted 
by her daily moodiness and vacillation. She is likely to display a rapidly shifting mix of 
disparaging comments about herself, anxiously expressed suicidal thoughts, and outbursts of 
bitter resentment interwoven with a demanding irritability toward others. Feeling trapped by 
constraints imposed by her circumstances and upset by emotions and thoughts she can neither 
understand nor control, she has turned her reservoir of anger inward, periodically voicing severe 
self-recrimination and self-loathing. These signs of contrition may serve to induce guilt in 
others, an effective manipulation in which she can give a measure of retribution without further 
jeopardizing what she sees as her currently precarious, if not hopeless, situation. 

Failing to keep deep and powerful sources of inner conflict from overwhelming her 
controls, this characteristically difficult and conflicted woman may be experiencing the clinical 
signs of an anxiety disorder. She is unable to rid herself of preoccupations with her tension, 
fearful presentiments, recurring headaches, fatigue, and insomnia, and she is upset by their 
uncharacteristic presence in her life. Feeling at the mercy of unknown and upsetting forces that 
seem to well up within her, she is at a loss as to how to counteract them, but she may exploit 
them to manipulate others or to complain at great length. 

Abuse of either legal or street drugs or both is indicated in the MCMI-III protocol of this 
woman, who is often erratic, irritable, and negativistic. Her use of drugs may be both a 
statement of resentful independence from the constraints of conventional life and a means of 
disjoining her conflicts and liberating her uncharitable impulses toward others. An act of 
assertive defiance that has undertones of self-destruction, her drug abuse may be employed with 
a careless indifference to its consequences. 

Related to but beyond her characteristic level of emotional responsivity, this woman 
appears to have been confronted with an event or events in which she was exposed to a severe 
threat to her life, a traumatic experience that precipitated intense fear or horror on her part. 
Currently, the residuals of this event resemble or symbolize an aspect of the traumatic event. 
Where possible, she seeks to avoid such cues and recollections. Where they cannot be anticipated 
and actively avoided, as in dreams or nightmares, she may become terrified, exhibiting a number 
of symptoms of intense anxiety. Other signs of distress might include difficulty falling asleep, 
outbursts of anger, panic attacks, hypervigilance, exaggerated startle response, or a subjective 
sense of numbing and detachment. 

This moody and conflicted woman's bodily preoccupations and concerns are likely to be 
produced by both physical and psychological factors, resulting in a syndrome of features 
suggestive of a somatoform disorder. Enmeshed in an erratic pattern of resentment and brittle 
emotions, her anxious concerns about her somatic state aggravate her characteristic sullenness, 
leading her to demand attention and special treatment. Not only does she exploit her ailments to 
control the lives of others, but she is also likely to complain of her discomfort in ways that 
induce others to feel guilty. 






Possible DSM-IV multiaxial diagnoses 

The following diagnostic assignments should be considered judgments of personality and clinical 
prototypes that correspond conceptually to formal diagnostic categories. The diagnostic criteria 
and items used in the MCMI-III differ somewhat from those in the DSM-IV, but there are 
sufficient parallels in the MCMI-III items to recommend consideration of the following 
assignments. It should be noted that several DSM-IV Axis I syndromes are not assessed in the 
MCMI-III. Definitive diagnoses must draw on biographical, observational, and interview data in 
addition to self-report inventories such as the MCMI-III. 

Axis I: Clinical syndrome 

The major complaints and behaviors of the patient parallel the following Axis I diagnoses, listed 
in order of their clinical significance and salience. 

296.33 Major Depression (recurrent, severe, without psychotic features) 
300.02 Generalized Anxiety Disorder 
305.90 Psychoactive Substance Abuse NOS 

Axis II: Personality disorders 

Deeply ingrained and pervasive patterns of maladaptive functioning underlie Axis I clinical 
syndromal pictures. The following personality prototypes correspond to the most probable 
DSM-IV diagnoses (Disorders, Traits, Features) that characterize this patient. 
Personality configuration composed of the following: 

301.84 Negativistic (Passive-Aggressive) Personality Disorder 
301.83 Borderline Personality Disorder with Dependent Personality Traits and Depressive 
Personality Traits 

Course: The major personality features described previously reflect long-term or chronic traits 
that are likely to have persisted for several years prior to the present assessment. The clinical 
syndromes described previously tend to be relatively transient, waxing and waning in their 
prominence and intensity depending on the presence of environmental stress. 

Axis IV: Psychosocial and environmental problems 

In completing the MCMI-III this individual identified the following problems that may be 
complicating or exacerbating her present emotional state. They are listed in order of importance 
as indicated by the client. This information should be viewed as a guide for further investigation 
by the clinician. 
None identified 

Treatment guide 

If additional clinical data are supportive of the MCMI-III's hypotheses, it is likely that this 
patient's difficulties can be managed with either brief or extended therapeutic methods. The 
following guide to treatment planning is oriented toward issues and techniques of a short-term 
character, focusing on matters that might call for immediate attention, followed by time-limited 
procedures designed to reduce the likelihood of repeated relapses. 

As a first step, it would appear advisable to implement methods to ameliorate this patient's 
current state of clinical anxiety, depressive hopelessness, or pathological personality functioning 
by the rapid implementation of supportive psychotherapeutic measures. With appropriate 
consultation, targeted psychopharmacologic medications may also be useful at this initial stage. 

Worthy of note is the possibility of a troublesome alcohol and/or substance-abuse disorder. 
If verified, appropriate short-term behavioral management or group therapy programs should be 
rapidly implemented. 

Once this patient's more pressing or acute difficulties are adequately stabilized, attention 
should be directed toward goals that would aid in preventing a recurrence of problems, focusing 
on circumscribed issues and employing delimited methods such as those discussed in the 
following paragraphs. 

A primary short-term goal of treatment with this patient is to aid her in reducing her 
intense ambivalence and growing resentment of others. With an empathic and brief focus, it 
should be possible to sustain a productive, therapeutic relationship. With a therapist who can 
convey genuine caring and firmness, she may be able to overcome her tendency to employ 
maneuvers to test the sincerity and motives of the therapist. Although she will be slow to reveal 
her resentment because she dislikes being viewed as an angry person, it can be brought into the 
open, if advisable, and dealt with in a kind and understanding way. She is not inclined to face 
her ambivalence, but her mixed feelings and attitudes must be a major focus of treatment. To 
prevent her from trying to terminate treatment before improvement occurs or to forestall 
relapses, the therapist should employ brief and circumscribed techniques to counter the patient's 
expectation that supportive figures will ultimately prove disillusioning. 

Circumscribed interpersonal approaches (e.g., Benjamin, Kiesler) may be used to deal with 
the seesaw struggle enacted by the patient in her relationship with her therapist. She may 
alternately exhibit ingratiating submissiveness and a taunting and demanding attitude. Similarly, 
she may solicit the therapist's affections, but when these are expressed, she may reject them, 
voicing doubt about the genuineness of the therapist's feelings. The therapist may use cognitive 
procedures to point out these contradictory attitudes. It is important to keep these 
inconsistencies in focus or the patient may appreciate the therapist's perceptiveness verbally but 
not alter her attitudes. Involved in an unconscious repetition-compulsion in which she recreates 
disillusioning experiences that parallel those of the past, the patient must not only come to 
recognize the expectations cognitively but may be taught to deal with their enactment 
interpersonally. 

Despite her ambivalence and pessimistic outlook, there is good reason to operate on the 
premise that the patient can overcome past disappointments. To capture the love and attention 
only modestly gained in childhood cannot be achieved, although habits that preclude partial 
satisfaction can be altered in the here and now. Toward that end, the therapist must help her 
disentangle needs that are in opposition to one another. For example, she both wants and does 
not want the love of those upon whom she depends. Despite this ambivalence, she enters new 
relationships, such as in therapy, as if an idyllic state could be achieved. She goes through the act 
of seeking a consistent and true source of love, one that will not betray her as she believes her 
parents and others did in the past. Despite this optimism, she remains unsure of the trust she 
can place in others. Mindful of past betrayals and disappointments, she begins to test her new 
relationships to see if they are loyal and faithful. In a parallel manner, she may attempt to irritate 
and frustrate the therapist to check whether he or she will prove to be as fickle and insubstantial 
as others have in the past. It is here that the therapist's warm support and firmness can play a 
significant short-term role in reframing the patient's erroneous expectations and in exhibiting 
consistency in relationship behavior. 

Although the rooted character of these attitudes and behavior will complicate the ease with 
which these therapeutic procedures will progress, short-term and circumscribed cognitive and 
interpersonal therapy techniques may be quite successful. A thorough reconstruction of 
personality may not be necessary to alter the patient's problematic pattern. In this regard, family 
treatment methods that focus on the network of relationships that often sustain her problems 
may prove to be a useful technique. Group methods may also be fruitfully employed to help the 
patient acquire self-control and consistency in close relationships. 






It is advisable that the therapist not set goals too high because the patient may not be able 
to tolerate demands or expectations well. Brief therapeutic efforts should be directed to build the 
patient's trust, to focus on positive traits, and to enhance her confidence and self-esteem. 

Source: MCMI-III and Millon are trademarks of DICANDRIEN, Inc. MCMI-III interpretive report 
copyright 1994 by DICANDRIEN, Inc. All rights reserved. Reprinted by permission of Pearson 
Assessments, NCS Pearson, Inc. 



Millon Adolescent Clinical Inventory (MACI) 



The Millon Adolescent Clinical Inventory (MACI) (Millon, Millon, & Davis, 1993) 
is a 160-item inventory that requires a 6th-grade reading level. The MACI is 
designed to assess an adolescent's personality, along with self-reported concerns and 
clinical syndromes using 27 content scales and 4 response bias scales: Personality 
Patterns, Expressed Concerns, Clinical Syndromes, and Modifying Indices. For 
further breakdown of the scales, see Table 7.14. These scales coordinate with descriptive 
characteristics in recent DSM classifications (Millon et al., 1993). The test was 
normed using 13- to 19-year-olds. The development sample (n = 579) was 54% male 
and 46% female. The two cross-validation samples (n = 139, n = 194) were 53% and 
65% male and 47% and 35% female, respectively (Millon et al., 1993). 
Over 1,000 adolescents and their clinicians from 28 states and Canada were involved 
in the development of the MACI. 
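
The gender composition of the three samples can be pooled with a short weighted-average calculation. The sketch below uses only the sample sizes and percentages reported above; the sample labels and function name are hypothetical, and because the reported percentages are rounded, the pooled figure is approximate. It covers only the three samples listed, not every adolescent involved in the MACI's development.

```python
# Sample sizes and male proportions as reported in the text (Millon et al., 1993).
# The labels "cross-validation A/B" are hypothetical names for illustration.
samples = [
    ("development", 579, 0.54),
    ("cross-validation A", 139, 0.53),
    ("cross-validation B", 194, 0.65),
]


def pooled_proportion(groups):
    """Weight each group's proportion by its sample size."""
    total = sum(n for _, n, _ in groups)
    weighted = sum(n * p for _, n, p in groups)
    return weighted / total


total_n = sum(n for _, n, _ in samples)
pooled_male = pooled_proportion(samples)
print(f"combined n = {total_n}, pooled male proportion = {pooled_male:.3f}")
```

Under these figures the combined samples are roughly 56% male, which is the kind of imbalance a counselor should keep in mind when applying the norms.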

The MACI usually requires about 20 to 40 minutes to complete and can be 
scored by hand in about 20 minutes, sent to the publisher by mail, or scored by 
computer onsite in about 5 minutes (Erford, 2006). Sample items include "I have an 
attractive body," "I go on eating binges frequently," and "I enjoy fighting." Internal 
consistency reliabilities for the development sample range from 0.73 for Scales 
D (Sexual Discomfort) and Y (Desirability) to 0.91 for Scale B (Self-Devaluation). 
Except for Scale W (Reliability) scores, raw scores are converted to Base Rate Scores 
(BRS) for interpretation. Different BR transformation tables are used depending on 
the age and gender of the adolescent and are adjusted to a value that falls between 1 
and 115 (Millon et al., 1993). Internal consistencies for the two cross-validation 
samples combined ranged from 0.69 for Scale D (Sexual Discomfort) to 0.90 for 
Scale B (Self-Devaluation). Internal consistency coefficients for the development 
sample Personality Patterns scales ranged from 0.74 for Scale 3 (Submissive) to 0.90 
for Scale 8B (Self-Demeaning). Test-retest reliability results ranged from 0.57 for 
Scale E (Peer Insecurity) to 0.92 for Scale 9 (Borderline Tendency) for a 3- to 7-day 
interval. The median stability coefficient is reported as 0.82 (Millon et al., 1993). 
Criterion-related validity correlations are moderate in magnitude (Erford, 2006). 
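
For readers who want to see how an internal consistency coefficient of the kind reported above is obtained, the following sketch computes coefficient alpha from a small set of item responses. The text does not state which internal consistency statistic the manual used, so treating it as coefficient alpha is an assumption; the function name and the response data are invented for illustration and are not MACI items.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical true-false (1/0) answers from six respondents to a 3-item scale.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(round(cronbach_alpha(example), 2))
```

Higher values mean the items on a scale tend to rise and fall together, which is why the coefficients in the 0.80s and 0.90s reported for most scales are read as evidence of homogeneous item content.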

The MACI is designed for use with emotionally disturbed adolescents ages 
13-19 years as an aid to help identify, predict, and understand some of the 
psychological difficulties this group experiences. Since this is a Level C instrument, 
examiners are required to have "a graduate degree in psychology or a related field, or 
appropriate licensure, a course in testing theory, coursework in personality theory, or 
abnormal psychology, and appropriate experience under supervision" (Erford, 2006, 
p. 41). Strengths of the MACI include ease of scoring and interpretation, 
personality variables mapped to DSM personality disorders, appropriateness of concerns 
frequently expressed by emotionally disturbed adolescents, and identification of 
important clinical syndromes (Retzlaff, 1995). Clinicians using the computer interpretive 
report are likely to find the response cover sheet, printout, histographic display, 
narrative, and list of correlated Axis I and II entities useful (Stuart, 1995). Weaknesses 
of the MACI include the underrepresentation of participants ages 18-19 years in the 
normative samples (Stuart, 1995). The manual clearly stated that use of the MACI 
for any population outside the 13-19 age designation would be inappropriate 
(Millon et al., 1993). There is a lack of item and scale specificity because 160 items 
attempt to score 30 scales (Retzlaff, 1995). Also, overrepresentation of Whites (78.8%) 
(Stuart, 1995) and males in the normative sample may make it less appropriate for use with 
some populations (Millon et al., 1993). Lastly, it may not be particularly useful as a 
screening-level test for the general adolescent population because the norming 
sample did not include adolescents not identified as patients in treatment programs 
(Stuart, 1995). Overall, the best use of the MACI is for hypothesis generation and 
validation, outcomes assessment, and screening for pathology, not for diagnosis. 

Table 7.14 Response bias scales and content scales 

Personality patterns 
Scale 1: Introversive 
Scale 2A: Inhibited 
Scale 2B: Doleful 
Scale 3: Submissive 
Scale 4: Dramatizing 
Scale 5: Egotistic 
Scale 6A: Unruly 
Scale 6B: Forceful 
Scale 7: Conforming 
Scale 8A: Oppositional 
Scale 8B: Self-Demeaning 
Scale 9: Borderline tendency 

Expressed concerns 
Scale A: Identity diffusion 
Scale B: Self-devaluation 
Scale C: Body disapproval 
Scale D: Sexual discomfort 
Scale E: Peer insecurity 
Scale F: Social insensitivity 
Scale G: Family discord 
Scale H: Childhood abuse 

Clinical syndromes 
Scale AA: Eating dysfunctions 
Scale BB: Substance-abuse proneness 
Scale CC: Delinquent predisposition 
Scale DD: Impulsive propensity 
Scale EE: Anxious feelings 
Scale FF: Depressive affect 
Scale GG: Suicidal tendency 

Modifying indices 
Scale X: Disclosure 
Scale Y: Desirability 
Scale Z: Debasement 

Other 
Scale W: Reliability 



Achenbach System of Empirically Based Assessment (ASEBA) 



The Achenbach System of Empirically Based Assessment (ASEBA) (Achenbach &
Rescorla, 2001) comprises two series of multi-informant inventories for rating behavior,
one for children ages 1½-5 years and another for children ages 6-18 years. Each is
designed to assess competencies, adaptive functioning, and other problems through the use
of four forms: the Child Behavior Checklist (CBCL/1½-5 and CBCL/6-18; i.e., parent report
forms), the Youth Self-Report (YSR; for children ages 11-18 years), and the Teacher's
Report Form (TRF). Items measure six DSM-oriented scales that include Affective
Problems, Anxiety Problems, Attention Deficit/Hyperactivity Problems, Conduct
Problems, Oppositional Defiant Problems, and Somatic Problems (Achenbach &
Rescorla, 2001). Informants are prompted to rate items as 0 (Not True), 1 (Somewhat
or Sometimes True), or 2 (Very True or Often True) and are invited to describe several
selections in detail. Item prompts include "Physically attacks people,"
"Inattentive," and "Wets the bed." The ASEBA can be completed by hand, on computer,
or online via the ASEBA Web-Link (www.aseba.org), which permits access to
informants in remote areas. This test takes about 15-20 minutes to complete and
can be scored by hand or computer.

[Figures 7.7 and 7.8: sample ASEBA profile forms]

Test-retest reliability coefficients for intervals of 8-16 days were mostly in the 
0.80s and 0.90s for subscales of the CBCL/6-18 and ranged from 0.91 to 0.95 for 
Total Competence, Total Adaptive Functioning, and Total Problems (Achenbach & 
Rescorla, 2001). "Percentiles and normalized T scores are based on national proba- 
bility samples of children who had not received mental health, substance abuse, or 
special education services for major behavioral, emotional, or developmental prob- 
lems in the preceding 12 months" (Achenbach & Rescorla, 2001, p. 80) (see Figures 
7.7 & 7.8 for sample profile forms). The ASEBA national normative sample (n = 
9,052) included children from 40 states and the District of Columbia. Clinicians 
may find that routine use of the ASEBA forms for intake, screening, and evaluations
gleaned from parent, teacher, and self-reports provides a broad picture of the client
and can serve as a starting point, or springboard, for discussing pertinent issues in the 
clinical interview (Achenbach & Rescorla, 2001). Watson (2006) reported that it is 
a psychometrically sound instrument but has some weaknesses, especially concern- 
ing the scales for younger children. In addition, the directions and manuals are im- 
proved over the original versions. 
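
To illustrate how a normalized T score relates to a percentile rank, the following minimal sketch assumes the usual convention that T scores have a mean of 50 and a standard deviation of 10 and uses a normal curve to move between the two metrics. The function names are hypothetical, and the conversion is only an approximation of the idea; actual ASEBA scores come from the norm tables and profile forms in the manual, not from this formula.

```python
from statistics import NormalDist

# Assumed convention: T scores are scaled to mean 50, standard deviation 10.
T_DISTRIBUTION = NormalDist(mu=50, sigma=10)


def percentile_to_normalized_t(percentile):
    """Map a percentile rank (strictly between 0 and 100) to a normalized T score."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    return T_DISTRIBUTION.inv_cdf(percentile / 100)


def t_to_percentile(t_score):
    """Map a T score back to the percentile rank implied by a normal curve."""
    return T_DISTRIBUTION.cdf(t_score) * 100


if __name__ == "__main__":
    for pct in (50, 84, 93, 98):
        print(pct, round(percentile_to_normalized_t(pct), 1))
```

Under these assumptions, a child at the 50th percentile receives a T score of 50, and higher percentiles map to progressively higher T scores.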

The ASEBA system is one of the best behavioral assessment systems currently 
available (Salvia & Ysseldyke, 2004) and can be a helpful adjunct to functional be- 
havioral analysis (FBA) (Gresham, Watson, & Skinner, 2001). While the CBCL, 
TRF, and YSR are the most frequently used components of ASEBA, additional com- 
ponents include the Direct Observation Form (DOF); a Young Adult Self-Report 
(YASR) for adults ages 18-30 years; a Young Adult Behavioral Checklist (parent
report); and a Semi-structured Clinical Interview for Children and Adolescents (SCICA)
for use with children ages 6-12 years. The ASEBA is a Level B instrument. 



Personality Inventory for Children-Second Edition (PIC-2)



The PIC-2 (Lachar & Gruber, 2001) is a multidimensional clinical measure of be- 
havioral, emotional, and cognitive status for children ages 3-16 years. It is a screen- 
ing instrument that is usually completed by the parent. The PIC-2 has 275 items in 
its standard format and contains 12 psychological scales with various subscales. The 
PIC-2 also contains an abbreviated behavioral summary of 96 items. The psycholog- 
ical scales include Cognitive Impairment, Impulsivity and Distractibility, 
Delinquency, Family Dysfunction, Reality Distortion, Somatic Concern, 
Psychological Discomfort, Social Withdrawal, Social Skills Deficits, as well as three 
Response Validity scales. Parents are asked to respond to the items with True or 
False answers. The standardization sample generally conformed to U.S. population
demographics with the exception of an overrepresentation of Whites and underrep- 
resentation of Hispanics. There was also an overrepresentation of biological parents 
and an underrepresentation of single parents (Erford & McKechnie, 2006). 

No overall composite score is derived, but there are three separate composite 
scale scores: Externalization-Composite, Internalization-Composite, and Social 
Adjustment Composite. Raw scores can be converted to T scores when the Student 
Behavior Survey, a profile form, is completed. Test-retest reliability coefficients 
ranged from r = 0.82 to 0.92 and internal consistency coefficients ranged from r = 
0.81 to 0.92 for the interpreted scales. Criterion validity studies were conducted but 
did not use other commonly used instruments (Erford & McKechnie, 2006). 
However, because this new version of the PIC-2 is a major revision of the original, 
clinicians should be cautious in making diagnostic decisions using the PIC-2 until 
further research and diagnostic validity studies have been conducted. The PIC-2's 
primary benefit continues to be the assessment of parental perceptions of childhood 
behavioral and clinical difficulties. 

Devereux Scales of Mental Disorders (DSMD)

The DSMD (Naglieri, LeBuffe, & Pfeiffer, 1996) is used to assess behaviors related
to psychopathology. It can be administered to individuals as well as to groups of
children ages 5-18 years in about 15 minutes. There are two forms of the DSMD, 
the child form and the adolescent form, and each can be rated by parents, teachers, 
and other appropriate professionals. There are 110 items on this inventory, which 
measures nine constructs, including Conduct, Attention-Delinquency, Anxiety, 
Depression, Autism, Acute Problems, Internalizing Composite, Externalizing 
Composite, and the Critical Pathology Composite. Responses are based on a 5-point 
scale ranging from Never to Very Frequently. Raw scores can be converted into T 
scores and percentile ranks. Standardization samples generally conformed to U.S. 
population demographics for both children and adolescents (Cooper, 2001). 

Alpha coefficients were reported at about r = 0.90 or higher, and test-retest re- 
liability coefficients were in the 0.80s and 0.90s. Interrater reliability coefficients be- 
tween parents and teachers were in the 0.40s and 0.50s. This is not surprising given 
that teachers and parents observe the child's behavior in two distinct ecological con- 
texts (i.e., school and home). Validity studies yielded adequate results on all levels, 
with items showing a strong congruence to DSM-IV criteria for the specific behavior
disorders in question (Peterson, 2001). There is some dispute about the composition
of the reliability and validity study samples and about whether the types of
participants included might have inflated the coefficients. Even so, there
is substantial normative data for the DSMD, and it has emerged as a good assessment
for certain antisocial behaviors in children and adolescents.
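
To make the interrater figures more concrete, the following sketch computes a Pearson correlation between parent and teacher ratings of the same children. The ratings are invented purely for illustration and the function name is hypothetical; the coefficients reported for the DSMD come from the publisher's own samples, not from a calculation like this one.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from each rater's mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one rater's scores do not vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical raw scores for six children, rated at home and at school.
parent_ratings = [12, 30, 25, 8, 40, 18]
teacher_ratings = [20, 22, 35, 10, 28, 30]

interrater_r = pearson_r(parent_ratings, teacher_ratings)

if __name__ == "__main__":
    print(round(interrater_r, 2))
```

A moderate coefficient of this kind indicates that the two informants rank the children similarly but not identically, which is consistent with their observing behavior in different settings.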

Children's Depression Inventory (CDI)

The CDI (Kovacs, 1992) is a self-report inventory used to assess children's depression.
Parent and teacher versions are also available. It can be administered individually
as well as to small groups of children ages 8-17 years in about 10 to 15
minutes. This assessment contains 27 items that cover all nine symptoms for a major
depressive syndrome in children as presented in the DSM-III-R. Children's responses
are based on a 3-point scale, from 0 to 2, with 2 being the most severe (Kavan,
1992). Limited normative data are available for the CDI because it was not nationally
standardized. The standardization sample was inadequately small and geographically
restricted (Knoff, 1992). Scoring is simple and convenient using the
QuickScore™ forms.

Reliability and validity data are also questionable. Although coefficient alphas 
from two different samples reported in the manual were consistent at r = 0.86 and 
0.87, respectively, many empirical studies yielded inconsistent results. Item-total 
score coefficients ranged from r = 0.08 to 0.62. A one-month test-retest reliability 
coefficient was r = 0.43, while a nine-week test-retest reliability coefficient was r = 
0.84. Regarding validity, the CDI had adequate correlations with the Revised 
Children's Manifest Anxiety Scale but yielded low correlations with the Coopersmith
Self-Esteem Inventory (Kavan, 1992). The CDI has demonstrated good discrimination
between clinical and nonclinical groups (Carey, Gresham, Ruggerio, Faulstich, &
Engart, 1987; Hodges, 1990). It is obvious that more empirical data need to be col- 
lected with regard to the CDI and it should not be used as a diagnostic tool 
(Craighead, Curry, & Ilardi, 1995; Fristad, Emery, & Beck, 1997; Knoff, 1992). 
Admittedly, the construct of depression is more difficult to accurately assess in chil- 
dren than adults because depressive symptoms are more transient in younger clients. 
In spite of this, the CDI is easy to administer and score and may be helpful during 
initial clinical assessment (Kavan, 1992). It is, perhaps, the most commonly used 
screening tool for childhood depression (Craighead et al., 1995; Fristad et al., 1997). 
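
Because coefficient alpha figures prominently in evaluations such as this one, a brief sketch of how it is computed may be helpful. The example assumes the standard formula, alpha = [k / (k - 1)] x [1 - (sum of item variances / variance of total scores)], where k is the number of items. The response matrix is invented (four items rated 0 to 2 by five respondents) and the function name is hypothetical; it is not CDI data.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Cronbach's coefficient alpha.

    item_scores: list of rows, one row per respondent, one column per item.
    """
    if len(item_scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(item_scores[0])
    if k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("every respondent needs the same number (2+) of items")
    # Variance of each item across respondents.
    item_columns = list(zip(*item_scores))
    item_variance_sum = sum(variance(column) for column in item_columns)
    # Variance of respondents' total scores.
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: 5 respondents x 4 items, each rated 0, 1, or 2.
sample_responses = [
    [0, 1, 0, 1],
    [2, 2, 1, 2],
    [1, 1, 1, 0],
    [2, 1, 2, 2],
    [0, 0, 0, 1],
]

if __name__ == "__main__":
    print(round(coefficient_alpha(sample_responses), 2))
```

Alpha rises as items covary more strongly relative to their individual variances, which is why a scale with several weakly related items (low item-total coefficients) can still show a respectable alpha when it has many items.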



Reynolds Adolescent Depression Scale-Second Edition (RADS-2) 



The Reynolds Adolescent Depression Scale-Second Edition (RADS-2) (Reynolds,
2002) is a 30-item self-report inventory for adolescents ages 11-20 years and is
designed to assess symptoms associated with depression. Items measure four subscales:
Dysphoric Mood (DM, 8 items); Anhedonia/Negative Affect (AN, 7 items); 
Negative Self-Evaluation (NS, 8 items); and Somatic Complaints (SC, 7 items). 
Sample items include "I feel lonely," "I feel like running away," and "I feel like noth- 
ing I do helps anymore." The items are scored on a 4-point Likert scale (Almost 
Never, Hardly Ever, Sometimes, or Most of the Time) (Blair, 2005). The RADS-2 is 
a Level B test and takes about 10 minutes to administer, score, and interpret. The 
normative restandardization sample (n = 3,300) for the RADS-2 consisted of an
equal number of adolescent males and females living in the United States and 
Canada. Compared to the 2000 U.S. Census, this sample was considered ethnically 
diverse and heterogeneous in socioeconomic composition (Reynolds, 2002). 

Raw scores are summed to derive a Depression Total score. The Depression 
Total and four subscales can be converted to a T score or percentile rank according 
to gender, age group, and gender by age group norms. More than 20 years of research 
supports the psychometric qualities of the RADS-2, and the new version is found to 
continue the tradition of a sound instrument (Blair, 2005). Internal consistency of 
the Depression Total score was r = 0.92 (Reynolds, 2002). Test-retest reliability (two
weeks) was r = 0.86 for the Depression Total score (Reynolds, 2002). Criterion-re- 
lated validity studies resulted in moderate to high correlations with other measures 
of depression and indicated the RADS-2 is best used as a screening level test for de- 
pression (Erford, 2006). Overall, "the RADS-2 is cost- and time-efficient, easy to use, 
and a reliable and valid screening instrument for adolescents with symptoms of de- 
pression" (Erford, 2006, p. 58). 

The RADS-2 is one of the few depression screening tests validated for use with
adolescents (Brooks & Kutcher, 2001), and its recommended clinical cutoff of T = 
61+ has been shown to identify clinically severe symptoms of depression on the 
Hamilton Depression Rating Scale (HDRS) (Reynolds & Mazza, 1998). The RADS-2 
is a screening test and should not be used to supplant use of a clinical interview 
(Davis, 1990) and is not a substitute for an interview of suicidal ideation (Reynolds, 
2002). Volpe and DuPaul (2001) also indicated the RADS-2 shows some usefulness 
in monitoring the effects of treatment and as one component in a comprehensive di- 
agnostic approach for depression. 

Symptom Checklist-90-Revised (SCL-90-R) 

The SCL-90-R (Derogatis, 1992) portrays patterns of psychological symptoms in 
patients and nonpatients. The SCL-90-R can be administered to groups or indi- 
viduals ages 13 years to adult in about 15 to 20 minutes. Symptoms are measured 
on 12 constructs: Somatization, Obsessive-Compulsive, Interpersonal Sensitivity, 
Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation, Psychoticism, 
Global Severity Index, Positive Symptom Distress Index, and Positive Symptom 
Total. There are a total of 90 items on this inventory. Clients are asked to rate their
level of discomfort with a particular problem from 0 (Not at all) to 4 (Extremely). Norms
were constructed on several standardization samples, including psychiatric out- 
patients, psychiatric inpatients, adult nonpatients, and adolescent nonpatients 
(Pauker, 1985). 

Pauker (1985) and Payne (1985) asserted that the original SCL-90 manual re- 
ported satisfactory results for internal consistency (r = 0.77-0.90) and test-retest re- 
liability coefficients (r = 0.78-0.90, one week apart). The few validity studies con- 
ducted portrayed comparable levels to other self-report inventories; however, more 
research is needed in this area. Other criticisms included a lack of clarity in the man- 
ual and the possible limitations inherent in requiring an 8th-grade reading level 
when using an inventory with adolescents ages 13 years and older. Strengths of the 
SCL-90-R are the quick administration and scoring procedures as well as its straight- 
forward scoring criteria. 

Beck Depression Inventory-Second Edition (BDI-II)

The Beck Depression Inventory-Second Edition (BDI-II) (Beck et al., 1996) is a 21-item
self-report inventory used to assess the severity of depression of individuals ages
13 years or older. Each item is formatted on a 4-point scale (i.e., ranging from 0 to
3 in terms of severity) and indicates a particular depressive symptom occurring during
the past two weeks. The BDI-II has gone through several revisions since its original
publication. The last major revision changed the instrument from the BDI-IA
to the BDI-II in 1996 to correspond with the criteria for depressive disorders in the 
Diagnostic and Statistical Manual of Mental Disorders — Fourth Edition (DSM-IV) 
(American Psychiatric Association, 1994). On revision of the BDI-II, four items (i.e., 
Weight Loss, Body Image Change, Somatic Preoccupation, Work Difficulty) were 
replaced with four new items (i.e., Agitation, Worthlessness, Concentration 
Difficulty, Loss of Energy). In addition, two items (i.e., Changes in Sleeping Pattern 
and Changes in Appetite) were revised by creating seven optional scales representing 
differences between increases and decreases of severity. Paper-and-pencil record 
forms, scannable record forms, and Spanish record forms are available. Current cost 
information and online ordering information are available on the website of Harcourt Assessment,
Inc. (2004b). The BDI-II takes 5 to 10 minutes to complete. Although the BDI-II 
is self-administered, a trained examiner can read the questions aloud if needed. 
Administration and interpretation qualification is Level C (i.e., requires doctoral- 
level training in psychology, education, counseling, or related fields, or licensure or 
certification as a professional counselor or other psychological professional). Hand 
scoring and computer scoring are available. Summing the responses to all items yields
a total score (maximum is 63). A total score of 14 or above indicates the possibility 
of depression. Although the responses for items 2a and 2b (i.e., Changes in Sleeping 
Pattern and Changes in Appetite) are not considered in calculating a total score, they 
should be considered in the diagnosis of depression. 
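
The scoring arithmetic described above can be sketched in a few lines. The example assumes 21 item ratings of 0 to 3 each, a maximum total of 63, and the threshold of 14 mentioned in the text; the function and constant names are hypothetical. It illustrates the summation only and is not a substitute for the publisher's scoring materials or for clinical judgment.

```python
BDI_II_ITEM_COUNT = 21     # number of scored items, per the text
MAX_ITEM_RATING = 3        # each item is rated from 0 to 3
SCREENING_THRESHOLD = 14   # totals at or above this suggest possible depression


def bdi_ii_total(ratings):
    """Sum 21 item ratings (each 0-3) into a total score between 0 and 63."""
    if len(ratings) != BDI_II_ITEM_COUNT:
        raise ValueError(
            f"expected {BDI_II_ITEM_COUNT} ratings, got {len(ratings)}"
        )
    for rating in ratings:
        if rating not in range(MAX_ITEM_RATING + 1):
            raise ValueError(f"rating out of range: {rating!r}")
    return sum(ratings)


def suggests_possible_depression(total):
    """Flag totals that reach the screening threshold; this is not a diagnosis."""
    return total >= SCREENING_THRESHOLD


if __name__ == "__main__":
    example = [1] * 10 + [0] * 11   # hypothetical client responses
    total = bdi_ii_total(example)
    print(total, suggests_possible_depression(total))
```

Validating the number and range of ratings before summing guards against the most common hand-scoring errors, such as skipped items or miskeyed responses.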

The normative sample for the BDI-II consisted of 500 outpatient clients from 
four different psychiatric clinics in urban and suburban areas in the United States, 
and 120 students from one college in Canada (Farmer, 2001). Scores on the BDI-II 
have been shown to be reliable (e.g., internal consistency, test-retest reliability) and valid
(e.g., content validity, construct validity, factorial validity) (Beck et al., 1996). 



Beck Anxiety Inventory (BAI) 



The Beck Anxiety Inventory (BAI) (Beck et al., 1988; Beck & Steer, 1993) is a 21- 
item self-report instrument used to assess the severity of anxiety of individuals ages 
17 years or older. Each item on the BAI is formatted on a 4-point scale (i.e., ranging
from 0 [Not at All] to 3 [Severely; "I could barely stand it"]) and indicates symptoms
related to anxiety during the past week. Paper-and-pencil record forms, scannable
record forms, and Spanish record forms are available. Current cost information and 
online ordering information are available on the website of Harcourt Assessment, 
Inc. (2004b). 

Like the other Beck instruments discussed, the BAI is self-administered, but a 
trained examiner can administer it verbally. The BAI takes 5 to 10 minutes to com- 
plete. The administration and interpretation qualifications for this instrument are 
also Level C. Hand scoring and computer scoring are available. Summing all re- 
sponses yields a total score with a maximum of 63. 

The first normative sample for the BAI consisted of 810 outpatient clients with 
affective and anxiety disorders. Subsequent studies were conducted to determine the
reliability and validity of scores (for detailed development procedures, see Beck et al.,
1988). Beck et al. demonstrated high internal consistency and sufficient test-retest 
reliability for scores on the BAI. The test authors also demonstrated convergent va- 
lidity and discriminant validity. For example, the BAI was moderately correlated 
with the Hamilton Anxiety Rating Scale — Revised (HARS-R) and the Cognition 
Checklist Anxiety subscale (CCL-A). Beck et al. (1988) also demonstrated factorial 
validity, as the BAI consisted of two factors: (1) somatic symptoms and (2) subjective
anxiety and panic symptoms. However, Osman, Barrios, Aukes, Osman, and
Markway (1993) found four factors in the BAI: (1) Subjective, (2)
Neurophysiological, (3) Autonomic, and (4) Panic.

Overall, the ability to discriminate between anxiety and depression
(i.e., discriminant validity) is one of the most useful aspects of the BAI (Beck
et al., 1988). Thus professional counselors may find this tool useful for clarifying the
presenting problem and formulating effective treatment plans. 



Beck Scale for Suicide Ideation (BSSI) 



The Beck Scale for Suicide Ideation (BSSI) (Beck, Kovacs, & Weissman, 1979) is a
21-item self-report inventory used to assess the severity of suicide ideation of
individuals ages 17 years or older. Suicide ideators are defined as "individuals who
currently have plans and wishes to commit suicide but have not made any recent overt
suicide attempt" (Beck, Kovacs, & Weissman, 1979, p. 344). Beck et al. (1979) first 
developed a 19-item Scale for Suicide Ideation (SSI) to assess suicide intention. An 
examiner completes the SSI by asking each item in a semi-structured interview for- 
mat and recording the client's responses. The SSI was revised into the BSSI in 1991 
through the creation of a self-report format. Paper-and-pencil record forms, 
scannable record forms, and Spanish record forms are now available. Current cost 
information and online ordering information are available on the website of 
Harcourt Assessment, Inc. (2004b). 

The BSSI consists of three parts: (1) Items 1 through 5 (i.e., attitudes toward
living and dying); (2) Items 6 through 19 (i.e., suicide ideation and anticipated
reaction of the ideation); and (3) Items 20 and 21 (i.e., the number of past suicide
attempts and the seriousness of intention in the last suicide attempt) (Stewart,
1998). Each item is formatted on a 3-point scale ranging from 0 to 2 in terms of
severity. The BSSI takes 5 to 10 minutes to complete. The BSSI is self-administered, 
but a trained examiner can read the items aloud if necessary. Administration and in- 
terpretation qualifications are Level C. Hand scoring and computer scoring are avail- 
able. A total score is calculated, with a maximum of 42. However, because the test's 
authors do not provide a cutoff score, an examiner should cautiously analyze a total 
score and client responses to each item (called "critical item analysis") to examine 
suicide risk (Stewart, 1998). 

The normative sample for the BSSI consisted of 178 adults (126 inpatient and 
52 outpatient clients) who were receiving psychiatric services and were identified as 
suicide ideators. Although scores have been reliable only for the first 19 items, the
BSSI has high internal consistency and moderate test-retest reliability (Stewart,
1998). Also, the BSSI has good construct validity. For example, the BSSI was significantly
correlated with the SSI (Stewart, 1998). Although the normative sample
lacked adolescents, Steer, Kumar, and Beck (1993) demonstrated in their study using 
adolescent inpatients that the BSSI was positively correlated with a history of a past 
suicide attempt, the Beck Depression Inventory (BDI) (Beck et al., 1996), the Beck 
Hopelessness Scale (BHS) (Beck & Steer, 1993), and the Beck Anxiety Inventory (BAI)
(Beck, Epstein, Brown, & Steer, 1988). 

Professional counselors should consider using the BSSI to assess the suicide risk
of individuals who obtain a high score on the BHS, given that hopelessness, rather
than depression or anxiety, may be a significant suicide indicator for adolescents
and adults (Beck et al., 1979; Steer et al., 1993).



Substance Abuse Subtle Screening Inventory-3 (SASSI-3)



The Substance Abuse Subtle Screening Inventory-3 (SASSI-3) (Miller & Lazowski,
1999) is a self-report inventory used to assess the probability of substance dependence
(e.g., alcohol or other drugs of abuse) of individuals ages 18 years or older. An
adolescent version of the SASSI is also available. Paper-and-pencil record forms, 
computer versions, audiotape versions for individuals with reading problems, and 
the Spanish SASSI are available. Information on current cost and other SASSI prod- 
ucts and online ordering information are available on the website of the SASSI 
Institute (2004). 

The SASSI-3 consists of two parts, each of which is printed on a separate side of 
one test form. One part contains 67 items consisting of true-false questions regard- 
ing substance dependence. The other part contains 26 items (12 for alcohol use and 
14 for drug use) formatted on a Likert scale ranging from 0 (Never) to 4
(Repeatedly). For each of the Likert items, the client is asked to respond considering 
one of the following four time periods: entire life, past 6 months, 6 months before a 
critical event, or 6 months after a critical event. According to Miller (1997), the au- 
thor of the SASSI-3, there were three main changes from the SASSI-2 that increased 
accuracy: (1) A new scale, Symptoms (SYM), was created, which provides informa- 
tion regarding the client's substance use and the environmental impact of substance 
use on the client; (2) two items were eliminated because of reported discomfort by 
some users; and (3) the four time periods mentioned above were added to the Likert 
scale format. The SASSI-3 consists of 10 subscales and takes approximately 15 min- 
utes to administer (for details of subscales, see Juhnke et al., 2006; Pittenger, 2003). 
The subscales include Face Valid Alcohol, Face Valid Other Drug, Symptoms, 
Obvious Attributes, Subtle Attributes, Defensiveness, Supplemental Addiction 
Measure, Family versus Control Subjects, Correctional, and Random Answering 
Pattern. Administration and interpretation are Level B (master's level in psychology, 
counseling, or related fields, with certification or professional training in psycholog- 
ical assessment). An examiner scores the SASSI-3 using a scoring key and obtains a 
profile by plotting a raw score for each subscale; raw scores are converted into per- 
centile ranks and T scores (M = 50; SD = 10). Interpretation of the results is done 
according to decision rules provided in the test manual. 
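
The conversion of a raw score to a T score (M = 50; SD = 10) can be illustrated with a linear transformation. The sketch below is a generic example under the assumption that the norm group's mean and standard deviation for a subscale are known; the function name and the numbers are hypothetical, and the SASSI-3 profile sheet and manual, not this formula, govern actual scoring and the decision rules.

```python
T_MEAN = 50   # mean of the T-score metric
T_SD = 10     # standard deviation of the T-score metric


def linear_t_score(raw_score, norm_mean, norm_sd):
    """Convert a raw score to a linear T score using norm-group statistics."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # z expresses the raw score in standard deviation units from the norm mean.
    z = (raw_score - norm_mean) / norm_sd
    return T_MEAN + T_SD * z


if __name__ == "__main__":
    # Hypothetical subscale: norm mean of 8 and norm SD of 4.
    print(linear_t_score(12, norm_mean=8, norm_sd=4))
```

A raw score one standard deviation above the norm group's mean therefore corresponds to a T score of 60, which is what allows subscales with different raw-score ranges to be plotted on a common profile.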



Some researchers investigated reliability and validity of SASSI-3 scores.
Lazowski, Miller, Boye, and Miller (1998) found high test-retest reliability, internal
consistency, and criterion-related validity. However, there are some mixed results
when using the SASSI-3 with special populations (e.g., clients who have a traumatic
brain injury). For example, Arenth, Bogner, Corrigan, and Schmidt (2001) reported
lower accuracy, sensitivity, and specificity in their study investigating the utility of 
the SASSI-3 to diagnose chemical dependence for individuals with brain injury. 
However, Arenth et al. concluded that the SASSI-3 was promising for individuals 
with brain injury, given that substance abuse strongly affects brain injury. Finally, the 
customer support from the SASSI Institute is excellent, often providing free profile 
consultations using an 800 number. 



Eating Disorder Inventory-3 (EDI-3)



The Eating Disorder Inventory-3 (EDI-3) (Garner, 2004) is an effective self-report
inventory for assessing the attitudes, behaviors, and psychological traits related to 
Anorexia Nervosa and Bulimia Nervosa for individuals ages 12 years or older. The 
EDI-3 was revised from the original EDI published in 1984 and the EDI-2 (pub- 
lished in 1991). Anorexia Nervosa contains symptoms such as refusal to maintain a 
minimally normal body weight and fear of gaining weight, whereas Bulimia Nervosa 
contains symptoms such as binge eating, self-induced vomiting, misuse of medica- 
tions (e.g., diuretics, laxatives), and excessive exercise (APA, 2000). Paper-and-pen- 
cil record forms and computer versions are available. Current cost information and 
online ordering information are available on the website of Psychological Assessment 
Resources, Inc. (2004b). 

The EDI-3 contains 91 items, broken down into 12 scales (3 eating-disorder- 
specific scales and 9 general psychological scales that are highly relevant to eating dis- 
orders), each of which is formatted on a 4-point scale that helps to improve the reli- 
ability of some of the scales and provides a wider range of scores. In addition, the 
results yield six composite scores (Eating Disorder Risk, Ineffectiveness, 
Interpersonal Problems, Affective Problems, Overcontrol, and General Psychological 
Maladjustment) that are helpful when creating treatment plans, designing interventions, 
and monitoring treatment. The EDI-3 takes approximately 20 minutes to complete. 
Administration and interpretation qualification is Level A (4-year-college or univer- 
sity level in psychology, counseling, or related fields with certification or professional 
training in psychological assessment). Each subscale score is obtained by summing all 
the scores for the subscale. Plotting each subscale score on a profile and comparing 
the profile to norms yields the potential severity of an eating disorder. Norms are 
available for (a) patients with Anorexia Nervosa, Restricting Type; (b) patients with 
Anorexia Nervosa, Binge-Eating/Purging Type; (c) patients with Bulimia Nervosa 
only; and (d) Eating Disorders Not Otherwise Specified (Psychological Assessment 
Resources, Inc., 2004b). 

Scores on the EDI-3 have been found to be reliable and valid. According to 
the publisher (Psychological Assessment Resources, 2004b), moderate to high composite 
reliabilities were reported for all the subscales except one (0.80s to 0.90s) 




and test-retest reliability coefficients in the 0.90s were reported for most of the sub- 
scales. Psychological Assessment Resources, Inc., reports that a relationship exists 
between the EDI-3 and a wide variety of external instruments. With this new re- 
vision, a Referral Form, which is a shortened form of the entire inventory, is in- 
cluded. It is especially useful when trying to identify students who may be at risk 
for eating disorders. 



SUMMARY/CONCLUSION 



Clinical assessment and proper diagnosis of mental disorders rely heavily on the 
professional counselor's knowledge of the DSM-IV-TR multiaxial diagnostic system 
and on the use of effective and efficient interviewing and clinical testing procedures. 
This chapter has provided a wealth of introductory material to orient the professional 
counselor to each of these essential dimensions. 

Professional counselors generally make clinical decisions using either a statistical 
model (based predominately on test scores) or a clinical judgment model (based 
predominately on counselor experience). A great deal of helpful information can be 
obtained from a clinical interview. Structured interviews ask a standard set of questions 
and allow little variation from the standardized protocol. Such procedures often 
result in similar conclusions by different counselors. Unstructured interviews have 
no preset list of questions and allow maximum flexibility for counselor questioning 
and follow-up. But this flexibility means that different professional counselors using 
unstructured interviews frequently develop different conclusions. As a compromise, 
semi-structured interviews use a standardized set of questions but allow the professional 
counselor flexibility to pursue important information that falls outside of the 
more structured format. Specialized types of interviews discussed in the chapter include 
the intake interview and mental status exam. 

Information about a client usually stems from four sources, which can 
be recalled using the acronym LOST: life outcome data, observer ratings, self-report 
ratings, and test data. The chapter also explored general procedures for development 
of clinical and personality tests. Some tests are based on theories of personality 
or clinical pathology, while others use empirical procedures such as factor 
analysis or empirical-criterion keying. This chapter has provided an overview of 
numerous clinical tests to familiarize the reader with instruments commonly used 
by professional counselors. 



KEY TERMS 



clinical assessment 
clinical judgment 
DSM-IV-TR 
empirical-criterion keying 
Global Assessment of Functioning (GAF) 
hyperactivity-impulsivity 
hypothesis confirmation bias 
inattention 
intake interview 
life outcomes 
mental disorder 
multiaxial classification system 
observer rating 
self-fulfilling prophecy 
self-report ratings 
semi-structured interview 
statistical decision-making model 
statistical models 
structured interview 
test data 
True Response Inconsistency (TRIN) scale 
unstructured interview 
Variable Response Inconsistency (VRIN) scale 















8 



Personality Assessment 

by Bradley T. Erford, Kathleen McNinch, and Carol Salisbury 



This chapter addresses the basic knowledge and skills required for personality 
assessment. Attention is given to trait approaches, especially the five-factor 
model, and to personality instruments based on trait approaches. In addition, 
an introduction to projective assessment is provided. Commonly used projective as- 
sessments are discussed from a classification framework, including association, pic- 
ture-story construction, verbal completion, choice arrangement, and production- 
expression techniques. 



WHAT IS PERSONALITY? 



Some people are described as having so much personality that they "ooze" with it, 
others as having "no personality at all." Still others are diagnosed with a "personality 
disorder." So what is this thing that appears to be so important to people that the 
services of professional counselors are sought to help assess, understand, and some- 
times even restructure it? You may not find it hard to imagine that experts do not 
agree on a definition of personality, what comprises it, or how best to measure it. 
Some believe personality is an all-encompassing construct that accounts for all of an 
individual's thoughts, feelings, and behaviors. Others view personality with a much 
narrower focus. The unfortunate (or fortunate) thing about science is that in order 
to study something, one needs to be able to define it. Since few agree on any one 
definition, the authors have chosen one that makes sense and which can serve as a 
springboard to a robust discussion on personality and its assessment. 

Piedmont (1998) defined personality as an intrinsic, adaptive organizational 
structure that is consistent across situations and stable over time. Note the four 
essential facets of this definition. First, personality is intrinsic, meaning located within 
the individual, not imposed on the individual by the environment. Second, personality 
is an adaptive, organized structure that allows the individual to adjust (or not adjust) 
to environmental, contextual demands. These demands are basically competing 
needs and desires that may come from inside or outside of the individual. Third, 
personality is consistent across situations; that is, one's personal goals and world view 
remain fairly constant from one situation to the next, even though one's behaviors or 
thoughts can be adapted in different ways. Finally, personality is stable over time. This 
should not be understood to mean that personality does not change over time, for it 
certainly does. But there is some lingering connection or thread that ties together 
one's functioning during childhood, adolescence, and adulthood — consistent 
themes, needs, and motivations. 

Importantly, personality should not be viewed as being good or bad, because its 
basic purpose is to help the individual adapt and survive in a given context. 
Personality is a dynamic structure that is shaped and contoured over time to allow 
the individual to adapt to environmental demands and contexts in such a way that 
individual needs, desires, and motivations can be expressed. Just as in physical devel- 
opment, one is born with an immature personality that grows over time and is in- 
fluenced by culture and by environmental events. Personality helps one to perceive 
and interpret both the internal and external world and to select goals to pursue. 
Importantly, while personality does change over time, most of the change occurs 
during childhood, adolescence, and young adulthood. Indeed, there is overwhelm- 
ing evidence that one's personality is essentially stable by about the age of 30 years 
(Piedmont, 2006), barring major transformative events (e.g., religious conversion, 
significant trauma, intensive psychotherapy). 



THE PURPOSE OF PERSONALITY ASSESSMENT 



In general terms, the purpose of personality assessment is to help the professional 
counselor and client understand the client's various attitudes, characteristics, inter- 
personal needs, and intrinsic motivations in order to gain insight into current events, 
activities, and conflicts and also to generalize this understanding to new situations 
clients will encounter on their own, both now and in the future. In more specific 
terms, personality assessment has the same purposes as most other types of assess- 
ment, as discussed in Chapter 1: screening, diagnosis, placement, treatment plan- 
ning, and outcomes evaluation. While diagnosis may seem out of place in the con- 
text of personality as defined above, one should bear in mind the existence of 
personality disorders. Personality assessment can play a crucial role in identifying in- 
dividuals with some personality disorders. Professional counselors must be cognizant 
of which purpose is being pursued, because of all the types of assessment instruments 
available to professional counselors, structured and unstructured personality instru- 
ments have the widest variability in terms of psychometric quality and usefulness; 
that is, some are extremely well developed and well studied, while others lack virtually 
any empirical support or rigor. As a result, experienced clinicians approach the 
task of personality assessment with great seriousness and caution. 




The two most common approaches to personality assessment are the (struc- 
tured) trait approach and the (unstructured) projective approach. The discussion of 
each approach and commonly used tests based on each approach make up the re- 
mainder of this chapter. 



TRAIT APPROACHES TO PERSONALITY ASSESSMENT 



Most personality tests measure traits or states (many measure both, of course), and 
it is sometimes helpful to consider traits and states as two ends of the same contin- 
uum. Traits are enduring, statistically derived dimensions used to explain personal- 
ity characteristics (e.g., introversion, agreeableness), while states are generally more 
transient or situation-dependent facets of personal adjustment (e.g., anxiety, self- 
confidence). Some measures, such as the State-Trait Anxiety Inventory for Children 
(Spielberger, 1973), aim to differentiate between the presence and importance of 
these two ends of the continuum. Client states are important for professional coun- 
selors to understand. They are often relevant to clinical diagnosis and often serve as 
the impetus for clients to actually seek counseling services. For example, many clients 
endure a life of anxiety or sadness but will only seek treatment when they experience 
a panic attack or major depressive episode. Acute anxious or depressive reactions are 
(generally) short-lived occurrences that result from situational events and/or internal 
physiology, not long-term conditions that stem from personality characteristics. 
Thus states are important, but because of their unpredictability and transience, they 
provide little help to clients and professional counselors who seek to understand and 
predict a client's likely pattern of cognitive, affective, and behavioral functioning. 
Thus most structured personality assessment deals with the identification of the 
more enduring personality traits to understand and predict human behavior. 

Unfortunately, social scientists who study traits disagree on a standard defini- 
tion to about the same degree that they disagree on a definition of personality. 
Personality traits are certainly not physical structures, although pseudoscientific ap- 
proaches during the past several centuries have espoused just that. For example, phys- 
iognomy is the study of personality through determining a person's physical charac- 
teristics. Thus the shape of one's nose may be used to determine personality 
characteristics: A pointed nose resembling a dog's snout would represent tenacity and 
faithfulness, and a large, rounded nose resembling a pig's snout would represent 
slovenly, piggish characteristics (Sax, 1997). Phrenology was a 19th-century system 
for studying the physical characteristics of the skull (i.e., protrusions or depressions), 
which were believed connected to functions within the brain. This theory espoused 
that the brain center responsible for a specific ability would "grow out" (i.e., pro- 
trude) when highly developed, or "sink in" (i.e., depress) when underdeveloped. 
Thus phrenologists of that era were quite confident that they could identify abilities 
such as concentration and secretiveness, as well as several dozen other characteristics. 
Additional pseudoscientific approaches include numerology, astrology, and palm- 
istry. None has received support from the scientific community. 

While the study of traits has a long history of pseudoscientific attempts, traits have 
been studied scientifically for only a little more than half a century. In the historical 






evolution of our understanding of traits, Gordon Allport (1937) attempted to un- 
derstand traits as rational dimensions that underlie the thousands of words people 
use to describe each other. In one study, Allport & Odbert (1936) searched the dic- 
tionary for descriptive words and identified more than 18,000 words that could de- 
scribe human personality characteristics. They next whittled that list down to about 
4,500 by eliminating synonyms and by retaining descriptors of stable characteristics 
(remember, traits are enduring). But 4,500 is still a huge number of personality 
traits. The advent of new statistical techniques (i.e., factor analysis) and high-speed 
computers spurred further attempts to identify and understand the number of di- 
mensions, or component traits, that underlie personality. Today, there are hundreds 
of personality tests that purport to measure one or more personality traits. But until 
recently, there was little agreement over the number of factors or traits that explained 
human personality. For example, Cattell, Cattell, and Cattell (1993) developed the 
16 Personality Factors inventory (16PF). Others have determined that more than 100 
personality traits may exist. 

However, recent well-designed research and instrumentation by Costa & 
McCrae (1990, 1992) have helped to integrate much of the disparate research on 
personality traits conducted over the past half century into a model with substantial 
empirical support: The five-factor model (FFM). Costa & McCrae (1990, p. 23) de- 
fined traits as "dimensions of individual differences in tendencies to show consistent 
patterns of thoughts, feelings, and actions." There are two key parts to this defini- 
tion. First, traits are dimensions, which are empirically verifiable concepts organiz- 
ing human behavior along a continuum. Second, individuals differ or vary accord- 
ing to how much or how little of a particular trait they may possess. It is these 
differences in traits, then, that describe an individual's "personality." Costa and 
McCrae identified five primary traits along which individuals differ, not dozens or 
hundreds; just five: Neuroticism, Extraversion, Openness, Agreeableness, and 
Conscientiousness. For example, the trait of Extraversion involves the intensity of 
interpersonal relationships. An individual can be described as introverted (i.e., shy, 
aloof, withdrawn) on one end of the continuum, extraverted (i.e., sociable, outgoing, 
adventurous, enthusiastic) on the other end of the continuum, or somewhere in be- 
tween (i.e., ambiverted). Most importantly, the amount of the trait an individual 
possesses can be measured and compared to some norm group to determine whether 
the individual displays an average, significantly higher, or significantly lower amount 
of the trait than other individuals with like characteristics (e.g., age, sex). The 
amount of a trait a client exhibits helps professional counselors understand and pre- 
dict client actions now and in the future. 
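
The norm-group comparison described above can be sketched in a few lines. The norm table, its means and standard deviations, and the one-standard-deviation cutoff used here are hypothetical values chosen only to illustrate the logic; a real instrument's manual supplies the actual norms and interpretive ranges.

```python
# Hypothetical norm table: (age band, sex) -> (mean, SD) of raw Extraversion scores.
NORMS = {
    ("21-30", "female"): (112.0, 18.0),
    ("21-30", "male"): (108.0, 19.0),
}


def compare_to_norms(raw, age_band, sex, cutoff=1.0):
    """Return the z score and a label relative to people with like characteristics.

    The cutoff (in SD units) is an assumption for illustration only.
    """
    mean, sd = NORMS[(age_band, sex)]
    z = (raw - mean) / sd
    if z >= cutoff:
        label = "higher than the norm group"
    elif z <= -cutoff:
        label = "lower than the norm group"
    else:
        label = "average"
    return z, label


z, label = compare_to_norms(139, "21-30", "female")
print(f"z = {z:+.2f}: {label}")
```

The same raw score can earn a different label when the norm group changes, which is why the choice of comparison group matters.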

Costa and McCrae and other researchers have accumulated substantial evidence 
that these factors can be found on most multifaceted personality inventories available 
today (see Piedmont, 2006). The FFM has emerged as a fairly comprehensive taxon- 
omy, useful in classifying and understanding personality traits. The FFM traits and 
facets are closely aligned with those of the Revised NEO Personality Inventory (NEO-PI-R) 
(Costa & McCrae, 1992), which will also be reviewed later in this chapter. 

Because traits are often described as existing on a continuum (e.g., introversion-extraversion, 
agreeable-disagreeable, conscientiousness-carelessness), some researchers 
and test developers have found it helpful to juxtapose these continua in order to 
categorize or label people according to some typology; for example, juxtaposing the 
Extraversion and Neuroticism traits results in four "types" of clients. A client who is 
high on both traits (i.e., high extraversion, high neuroticism) may be hot tempered, 
impulsive, or easily influenced. Someone who is low on both traits (i.e., low extraver- 
sion, low neuroticism) may be calm, impassive, and reliable. One who is high on ex- 
traversion and low on neuroticism may be easygoing, talkative, and optimistic. One 
who is low on extraversion and high on neuroticism may be pessimistic, sad, and 
sober. Note the consistent use of the phrase "may be," for these characteristics are cer- 
tainly not representative of all individuals of a given type under all circumstances. Still, 
research (and common sense) indicates that the more of a given trait one possesses, the 
more stable the categorization, and the greater the predictive validity. 
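
The four-way typology just described amounts to a simple two-by-two lookup. The sketch below assumes T scores as input and treats T = 50 as the dividing line between "high" and "low"; that cut point and the function name are illustrative assumptions, not rules from any test manual. The descriptors keep the "may be" wording because they are tendencies, not certainties.

```python
def type_label(extraversion_t, neuroticism_t, cut=50.0):
    """Map two trait scores onto one of four tentative 'types'.

    The cut point of T = 50 is an assumption made for illustration.
    """
    high_e = extraversion_t >= cut
    high_n = neuroticism_t >= cut
    descriptors = {
        (True, True): "may be hot tempered, impulsive, or easily influenced",
        (False, False): "may be calm, impassive, and reliable",
        (True, False): "may be easygoing, talkative, and optimistic",
        (False, True): "may be pessimistic, sad, and sober",
    }
    return descriptors[(high_e, high_n)]


print(type_label(62, 41))
```

A client scoring near the cut point on either trait could easily fall into a different cell on retesting, which is consistent with the point that more extreme scores yield more stable categorization.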

While juxtaposing two or more continua can be done with virtually any set of 
traits, some tests and theories are predicated on such a system. For example, the 
Myers-Briggs Type Indicator, Form M (MBTI) (Myers, McCaulley, Quenk & 
Hammer, 1998), a very commonly used personality inventory, was based upon the 
theory of Carl Jung (1923). With the exception of the MBTI, the development and 
use of tests based on typologies has been on the decline over the past several decades, 
ostensibly due to increased societal sensitivity to stereotyping of people. Likewise, 
numerous cautionary chimes have been sounded regarding the potential dangers of 
using personality instruments with clients from culturally diverse backgrounds 
(Anderson, 1995; Campos, 1989; Hinkle, 1994). In the final analysis, the focus 
among structured personality assessment today is firmly on the objective measure- 
ment and analysis of personality traits for their descriptive and predictive value. 



Strengths and Limitations of the Trait Approach 



Traits have substantial potential value when used judiciously by professional 
counselors. Piedmont (2006) suggested that professional counselors can use trait 
approaches in six primary ways: (1) understanding the client; (2) making differential 
diagnoses; (3) establishing empathy and rapport; (4) giving feedback and insight; (5) 
anticipating the course of therapy; and (6) matching treatments to clients. 

Structured trait approaches to personality assessment have several noteworthy 
strengths. Trait inventories are relatively easy to administer, score, and interpret, ei- 
ther by hand or by computer. Most trait inventories are also norm referenced, allow- 
ing comparison of an individual's scores to a norm group. This allows examiners to 
determine whether clients have an average amount of a given trait, higher than av- 
erage amounts, or lower than average amounts. Remember that knowing how much 
of a given trait an individual possesses is often useful in predicting client actions and 
outcomes. 

Perhaps the greatest strength of trait approaches to personality assessment is that 
they focus on normal, healthy personality functioning, not just the clinical or patho- 
logical aspects of personality. In this way, they help us to understand a client's 
strengths and protective factors, rather than providing a myopic focus on a client's 
weaknesses and vulnerabilities. 

Because traits are empirically derived constructs, they actually do exist in nature 
and can be observed and measured reliably. Traits also usually have robust predictive 






validity that can be empirically verified. In fact, research on the FFM has shown 
traits can predict a significant amount of variance across a wide range of clinical out- 
comes. Thus professional counselors can rely on knowledge of client traits to develop 
rapport, communicate in the most effective therapeutic manner, and, in general, 
structure treatment in the most efficacious manner. 

Trait inventories are also amenable to computer scoring and interpretation, 
which can save professional counselors time and clients money. The standardized 
programming of computerized reports also tends to minimize scoring errors and ex- 
aminer bias in judgment and interpretation. In addition, predictions and narrative 
written into the program usually are based on empirical evidence. This is in contrast 
to constructed commentary by examiners who vary substantially in experience and 
expertise. On the flip side, computer programs are frequently criticized for promot- 
ing a loss of individuation (i.e., every report sounds the same). Because examiners 
almost never have access to the programming language, it is usually impossible to 
evaluate the source and veracity of narrative statements generated by the report, or 
even the standard scores derived by internal scoring and conversion programs (Note: 
Fortunately, norm tables for most computerized interpretive tables are still published 
in hard-copy formats so clinicians can verify score accuracy by hand if necessary). 
Finally, given the boilerplate statements generated by many computerized programs, 
some professional counselors may question the accuracy of interpretive statements 
for the actual client being assessed. Several of the tests reviewed below and in the pre- 
vious section have examples of computer-generated reports. 

While very helpful, trait approaches do not escape substantial criticism. Some of 
the criticism is more theoretical or philosophical, while some involves more practi- 
cal aspects. In regard to the theoretical and practical issues, some question how use- 
ful and helpful descriptions of personality can possibly be without some overriding 
theory to hold them together and bring meaning in some holistic manner. Indeed, 
little explanation or rationale has been offered as to why the traits even exist, how 
they develop and become differentiated over time, or even the degree to which each 
is genetically determined or environmentally influenced. On a more philosophical 
level, trait approaches are sometimes criticized for being tautological (redundant) in 
nature; that is, we know that outgoing, energetic, and sociable people are extraverted 
because extraverted people are outgoing, energetic, and sociable (Piedmont, 2006). 

Another criticism is that different models predict different numbers of primary 
traits. While this may be expected on the basis of one's theoretical orientation, please 
recall that there is no theoretical orientation. These models are statistically derived 
and subjected to empirical validation (i.e., "I exist (statistically); therefore I am"). Much 
of the recent evidence supports the five-factor model. But are there more than five 
factors? Costa and McCrae do not deny the possibility, and a research associate of 
theirs, Ralph Piedmont (2006), has identified a sixth factor, spirituality, using the 
same methodology that Costa and McCrae used to derive the original five factors. A 
holistic, integrative explanation based in theory is a critical next step in making trait 
approaches more explanatory (note the tautological emphasis). 

There are several criticisms of trait approaches grounded more in the realm of 
pragmatics. First, self-report instruments usually only measure superficial portions of 
personality functioning that a client or observer of the client could also readily 
identify through an effective interview process. In a related criticism, trait approaches often 
lack the explanatory depth of projectives (psychoanalysis) and provide less insight into 
the client's internal world. Relatedly, professional counselors must ensure that all per- 
sonality assessment is conducted according to the highest degree of ethical practice 
and guard against an invasion of privacy or inappropriate disclosure of information to 
others who may misunderstand or misuse the results (e.g., discriminate against clients 
with "undesirable" characteristics by limiting their opportunities). 

Finally, a primary criticism continues to be that self-report trait inventories are 
a relatively transparent means of obtaining information about clients. As such, trait- 
based inventories are susceptible to client response sets and faking (e.g., acquies- 
cence, nonacquiescence, malingering, socially desirable responses). It is inevitable 
that some clients will answer in a guarded manner, while others will be too self-crit- 
ical. More and more structured inventories are including validity scales to allow pro- 
fessional counselors to identify clients who may be presenting with a response set 
that could invalidate interpretations. 



SOME COMMONLY USED STRUCTURED PERSONALITY 
ASSESSMENT INVENTORIES 

Revised NEO Personality Inventory (NEO-PI-R) 



The Revised NEO Personality Inventory (NEO-PI-R) (Costa & McCrae, 1992) is a 
240-item inventory designed to measure the five major dimensions of personality 
and is best used as a basic research instrument (Botwin, 1995; Digman, 1990; 
Goldberg, 1992; Piedmont, 2006). The NEO-PI-R usually requires about 25 to 35 
minutes for an adult to complete, and hand scoring can be done quickly. Scale items 
measure Neuroticism, Extraversion, Openness to Experience, Agreeableness, and 
Conscientiousness, and each of these scales has six subscales (Botwin, 1995). Table 
8.1 contains factor facets and descriptions from the NEO-PI-R (Costa & McCrae, 
1992). These scales use both a self-report and an observer-rater form and can be in- 
dividually or group administered. Scores are derived from a 5-point Likert scale 
ranging from Strongly Agree (1) to Strongly Disagree (5), and are translated into T 
scores for interpretation. Sample items include "Watching sports bores me," "I often 
feel calm and relaxed," and "It is easy for me to take charge of situations." 

The self-rating, stratified sample consisted of 500 men and 500 women 
(screened from a larger pool of 2,273 people) and was selected demographically to 
match 1995 U.S. Census projections. The attention to sample selection is an im- 
provement over the NEO-PI (Botwin, 1995). Observer rating norms were obtained 
from 143 ratings of 73 men and 134 ratings of 69 women from both spouses and 
multiple peer ratings (Costa & McCrae, 1992; Piedmont, 2006). Internal consisten- 
cies for individual facet scales ranged from r = 0.56 to r = 0.81 in self-reports and 
from r = 0.60 to r = 0.90 in observer ratings (Costa & McCrae, 1992). Test-retest 
reliabilities for facet scales on the original NEO ranged from r = 0.66 to r = 0.92 
(McCrae & Costa, 1983). The NEO-PI-R correlated with similar scales, and con- 
struct, convergent and divergent validity were found to be adequate. 






Table 8.1 NEO-PI-R descriptions of traits and facets 

Domains 
N: Neuroticism             General tendency to experience negative affects 
E: Extraversion            Sociability, assertiveness, activeness, talkativeness 
O: Openness                Active imagination, aesthetic sensitivity, attentiveness to inner feelings, preference for variety, intellectual curiosity, independence of judgment 
A: Agreeableness           Interpersonal tendencies, altruism, sympathy, eagerness to help 
C: Conscientiousness       Control of impulses, management of desires 

Neuroticism facets 
N1: Anxiety                Apprehensive, fearful, prone to worry, nervous, tense, jittery 
N2: Angry Hostility        Tendency to experience anger and related states 
N3: Depression             Tendency to experience depressive affect 
N4: Self-Consciousness     Emotions of shame and embarrassment, uncomfortable around others 
N5: Impulsiveness          Inability to control cravings and urges 
N6: Vulnerability          Vulnerability and inability to cope with stress 

Extraversion facets 
E1: Warmth                 Issues of interpersonal intimacy 
E2: Gregariousness         Preference for other people's company 
E3: Assertiveness          Tendency toward dominance, forcefulness, and social ascendancy 
E4: Activity               Tendency toward rapid tempo and vigorous movement (energy) 
E5: Excitement seeking     Tendency to crave excitement and stimulation 
E6: Positive emotions      Tendency to experience positive emotions 

Openness facets 
O1: Fantasy                Intensity of imagination and fantasy life 
O2: Aesthetics             Appreciation for and interest in art and beauty 
O3: Feelings               Openness to feelings, receptivity to one's own inner feelings, evaluation of emotion as an important part of life 
O4: Actions                Behavioral willingness to try different activities, etc. 
O5: Ideas                  Intellectual curiosity, open-mindedness, willingness to consider new things, ideas 
O6: Values                 Readiness to reexamine social, political, and religious values 

Agreeableness facets 
A1: Trust                  Tendency to trust or distrust others 
A2: Straightforwardness    Frankness, sincerity, and ingenuousness relative to others 
A3: Altruism               Concern for others' welfare, generosity, consideration of others 
A4: Compliance             Characteristic reactions to interpersonal conflict 
A5: Modesty                Humbleness, self-efficacy 
A6: Tender-mindedness      Attitudes of sympathy and concern for others 

Conscientiousness facets 
C1: Competence             Sense that one is capable, sensible, prudent, and effective 
C2: Order                  Tidiness, level of organization 
C3: Dutifulness            Governed by conscience 
C4: Achievement striving   Levels of aspiration and hard work toward goals 
C5: Self-discipline        Ability to begin tasks and carry them through to completion 
C6: Deliberation           Tendency to think carefully before acting 

Source: Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual by P. T. Costa Jr. & R. R. 
McCrae (1992). Odessa, FL: Psychological Assessment Resources. 






Think About It 8.1 Using Table 8.1, describe your own personality 
using the five-factor model. 



16 Personality Factors (16PF) Questionnaire 



The 16PF Questionnaire (Cattell et al., 1993) is a 185-item self-report inventory for 
clients ages 16 years to adult and is designed to measure normal personality character- 
istics, problem-solving abilities, and preferred work activities and to identify problems 
in areas known to be problematic to adults. Items of the 16PF measure Anxiety, 
Extraversion, Independence, Self-Control, and Tough-Mindedness (Erford, 2006) 
and can be used to predict vocational interest as classified by Holland's occupational 
typology (Conn & Rieke, 1994). The 16PF may prove helpful as a career counseling 
tool and as a work behavior and work attitude device (Vansickle & Conn, 1996). 
Administration of the 16PF requires a 5th-grade reading level and can be conducted 
for individuals or groups by paper and pencil in 30 to 50 minutes, or in 25 to 35 min- 
utes by computer (Russell & Karol, 1994). Scoring can be done by hand using four 
scoring keys, a norm table, and an Individual Record form, or by computer through 
a mail-in scoring service or the Institute for Personality and Ability Testing's (IPAT) 
OnSite System software. Raw scores are converted into standardized (sten) scores that 
are based on a 10-point scale (M = 5.5; SD = 2) (Russell & Karol, 1994). Sample items 
include "I often like to watch team games, a) true; b) false," and "I prefer friends who 
are: a) quiet; b) ?; c) lively." A portion of a sample computerized 16PF Basic 
Interpretive Report from IPAT is provided in Table 8.2. Professional counselors may 
also be interested in the Karson Clinical Report (KCR) and Cattell Comprehensive 
Personality Interpretation (CCPI). Sample reports can be viewed at www.ipat.com. 
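
The sten scale described above (M = 5.5; SD = 2) can be illustrated with a short sketch. In practice the 16PF conversion is done with the published norm tables; the linear formula below is only the general definition of a sten, and the norm mean and standard deviation in the usage line are hypothetical values chosen for illustration.

```python
import math


def z_to_sten(z: float) -> int:
    """Convert a z-score to a sten score (mean 5.5, SD 2), limited to 1..10."""
    unrounded = 5.5 + 2.0 * z
    # Round to the nearest whole sten, then clamp to the 10-point scale.
    sten = math.floor(unrounded + 0.5)
    return max(1, min(10, sten))


def raw_to_sten(raw: float, norm_mean: float, norm_sd: float) -> int:
    """Convert a raw score to a sten given a norm group's mean and SD.

    norm_mean and norm_sd are placeholders for values that would come
    from a test's norm tables; they are not taken from the 16PF manual.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return z_to_sten((raw - norm_mean) / norm_sd)


if __name__ == "__main__":
    # Hypothetical norm group: mean raw score 12, SD 4.
    print(raw_to_sten(15, norm_mean=12, norm_sd=4))
```

Because each sten spans half a standard deviation, stens of 5 and 6 straddle the mean, and extreme z-scores are simply assigned 1 or 10.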

The stratified normative sample (n = 2,500) consisted of approximately equal 
numbers of males and females from every U.S. state and the District of Columbia, 
closely representing the demographic variables of gender, race, age, and education in 
the 1990 U.S. census. Reliability reports of scores on the 16PF are low, with only 
the Social Boldness scale consistently above r = 0.80 (Erford, 2006). Clinicians 
should be cautious when using this inventory for high school graduates and people 
over age 65, because these were underrepresented in the normative sample 
(McLellan, 1995). While the 16PF may prove helpful in developing or confirming 
hypotheses about client personality characteristics, score reliability and validity are 
generally inadequate for decision-making purposes, unless used in conjunction with 
multiple sources of information. 

One of the primary criticisms of the 16PF continues to be the identification of 
too many primary factors (Chernyshenko, Stark, & Chan, 2001; Digman & Inouye, 
1986). Second-order factor analytic studies indicate that about 4 to 6 factors 
explain the items' variance to a more substantial degree; after all, many of the 
16 factors are highly intercorrelated. The addition of impression management 
scales is a benefit in interpretation (Schueger, 1992). 






Table 8.2 16PF Basic Interpretive Report for a 33-year-old female. 



RESPONSE STYLE INDICES 

Index                    Raw Score 
Impression Management    19          within expected range 
Infrequency                          within expected range 
Acquiescence             51          within expected range 

All response style indices are within the normal range. 



16PF PROFILE 

Sten  Factor                    Left meaning          Right meaning 
6     Warmth (A)                Reserved              Warm 
9     Reasoning (B)             Concrete              Abstract 
7     Emotional Stability (C)   Reactive              Emotionally Stable 
6     Dominance (E)             Deferential           Dominant 
5     Liveliness (F)            Serious               Lively 
6     Rule-Consciousness (G)    Expedient             Rule-Conscious 
8     Social Boldness (H)       Shy                   Socially Bold 
7     Sensitivity (I)           Utilitarian           Sensitive 
4     Vigilance (L)             Trusting              Vigilant 
7     Abstractedness (M)        Grounded              Abstracted 
4     Privateness (N)           Forthright            Private 
6     Apprehension (O)          Self-Assured          Apprehensive 
9     Openness to Change (Q1)   Traditional           Open to Change 
4     Self-Reliance (Q2)        Group-Oriented        Self-Reliant 
4     Perfectionism (Q3)        Tolerates Disorder    Perfectionistic 
6     Tension (Q4)              Relaxed               Tense 

GLOBAL FACTORS 

Sten  Factor              Left meaning     Right meaning 
7     Extraversion        Introverted      Extroverted 
5     Anxiety             Low Anxiety      High Anxiety 
2     Tough-Mindedness    Receptive        Tough-Minded 
7     Independence        Accommodating    Independent 
5     Self-Control        Unrestrained     Self-Controlled 

[Sten profile graphs (scale 1 to 10: Low, Average, High) not reproduced.] 



TOUGH-MINDEDNESS 

Tough-Mindedness is low. Ms. Female tends to value breadth and variety of experience, including openness to different ideas, 
people, or situations. When approaching problems, she may focus on subjective or emotional considerations rather than cold, 
hard facts. 

■ Ms. Female can be sensitive to emotional and aesthetic considerations. 

■ She often gets absorbed in ideas and thoughts. 

■ She is open to change and enjoys pursuing new ideas, opinions, and experiences. 




EXTRAVERSION 

Extraversion is high-average. Ms. Female is socially participative and probably enjoys activities involving others. Her attention is 
generally directed toward other people. 

■ Because this person is often socially bold, she is unlikely to feel intimidated in group settings. She may be relatively unaffected 
by insults or threats. 

■ When Ms. Female chooses to reveal personal matters to others, she tends to be forthright and genuine. 

■ Ms. Female shows a tendency to do things and make plans with others rather than alone. 

INDEPENDENCE 

Independence is high-average. Generally, Ms. Female prefers to lead an independent and self-directed life. Although she can 
sometimes be accommodating to others' wishes, she may often assert control or be persuasive. 

■ This person is venturesome and expressive, especially in front of others. Extreme boldness sometimes can be associated with a 
high desire for influence and attention. 

■ Vigilance does not appear to shape her stance on influencing or persuading others. She tends to trust other people's 
motivations rather than to question them. 

■ She is experimenting and has an inquiring, critical mind. She tends to question traditional methods and to press for new 
approaches. 

ANXIETY 

At the present time, Ms. Female presents herself as no more or less anxious than most people. 

■ Usually, Ms. Female meets challenges with calm and inner strength. 

■ She shows a tendency to be trusting and accepting of other people and their motives. 

SELF-CONTROL 

Self-Control is average. At times, Ms. Female may show the self-discipline and conscientiousness needed to meet her 
responsibilities. At other times, she may be less restrained, following her own wishes. 

■ Because this individual tends to be preoccupied with ideas, she may disregard the practical aspects of a situation. 

■ This individual seems to balance casualness and a tolerance for disorder with the need for organization and structure. She may 
function best in an unexacting, flexible setting rather than in a rigid system. 

SELF-ESTEEM AND ADJUSTMENT 

Overall, this individual tends to view herself positively, having a strong sense of self-worth and competence. She is likely to be 
capable of obtaining most of her personal goals. Self-Esteem is high-average (7). 

The degree of emotional stability shown by Ms. Female is typical of most adults. That is, most of the time she tends to be 
calm and relaxed, but in demanding situations, she may be reactive or upset. Emotional Adjustment is average (6). 

Not only is Ms. Female likely to feel quite comfortable in social gatherings, but she may initiate contact, lead conversations, 
and draw attention to herself. She probably will not hesitate to express what she needs from others. Social Adjustment is high (8). 

SOCIAL SKILLS 

The following six scales pertain to the ways in which information is communicated in social environments. The scales are broadly 
divided into two categories: nonverbal communication (Emotional Scales) and verbal communication (Social Scales). Within 
each category, communication skills are discussed at three more specific levels: the ability to send information (Expressivity), to 
receive and interpret messages (Sensitivity), and to control information (Control). Although a person may be more or less skilled 
in certain areas, overall social competence is reflected in a general balance among the six scales below. 

Ms. Female's communication is predicted to be demonstrative and forceful. That is, her emotional displays are probably 
uninhibited and genuine. Her emotions are likely to be easily perceived by others, and thus are likely to influence the emotional 
states of those around her. Emotional Expressivity is high (8). 




This person may enjoy observing other people's gestures, moods, and nonverbal interactions. Thus, she may feel comfortable 
interpreting people's emotional and other nonverbal messages. Emotional Sensitivity is high-average (7). 

At times, Ms. Female may adapt her emotional displays to the given situation. At other times, she may be unable to suppress 
a strongly felt emotion. Emotional Control is average (5). 

This person is probably outgoing and articulate and would often make a good first impression. She may feel comfortable 
with verbal disclosure and could probably join in most discussions with relative ease. Social Expressivity is high-average (7). 

Ms. Female may not be very concerned about monitoring or interpreting others' social behavior or mannerisms. Ms. 
Female's self-comfort may mean that she is not overly concerned about the appropriateness of her own actions. Social Sensitivity 
is low-average (4). 

This person projects a comfortable social presence. That is, she probably presents herself well in just about any type of social 
situation and is likely to participate with any social group. She may consider the appropriateness of when to speak up and when 
to withhold comment according to the demands of a given situation. Social Control is high (9). 

This person is attentive to other people and is likely to be sensitive to their feelings. She is probably willing to consider 
another person's point of view. As a consequence, others may seek her out for sympathy and support. Ms. Female should be 
careful not to allow the problems of others to override her own. Empathy is high (8). 

LEADERSHIP AND CREATIVITY 

In a group of peers, potential for leadership is predicted to be average (6). 

At the client's own level of abilities, potential for creative functioning is predicted to be high (8). She probably has the sense 
of adventure, assertiveness, and orientation toward ideas that are necessary for pursuing creative interests. 

Ms. Female shows characteristics somewhat similar to persons who invest a lot of time producing novel or original works. 
Should this individual choose to pursue creative endeavors, her rate of output is predicted to be above average (7). 

VOCATIONAL ACTIVITIES 

Different occupational interests have been found to be associated with different personality qualities. The following section 
compares Ms. Female's personality to these known associations. The information below indicates the degree of similarity between 
Ms. Female's personality characteristics and each of the six Holland Occupational Types (Self-Directed Search; Holland, 1985). 
Those occupational areas for which Ms. Female's personality profile shows the highest degree of similarity are described in greater 
detail. Descriptions are based on item content of the Self-Directed Search as well as the personality predictions of the Holland 
types as measured by the 16PF. 

Remember that this information is intended to expand Ms. Female's range of career options rather than to narrow them. All 
comparisons should be considered with respect to other relevant information about Ms. Female, particularly her interests, 
abilities, and other personal resources. 

HOLLAND THEMES 

Sten  Factor 
9     Artistic 
7     Investigative 
7     Social 
6     Enterprising 
5     Realistic 
4     Conventional 

Artistic = 9 



Ms. Female shows personality characteristics similar to Artistic persons, who are self-expressive, typically through a particular 
mode such as art, music, design, writing, acting, composing, etc. Like Artistic persons, Ms. Female may be venturesome and open 
to different views and experiences. Sometimes she may be preoccupied with thoughts and ideas, which may relate to the overall 
creative process. She may do her best work in an unstructured, flexible environment. It may be worthwhile to explore whether 
Ms. Female appreciates aesthetics and possesses artistic, design, or musical talents. 
Occupational Fields: Art 

Music 

Design 

Theater 

Writing 

Investigative = 7 

Ms. Female shows personality characteristics similar to Investigative persons. Such persons typically have good reasoning ability 
and enjoy the challenge of problem solving. They tend to have critical minds, are curious, and are open to new ideas and 
solutions. Investigative persons tend to be reserved and somewhat impersonal; they may prefer working independently. They tend 
to be concerned with the function and purpose of materials rather than aesthetic principles. Ms. Female may enjoy working with 
ideas and theories, especially in the scientific realm. It may be worthwhile to explore whether Ms. Female enjoys doing research, 
reading technical articles, or solving challenging problems. 
Occupational Fields: Science 

Math 

Research 

Medicine and Health 

Computer Science 

Social = 7 

Ms. Female shows personality characteristics similar to Social persons, who indicate a preference for associating with other 
people. Such interactions are distinguished by a nurturing, sympathetic quality. Ms. Female may find it very easy to relate to all 
kinds of people. In addition to being warm and friendly, Social persons are typically receptive to different views and opinions. 
They feel most comfortable in positions that allow for regular social interaction. It might be worthwhile to explore whether Ms. 
Female enjoys working with others and having them seek her out for advice or comfort. 
Occupational Fields: Teaching 

Counseling 

Psychology 

Social Work 

Health Services 

Source: Copyright © 1994, The Institute of Personality and Ability Testing, Inc., Champaign, IL. All rights reserved. Reproduced with permission 
of the Institute of Personality and Ability Testing, Inc. 

Note: The original 16PF Basic Interpretive Report included graphical score displays for each interpreted factor. These graphs have been removed to 
conserve space. The 16PF Basic Interpretive Report usually generates a 10-page report. 
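
The verbal labels in the sample report in Table 8.2 (low, low-average, average, high-average, high) can be related to sten scores with a small lookup. The cut-points below are inferred only from the stens that appear in that one sample report (2 = low, 4 = low-average, 5 and 6 = average, 7 = high-average, 8 and 9 = high); how stens of 1, 3, and 10 are labeled is an assumption, and the publisher's actual interpretive rules may differ.

```python
def sten_band(sten: int) -> str:
    """Map a sten score (1..10) to a descriptive band.

    Cut-points are inferred from a single sample report and are
    illustrative, not the publisher's documented rules.
    """
    if not 1 <= sten <= 10:
        raise ValueError("sten must be between 1 and 10")
    if sten <= 3:
        return "low"
    if sten == 4:
        return "low-average"
    if sten <= 6:
        return "average"
    if sten == 7:
        return "high-average"
    return "high"


# Global factor stens from the sample report in Table 8.2.
GLOBAL_FACTORS = {
    "Extraversion": 7,
    "Anxiety": 5,
    "Tough-Mindedness": 2,
    "Independence": 7,
    "Self-Control": 5,
}

if __name__ == "__main__":
    for factor, sten in GLOBAL_FACTORS.items():
        print(f"{factor}: sten {sten} ({sten_band(sten)})")
```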



Myers-Briggs Type Indicator-Form M (MBTI) 

The Myers-Briggs Type Indicator-Form M (MBTI) (Myers, McCaulley, et al., 1998) is 
a 93-item self-report inventory for clients ages 14 years and older. Based on Jungian 
theory, items measure four different bipolar continua: Extraversion-Introversion 
(E-I), Sensing-Intuition (S-N), Thinking-Feeling (T-F), and Judging-Perceiving 
(J-P). These scales result in four-letter combinations that identify and describe 16 
personality types (see Table 8.3). Sample items include "Are you: easy to get to know, or 
hard to get to know?" and "Can you: talk easily to almost anyone for as long as you 
have to, or find a lot to say only to certain people or under certain conditions?" The 
MBTI requires a 7th-grade reading level and takes about 15 to 25 minutes to administer. 
This inventory can be hand-scored or computer-scored. Forced-choice items 
produce responses that are weighted in points. The normative sample (n = 3,009) 
consisted of U.S. adults ages 18 years and older, generally representing sex and ethnicity 
consistent with the 1990 U.S. Census, although White women were overrepresented 
and Black men were underrepresented (Myers, McCaulley, et al., 1998). 

Table 8.3 Examples of associated traits with MBTI typologies 

Example Typology 1: Introverted-Intuition-Thinking-Judging (INTJ) 

Have original minds and great drive for implementing their ideas and achieving their goals. 
Quickly see patterns in external events and develop long-range explanatory perspectives. When 
committed, organize a job and carry it through. Skeptical and independent, have high standards 
of competence and performance - for themselves and others. 

Example Typology 2: Extroverted-Sensing-Feeling-Perceiving (ESFP) 

Outgoing, friendly, and accepting. Exuberant lovers of life, people, and material comforts. Enjoy 
working with others to make things happen. Bring common sense and a realistic approach to 
their work, and make work fun. Flexible and spontaneous, adapt readily to new people and 
environments. Learn best by trying a new skill with other people. 

Source: Introduction to type (6th ed.) by I. B. Myers, L. K. Kirby, & K. D. Myers, (1998), p. 13. Palo Alto, 
CA: Consulting Psychologists Press. 
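
As a brief illustration of how four two-pole scales yield 16 four-letter types, the sketch below enumerates every combination of the letters named above. It shows only the combinatorics (2 x 2 x 2 x 2 = 16); it does not reproduce the MBTI's point-weighted scoring, which is not described in enough detail here to model.

```python
from itertools import product

# The four dichotomies and their letter codes, as listed in the text.
DICHOTOMIES = [
    ("E", "I"),  # Extraversion-Introversion
    ("S", "N"),  # Sensing-Intuition
    ("T", "F"),  # Thinking-Feeling
    ("J", "P"),  # Judging-Perceiving
]


def all_types() -> list[str]:
    """Return every four-letter code formed by one pole from each dichotomy."""
    return ["".join(letters) for letters in product(*DICHOTOMIES)]


if __name__ == "__main__":
    types = all_types()
    print(len(types))
    print(", ".join(types))
```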

Split-half reliability exceeds 0.90, an acceptable level, for the national sample. 
Test-retest reliability (4-week interval) ranged from r = 0.83 to r = 0.97, and 
internal consistency (coefficient alpha) for males and females ranged from r = 0.90 to 
r = 0.93 (Myers, McCaulley et al., 1998). Validity of the MBTI is moderate to high 
when correlated with the five-factor model as portrayed in the NEO PI-R (Erford, 
2006). Construct validity was found for each of the four dichotomies (Erford, 2006; 
Myers, McCaulley et al., 1998). More than 3 million people are administered the 
MBTI each year (Michael, 2003). This inventory can be used to increase insight 
(Fleener, 2001), to assist in career counseling in conjunction with human resource 
issues (Capraro & Capraro, 2002), and to identify obstacles to career development 
(Healy & Woodward, 1998). Clinicians should note that the artificial manner with 
which the MBTI types people may not lead to meaningful descriptions (Vacha- 
Haase & Thompson, 1999), and clients may feel restricted by reporting specific be- 
haviors, attitudes, career choices, or interests (Watkins & Campbell, 2000) because 
of the forced-choice test construction. While the MBTI does appear to measure at 
least four important personality dimensions, the evidence does not support the es- 
tablishment of 16 unique personality types (Johnson, Mauzey, Johnson, Murphy, & 
Zimmerman, 2002). Finally, as with all self-report instruments, it is difficult to 
confirm the accuracy of self-perceptions constituting an MBTI client typology 
(Gailbreath, Wagner, Moffett, & Hein, 1997; Gardner & Martinko, 1996), 
especially when no response validity measures are provided. 






Millon Index of Personality Styles Revised (MIPS Revised) 



The Millon Index of Personality Styles Revised (MIPS Revised) (Millon, 2003) is a 
180-item true-false Level B self-report instrument for adults ages 18 years and older and is 
designed to measure personality styles of normally functioning adults. Scale names 
and the profile display of the original MIPS were updated to provide administrators 
with a better, more intuitive approach to interpreting test results. This inventory 
measures three dimensions of normal personality using 6 Motivating Style scales 
(Pleasure-Enhancing, Pain-Avoiding, Actively Modifying, Passively Accommo- 
dating, Self-Indulging, Other-Nurturing); 8 Thinking Style scales (Externally 
Focused, Internally Focused, Realistic/Sensing, Imaginative/Intuitive, Thought- 
Guided, Feeling-Guided, Conservation-Seeking, Innovation-Seeking); 10 Behaving 
Style scales (Asocial/Withdrawing, Gregarious/Outgoing, Anxious/Hesitating, 
Confident/Asserting, Unconventional/Dissenting, Dutiful/Conforming, Submissive/ 
Yielding, Dominant/Controlling, Dissatisfied/Complaining, Cooperative/Agreeing); 
and 4 Validity Indices that provide information about Positive Impression, Negative 
Impression, Consistency, and Clinical Index. The MIPS Revised takes about 30 min- 
utes to complete using either the paper-and-pencil or computer format. An 8th-grade 
reading level is required, and it is important to designate age and gender to obtain an 
accurate report. The MIPS Revised can be scored by hand, computer, mail-in, or op- 
tical scanning methods. 

The MIPS Revised test offers separate norms for adults and college students, and 
for both separate and combined genders. The adult sample consisted of 1,000 indi- 
viduals (500 females, 500 males) ages 18-65 years and is stratified according to the 
U.S. population by age, race or ethnicity, and education level (Millon, 2003). The 
college sample consisted of 1,600 students (800 males, 800 females) selected from 14 
colleges and universities to be representative of a college student population in terms 
of ethnicity, age, year in school, major area of study, region of the country, and type 
of institution. The MIPS Revised can be used as a screening tool in employee selec- 
tion; for employee assistance programs and leadership and employee development 
programs; in career planning for high school and college students; in the curriculum 
for college courses in psychological testing; and in relationship, premarital, marriage, 
and individual counseling. 



Personality Assessment Inventory (PAI) 



The Personality Assessment Inventory (PAI) (Morey, 1991) is used to assess behaviors 
related to psychopathology as well as to provide information for screening, clinical 
diagnosis, and treatment. It can be administered in individual or group formats to 
clients ages 18 years to adult in about 40 to 50 minutes. There are 344 items on this 
self-reported inventory, and responses are based on a 4-point scale (Not at All True, 
Slightly True, Mainly True, and Very True). The PAI requires a 4th-grade reading 
level. There are 22 nonoverlapping scales, including 4 validity scales (Inconsistency, 
Infrequency, Negative Impression, Positive Impression); 11 clinical scales (Somatic 
Complaints, Anxiety, Anxiety-Related Disorders, Depression, Mania, Paranoia, 
Schizophrenia, Borderline Features, Antisocial Features, Alcohol Problems, Drug 
Problems); 5 treatment scales (Aggression, Suicidal Ideation, Stress, Nonsupport, 
Treatment Rejection); and 2 interpersonal scales (Dominance, Warmth). Answers 
can be scored by hand or by optical scanning, and raw scores can be converted into 
T scores (Boyle, 1995). 
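
For readers unfamiliar with T scores, the following sketch shows the standard linear definition, T = 50 + 10z, where z is the raw score's distance from the norm-group mean in standard deviation units. Actual PAI conversions rely on the norm tables in the manual; the mean and standard deviation used in the example call are hypothetical and are not PAI norms.

```python
def raw_to_t(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Convert a raw score to a linear T score (mean 50, SD 10).

    norm_mean and norm_sd stand in for values from a test's norm
    tables; the figures used below are illustrative only.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return 50.0 + 10.0 * z


if __name__ == "__main__":
    # Hypothetical scale: norm mean 20, SD 5. A raw score of 30 lies
    # two standard deviations above the mean.
    print(raw_to_t(30, norm_mean=20, norm_sd=5))
```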

Standardization samples conformed to U.S. population demographics with re- 
spect to the test's diagnostic groups (Kavan, 1995). Reliability of scores seems ques- 
tionable based on the wide range of coefficients for different variables. Internal con- 
sistency coefficients for the 22 scales ranged from r = 0.45 to r = 0.90, with a median 
of 0.81 (normative sample); from r = 0.22 to r = 0.89, with a median of 0.82 (col- 
lege sample); and from r = 0.23 to r = 0.94, with a median of 0.86 (clinical sample). 
Median alphas were consistent between various races, ages, and genders in the mid 
to high 0.70s. Test-retest reliability coefficients (3- to 4-week interval) ranged from 
r = 0.31 to r = 0.92, with a median of 0.82 (Boyle, 1995). Correlation studies with 
the Minnesota Multiphasic Personality Inventory (MMPI) and the Marlowe-Crowne 
Social Desirability Scale yielded mixed validity results. Even with the disputed 
reliability and validity information, Kavan (1995) viewed the PAI as a competitor of the 
MMPI-2 that is easier to administer, score, and interpret. 



California Psychological Inventory (CPI) 



The California Psychological Inventory (CPI) (Gough & Bradley, 1996) is a 434-item 
inventory designed to assess personality characteristics and to predict what people 
will say and do in specified contexts. The CPI has numerous questions that overlap 
with the original MMPI but was designed for a different population and purpose 
than the MMPI (i.e., personality descriptions of a nonclinical population). Scale 
items measure 20 Folk scales (Dominance, Capacity for Status, Sociability, Social 
Presence, Self-Acceptance, Independence, Empathy, Responsibility, Socialization, 
Self-Control, Good Impression, Communality, Well-Being, Tolerance, Achievement 
via Conformity, Achievement via Independence, Intellectual Efficiency, 
Psychological-Mindedness, Flexibility, and Femininity-Masculinity); 3 Vector scales 
(Internality-Externality, Norm-Questioning-Favoring, and Self-Realization); and 13 
Special Purpose scales. These scales are for clients ages 13 years and older, are writ- 
ten at a 5th-grade reading level, and take about 45 to 60 minutes to administer 
(Atkinson, 2003). The CPI is self-administered and can be done using either pencil 
and paper or a computer. Forms are scanned for automated data entry. Using the 
scores from the three Vector scales, a cuboidal personality typology is developed, 
which helps to classify individuals into four categories (Atkinson, 2003). 

The normative sample (n = 6,000; 3,000 of each gender) was reported as not 
being representative or random because of use of primarily high school students 
(50%) and undergraduate students (16.7%), so these are probably the best popula- 
tions for which to use the instrument, though the manual provides useful reference 
tables for comparing students of various ages (Hattrup, 2003). The test produced 
internal consistency Cronbach's alpha estimates on the 20 Folk scales ranging from 
r = 0.43 to r = 0.85, with a median of 0.76. For the three Vector scales, the internal 
consistency estimates ranged from r = 0.77 to r = 0.88. Cronbach's alpha for the 13 
specialty scales ranged from r = 0.45 to r = 0.88. Alpha reliabilities of the CPI scales 
ranged from r = 0.62 to r = 0.84 in the total sample, with a median of 0.77. Test-retest 
reliabilities were based on samples of 108 males and 129 females who were 
retested after a 1-year interval, and samples of 91 females and 44 males who were 
retested after 5- and 25-year intervals, respectively. For the 1-year retest, scale relia- 
bilities ranged from r = 0.51 to r = 0.84, with a median of 0.68. For the 5-year and 
25-year retest, reliabilities ranged from r = 0.36 to r = 0.73, and r = 0.37 to r = 0.84, 
respectively. Test-retest reliability estimates among high school students were be- 
tween 0.60 and 0.80 for a 1-year period. The Folk and Vector scales had moderate 
to strong construct validity correlation scores (0.40 to 0.80), but the predictive 
power regarding individual behavior in a given situation was weak. 
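
Because several of the reliability figures above are Cronbach's alpha estimates, a minimal sketch of how alpha is computed from item-level data may be useful. It uses the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores). The tiny data set is invented purely for illustration and has nothing to do with CPI data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Compute Cronbach's alpha.

    items[i][p] is respondent p's score on item i; every item must
    have a score for every respondent.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score for each respondent across all items.
    totals = [sum(person_scores) for person_scores in zip(*items)]
    sum_item_variances = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


if __name__ == "__main__":
    # Invented responses: 3 items answered by 5 respondents.
    demo_items = [
        [1, 2, 3, 4, 5],
        [2, 1, 4, 3, 5],
        [1, 3, 2, 5, 4],
    ]
    print(round(cronbach_alpha(demo_items), 3))
```

Alpha rises as items covary more strongly relative to their individual variances, which is why short or heterogeneous scales tend to produce the lower coefficients reported for some scales.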



Jackson Personality Inventory-Revised (JPI-R) 



The Jackson Personality Inventory-Revised (JPI-R) (Jackson, 1994) is an inventory 
consisting of 300 true-false statements designed to produce "a set of measures of per- 
sonality reflecting a variety of interpersonal, cognitive, and value orientations" 
(Jackson, 1994, p. 1). Scale items represent 15 separate personality traits: Analytical 
(Complexity, Breadth of Interest, Innovation, Tolerance); Emotional (Empathy, 
Anxiety, Cooperativeness); Extroverted (Sociability, Social Confidence, Energy 
Level); Opportunistic (Social Astuteness, Risk Taking); and Dependable (Organiza- 
tion, Traditional Values, Responsibility). This inventory is used for adolescents and 
adults and takes approximately 35 to 45 minutes to administer. Raw scores range 
from 0 to 20 and are converted to a profile sheet that references gender-specific 
norms using a vertical grid. Scoring can be done by hand in 3 minutes or can be 
done by mail, computer, or online to produce a comprehensive client report. Sample 
items include "I usually read several books at the same time," "I enjoy taking risks," 
and "I am seldom at a loss for words." The JPI-R is a Level B instrument. 

Internal consistency reliability estimates for the JPI-R were obtained from four 
college volunteer samples using the Cronbach alpha estimate (Jackson, 1994). In the 
largest college normative sample (n = 1,107), alpha estimates ranged from r = 0.66 
for the Complexity, Tolerance, and Social Astuteness scales to r = 0.87 for the 
Innovation scale. In all four samples, the reliability estimates range from r = 0.62 for 
Social Astuteness to r = 0.88 for Social Confidence. In two studies, median internal 
consistency reliabilities (Bentler's Theta) were 0.90 and 0.93. Tables in the manual 
provide validity correlations for the JPI-R with other psychological variables and 
scales, including the Minnesota Multiphasic Personality Inventory (MMPI), the Survey 
of Work Styles (SWS), and the Jackson Vocational Interest Survey (JVIS). Counselors 
will find the manual instructions for administration and scoring easy to follow and 
are cautioned that the JPI-R cannot be used to diagnose pathology (Pittenger, 1998). 
The JPI-R is a helpful measure of client dispositions and can be used to help clients 
develop insight and understand sources of resiliency. Table 8.4 provides a sample 
computerized interpretive report for the JPI-R. 



Table 8.4 Jackson Personality Inventory-Revised (JPI-R) Basic Report for Sam Sample, a 30-year-old male 



Your JPI-R Scale Profile 
The profile below is based on your responses to the JPI-R. For a better understanding of your scores, study the definitions and 
scale descriptions and follow the profile. 

                              Combined   Female   Male 
Scale                  Raw    %ile       %ile     %ile 
Complexity             14     90         88       92 
Breadth of Interest    19     96         96       96 
Innovation             18     86         90       84 
Tolerance              17     93         93       95 
Empathy                11     38         24       54 
Anxiety                2      2          1        4 
Cooperativeness        1      4          3        4 
Sociability            12     69         66       73 
Social Confidence      18     86         86       86 
Energy Level           18     92         96       88 
Social Astuteness      12     73         76       69 
Risk Taking            17     97         99       95 
Organization           14     66         66       69 
Traditional Values     3      4          3        4 
Responsibility         14     50         38       58 

[Male percentile bar graph (axis 10 to 100) not reproduced.] 

RAW SCORE: Your raw score for each scale is based on your responses to the statements that make up that 
scale. A high raw score indicates that you endorsed many of that scale's statements. 

COMBINED PERCENTILE: This score is determined by comparing your raw score for each scale with the corresponding 
scores of a representative group consisting of both men and women. Your score is the 
percentage of the people in the representative group who received a score equal to or less than 
your score. 

FEMALE PERCENTILE: This score is the percentage of women in the representative group who received a raw score 
equal to or less than your score. Use this score to determine how you compare to members of 
the opposite sex. 

MALE PERCENTILE: This score shows how you compare to members of your own sex. Your score is the percentage 
of men in the representative group who received a raw score equal to or less than yours. The 
bar graph at the right of your profile is based on this score. 
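
The percentile definition used in this report (the percentage of the representative group scoring equal to or less than the client) can be sketched as follows. The norm-group scores in the example are invented for illustration; the actual JPI-R percentiles come from the publisher's norm tables.

```python
from typing import Sequence


def percentile_rank(score: float, norm_scores: Sequence[float]) -> float:
    """Percentage of norm-group scores that are equal to or less than score."""
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


if __name__ == "__main__":
    # Invented raw scores for a ten-person comparison group.
    demo_group = [2, 5, 8, 9, 11, 14, 14, 17, 19, 20]
    print(percentile_rank(14, demo_group))
```

Running the same raw score against separate male, female, and combined comparison groups is what produces three different percentiles for one scale.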

[Examples of Selected Scale Descriptions] 

COMPLEXITY 

Your percentile rank on the Complexity scale is 92, placing you in the extremely high range. 

Higher Scorer: Seeks intricate solutions to problems; is impatient with oversimplifications; is interested in 
pursuing topics in depth regardless of their difficulty; enjoys abstract thought; enjoys intricacy. 

Low Scorer: Prefers concrete to abstract interpretations; avoids contemplative thought; uninterested in 
probing for new insight. 

ANXIETY 

Your percentile rank on the Anxiety scale is 4, placing you in the extremely low range. 

Higher Scorer Tends to worry over inconsequential matters; more easily upset than the average person;
apprehensive about the future.

Low Scorer Remains calm in stressful situations; takes things as they come without worrying; can relax in
difficult situations; usually composed and collected.



Your JPI-R Cluster Profile

Cluster          Raw   Combined %ile   Female %ile   Male %ile
Analytical        68        97              97           97
Emotional         14         4               1            7
Extroverted       48        90              92           88
Opportunistic     29        96              99           90
Dependable        31        24              18           31

[Bar graph: male percentile for each cluster, plotted on an axis from 10 to 100]



JPI-R Cluster Descriptions 
The following cluster descriptions list the JPI-R scales that make up each cluster, as well as some of the traits found in high and 
low scorers. Also listed is the range into which your cluster score falls. Use this range to determine how strongly the high and/or 
low score traits apply to you. For more information on the scale scores that make up each of your cluster scores, refer back to the 
profile at the beginning of this report. 

ANALYTICAL 

Your percentile rank on the Analytical cluster is 97, placing you in the extremely high range. 

Your score on this cluster is derived from your scores on the JPI-R COMPLEXITY, BREADTH OF INTEREST, 
INNOVATION, and TOLERANCE scales. If you score high on this cluster of four scales, you might be expected to consider 
arguments from multiple points of view and may be inclined towards drawing distinctions among otherwise related elements of 
information. On the other hand, if you score low on this cluster, you might be expected to think of things in more black-and- 
white terms and to prefer straightforward, linear interpretations of events. 

EMOTIONAL 

Your percentile rank on the Emotional cluster is 7, placing you in the extremely low range. 

This second cluster includes the JPI-R EMPATHY, ANXIETY, and COOPERATIVENESS scales. A high score on this cluster 
indicates that you may express your feelings readily and that you may have difficulty hiding your emotions, especially under 
stressful conditions. If your score is low, you may be relatively unaffected by emotionally arousing situations and by social 
pressure. 

EXTROVERTED 

Your percentile rank on the Extroverted cluster is 88, placing you in the very high range. 

The JPI-R SOCIABILITY, SOCIAL CONFIDENCE, and ENERGY LEVEL scales make up this cluster. A high score on this
cluster suggests that you are outgoing, sociable, and active. A low score indicates that you may be more introverted and less active. 

OPPORTUNISTIC 

Your percentile rank on the Opportunistic cluster is 90, placing you in the very high range. 

Your score on this cluster is based on your scores on the JPI-R SOCIAL ASTUTENESS and RISK TAKING scales. If you scored 
high on this cluster, you may be described as diplomatic, persuasive, skeptical, worldly, and charming. A low score suggests that 
you may be more direct, less adventurous, and less uncritical of the self-serving intentions of others. 

DEPENDABLE 

Your percentile rank on the Dependable cluster is 31, placing you in the low range. 

This cluster includes the JPI-R ORGANIZATION, TRADITIONAL VALUES, and RESPONSIBILITY scales. If your score on 
this cluster is high, you may tend to be methodical, predictable, systematic, conservative and mature in your attitudes. Should 
you score low, you may be considered to be more liberal-minded and flexible in your thinking, but less organized in your work 
habits. 

Source: Reproduced by permission of Sigma Assessment Systems, Inc., P.O. Box 610984, Port Huron, MI 48061-0984. 
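
The percentile definitions in the sample report above (the percentage of a representative group scoring at or below the client's raw score) can be illustrated with a short sketch. The norm-group scores below are invented for illustration only; actual JPI-R percentiles come from the publisher's norm tables, not from a calculation like this.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring at or below `score`."""
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


# Hypothetical raw scores for a tiny norm group (not real JPI-R data).
norm_group = [3, 5, 8, 8, 10, 12, 14, 15, 17, 19]

# Seven of the ten hypothetical scores are 14 or lower.
print(percentile_rank(14, norm_group))
```

Running the same function against a female-only, male-only, or combined norm list would mirror the three percentile columns shown in the report.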




Piers-Harris Children's Self-Concept Scale-Second Edition 
(Piers-Harris-2) 



The Piers-Harris Children's Self-Concept Scale, Second Edition (Piers-Harris-2) (Piers
& Herzberg, 2002) is a 60-item self-report inventory used for children ages 7-18 
years who are able to read at a 2nd-grade reading level. The Piers-Harris-2 is designed 
to aid in the assessment of self-concept in children and adolescents. This inventory 
measures six cluster scales of Behavioral Adjustment (BEH), Intellectual and School 
Status (INT), Physical Appearance and Attributes (PHY), Freedom from Anxiety 
(FRE), Popularity (POP), and Happiness and Satisfaction (HAP). Sample items in- 
clude "I am smart," "I feel left out of things," and "I think bad thoughts." The Piers- 
Harris-2 takes about 10 to 15 minutes to complete using paper and pencil or com- 
puter and is available in Spanish. The inventory requires children to circle either Yes 
or No to indicate whether the statement describes the way they feel about them- 
selves. Raw scores (total number of responses marked in the positive direction) can 
be converted to percentiles, stanines, and T scores and are available in the form of an 
overall self-concept score or as a profile of six cluster scores. Scoring can be accom- 
plished by mail, fax, or computer (Piers & Herzberg, 2002). 
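
As a rough illustration of how a raw score relates to the derived scores mentioned above, the sketch below converts a raw score to a z score, a T score (mean 50, SD 10), a stanine, and a percentile. It assumes normally distributed scores and uses an invented norm mean and standard deviation; in practice the Piers-Harris-2 conversions are read from the manual's norm tables, so this is only a conceptual aid.

```python
from statistics import NormalDist


def standard_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to common derived scores, assuming normality."""
    z = (raw - norm_mean) / norm_sd
    t_score = 50 + 10 * z                        # T scores: mean 50, SD 10
    stanine = min(9, max(1, round(2 * z + 5)))   # stanines: mean 5, SD 2, limited to 1-9
    percentile = 100 * NormalDist().cdf(z)       # area under the normal curve at or below z
    return {"z": z, "T": t_score, "stanine": stanine, "percentile": percentile}


# Hypothetical norm values (not taken from the Piers-Harris-2 manual).
result = standard_scores(raw=48, norm_mean=40, norm_sd=8)
print(result)
```

A raw score one standard deviation above the hypothetical mean corresponds to a T score of 60 and falls near the 84th percentile.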

Restandardization of the Piers-Harris-2 utilized a sample of 1,387 students ranging
from 7 to 18 years of age. These students were recruited from school districts all
across the United States closely representing the ethnic composition of the U.S. pop- 
ulation according to the 2001 Bureau of the Census. Alpha coefficients for the Piers- 
Harris-2 cluster scale restandardization sample ranged from r = 0.74 for the 
Popularity scale to r = 0.81 for the three scales of Behavioral Adjustment, Intellectual 
and School Status (INT), and Freedom from Anxiety (FRE) (Piers & Herzberg, 
2002). Although test-retest reliability for the Piers-Harris-2 is not available, data for 
the original 80-item Piers-Harris reported reliability of r = 0.77 (2-month interval) 
and r = 0.77 (4-month interval) (Piers & Herzberg, 2002). Hattie (1992) reported
a test-retest study (4-week interval) for the Piers-Harris total score and the six clus- 
ter scales using a sample of 135 Australian students in grades 10 through 12. 
Reliability coefficients ranged from r = 0.65 for the Happiness and Satisfaction scale 
to r = 0.88 for the Physical Appearance and Attributes scale (Piers & Herzberg, 
2002). The psychometric properties of the original Piers-Harris were also reviewed 
favorably (e.g., Chiu, 1988; Epstein, 1985; Jeske, 1985). The self-report feature was
also viewed as a positive (Gans, Kenny, & Ghany, 2003; Riddle & Bergin, 1997). 
Professional counselors should note that the Piers-Harris-2 is not recommended for 
children who are unwilling or unable to cooperate in completing the questionnaire. 
It is also not recommended for children who are overtly hostile, uncooperative, un- 
communicative, prone to exaggeration or other distortions, or disorganized in their 
thinking. Children with poor English-language verbal ability will have difficulty 
completing the scale. Spanish-speaking children should use the Spanish version of 
the Piers-Harris-2. 
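
The alpha coefficients cited above are internal consistency estimates. A minimal sketch of the standard Cronbach's alpha calculation follows, using a small invented set of Yes/No responses scored 1 (positive direction) or 0; the data and the function name are illustrative assumptions, not material from the Piers-Harris-2 manual.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Variance of respondents' total scores.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: six children answering four Yes/No items (1 = positive direction).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Short scales with few items, like this toy example, tend to produce lower alpha values, which is one reason subscale profiles call for more cautious interpretation than total scores.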

Factor analysis of the Piers-Harris basically confirmed the original factor structure
(Alexopoulos & Foudoulaki, 2002). Lower subscale reliabilities mean interpretation
of profile strengths and weaknesses should be undertaken with caution
(Cooley & Ayres, 1988; Erford, 2006). The scale's question-and-response format has
been criticized by Strein (1995) because a Yes and No response format does not allow
a child to indicate the degree of agreement or disagreement. Marsh and Holmes
(1990) noticed many children struggling to respond accurately to questions that
were scored in the negative (e.g., "My family is disappointed in me"), thus throwing
into question the validity of some scores.

The Piers-Harris-2 is cost-effective, time-efficient, and easy to use and yields re- 
liable and valid scores in the measurement of children's self-concept (Erford, 2006). 
Jeske (1985, p. 1170) indicated the original Piers-Harris "appears to be the best
children's self-concept measure currently available." This has not changed in the interim,
as verified by Kelley (2005). 



Coopersmith Self-Esteem Inventories 



The Coopersmith Self-Esteem Inventories (Coopersmith, 1981) are individual- or 
group-administered questionnaires used to determine personal valuation of self 
(Peterson, 1985). The two forms (School Form and Adult Form) were developed 
based on the assumption that self-esteem is associated with effective functioning 
(Sewell, 1985). The School Form is a 58-item form used with students ages 8-15 
years. Built into the form is a Lie scale, which consists of eight questions that are 
scored separately from the self-esteem inventory. The Lie scale is used to determine 
defensiveness in the client's responses (Coopersmith, 1989). There is also a School 
Short Form that consists of 25 questions, on which the Adult Form is based. The 
Adult Form is used for clients over 15 years of age. The standardization sample
information is not adequate, but several researchers have collected supplemental
samples since the original inventory was standardized (Coopersmith, 1989). The
reliability information indicates internal consistency coefficients ranged from 0.87 to
0.92 for 4th- through 8th-graders for the total score (Sewell, 1985). Validity was re- 
ported as being sufficient, but conclusive evidence was not presented, and very little 
reliability or validity information is presented for the Adult Form. The Adult norm 
sample was composed of 226 college students from northern California, and the re- 
liability scores ranged from 0.78 to 0.85, but no further information was provided 
(Coopersmith, 1989). While internal consistency estimates appear to indicate the 
two forms may have some value as screening-level tests, the difficulty in defining and 
measuring the concept of self-esteem remains problematic. For example, according 
to the manual, there are no clearly defined criteria for determining low, medium, or 
high levels of self-esteem, although higher scores are indicative of higher self-esteem. 
The manual has a section for building self-esteem in students and provides some sug- 
gestions and techniques. Researchers are divided about whether to recommend the 
use of the inventory, but it is one of the most widely used measures of its kind 
(Peterson, 1985; Sewell, 1985). 



Tennessee Self-Concept Scale-Second Edition (TSCS-2) 



The Tennessee Self-Concept Scale-Second Edition (TSCS-2) (Fitts & Warren, 1996)
is one of the most commonly used self-report measures of self-concept and can be
used for children and adults. The test was standardized on 3,000 subjects, ages 7-90
years, and can be administered to individuals or groups in about 10 to 20 minutes.
The Adult Form is designed for clients ages 13 years or older and has 82 items. The
Child Form is designed for students ages 7-14 years and has 76 items. A Short Form
consisting of the first 20 items of either form can be used as well. Items comprise 15
subscales and a total Self-Concept score (see Table 8.5). The items are rated on a
5-point Likert scale ranging from Always False to Always True. The TSCS-2 can be
hand-scored in approximately 10 minutes, or computer-scored (Western
Psychological Services, 2003c). Reliability is adequate, with lower internal
consistencies on subscales than Total Self-Concept, ranging from r = 0.73 to r = 0.93.
Test-retest reliability scores ranged from r = 0.47 to r = 0.83 (Brown, 1998). Fitts and
Warren (1996) reported acceptable levels of score validity for the TSCS-2.

Table 8.5 Scales on the Tennessee Self-Concept Scale-Second Edition

Self-concept scores     Physical, Moral, Personal, Family, Social, Academic/Work
Supplementary scores    Identity, Satisfaction, Behavior
Summary scores          Total self-concept, Conflict
Validity scores         Inconsistent Responding, Self-criticism, Faking good,
                        Response distribution



Think About It 8.2 Using the self-concept scales from Table 8.5, discuss 
with an acquaintance his or her levels of self-concept in each category. 
Notice whether there is consistency among the categories. What causes these 
consistencies or inconsistencies? 



PROJECTIVE APPROACHES TO ASSESSMENT 



In contrast to structured assessments of personality, which limit possible client re- 
sponses, projective assessments present clients with unstructured, ambiguous stim- 
uli and allow a virtually unlimited range of potential responses. Personality assess- 
ment using projective techniques is based on the projective hypothesis, the 
assumption that essential information about a client's personality characteristics, 
needs, conflicts, and motivations will be transferred onto ambiguous stimuli. 
Projective techniques are disguised and vague by design and provide clients only 
minimal instructions in order to reduce external structure and force clients to im- 
pose structure according to internal (intrapersonal) characteristics. 




Projective personality assessment is based on the psychoanalytic notion of the 
unconscious, that portion of one's personality that is beyond awareness and control. 
According to Freud (1961, 1923, 1924), valuable understanding of one's true nature 
is obtained from the dark recesses of one's unconscious emotional and thought 
processes, not what is present or spoken from one's conscious mind. Freud also be- 
lieved in the prominence of drive and instinct, which lead one to gratify needs while 
reducing tension over unfulfilled needs. Freud's concept of psychic determinism — that 
every action undertaken is done so for a reason or particular purpose — is also a key 
to understanding personality. Altogether, then, Freud's psychoanalytic theory pro- 
poses that when a client is presented with ambiguous stimuli and asked to respond 
to the stimuli in some way (and there is not necessarily a right or wrong way to re- 
spond), the client cannot help but exhibit actions and responses driven by uncon- 
scious processes that reveal internal emotional or thought processes, representing 
needs and desires requiring expression and gratification. Therefore, the key is to de- 
velop techniques that will help clinicians gain access to a client's unconscious, allow- 
ing inferences to be made about the client's personality and personal adjustment. 
Such techniques are called projective techniques. 

If a professional counselor places a client in an unstructured, ambiguous cir- 
cumstance, the client will attempt to bring order and meaning to chaos. And how 
the client brings structure to the disorder yields valuable insights into the client's 
unconscious processes and serves as an indirect glimpse into the client's inner 
world. There are many projective techniques available for use by professional coun- 
selors, depending upon education, licensure, and professional training and expe- 
rience. These techniques vary in degree of standardization, with some having rather 
specific directions for administration and scoring. Often the interpretation of these 
techniques is less standardized, leading to subjective judgments based upon the 
professional counselor's theoretical orientation and clinical experience. Projective 
techniques are classified according to the nature of the ambiguous task and how 
clients are required to respond. The following five types of projective techniques 
represent a comprehensive categorization: (1) association techniques; (2) picture- 
story construction techniques; (3) verbal completion techniques; (4) choice 
arrangement techniques; and (5) production-expression techniques. 

The Rorschach Inkblot Test is an example of an association technique and is quite
possibly the best-known projective test ever developed. The Rorschach is reviewed in
greater detail below, but Figure 8.1 presents a sample inkblot of the type included on
the Rorschach. Proponents of association techniques propose that such procedures 
reveal details of the unconscious realm, similar to the way x-rays reveal the inner 
realm of the body. Clients project their inner organization onto the inkblot, and ex- 
aminers interpret these attempts to organize the vague stimuli. A second example of 
an association task is word association. For this task, examiners present a list of neu- 
tral (e.g., wood, spoon) or emotionally laden (e.g., father, sex) words one at a time, 
and the client responds with the first idea, image, or word that comes to mind. 
Examiners generally record the response; the amount of time required to respond 
(i.e., latency effects, with lengthier time periods supposedly revealing the degree of 
inner conflict/turmoil); and expressions of emotion while responding (e.g., anger, 
embarrassment). Responses to association technique stimuli are usually compared
with responses of nonclinical individuals to determine whether responses are "normal"
or pathological. Interpretation of themes and content categorizations is then
conducted to reveal insights into personality functioning, inner needs, and conflicts.

Figure 8.1 Sample inkblot

Picture-story construction techniques usually involve showing a client a picture
or other visual stimulus and requiring the client to construct a story about the
picture. The stimulus pictures vary in terms of scenery, people, and social situations. 
The most commonly used construction technique is the Thematic Apperception Test 
(TAT). A sample picture stimulus similar to a TAT card is presented in Figure 8.2. 
The Children's Apperception Test (CAT) and Roberts Apperception Test for Children
(RATC) are examples of picture-story construction techniques commonly used with 
children and adolescents. For Hispanic clients, another example would be the Tell 
Me a Story (TEMAS). Each of these tests is reviewed in greater detail below. The 
common strand through picture-story construction techniques is that the client is 
shown a stimulus picture and then asked to tell a good story about the picture. The 
story should describe what led up to the depicted scene, what is currently happen- 
ing, and what the likely outcome of the story will be. While some of the pictures 
may "pull" for different content and emotion, most are neutral and simply reflect 
the unconscious process of the client. In other words, the client is given no reason to 
tell a particular story about a given card in a particular manner. The assumption is 
that the story the client tells, and the manner in which the client tells it, reflect some 
inner need that surfaces in response to that given stimulus picture. In this way,
clients convey inner thoughts and emotions and provide the content for clinicians to
interpret and contextualize.

Figure 8.2 Sample picture-story card

Verbal completion techniques consist of verbal content presented in an incom- 
plete format, requiring the client to complete the stimulus. Sentence and story com- 
pletion tasks are among the more commonly used completion techniques. For ex- 
ample, a client may be presented with a sentence stem (e.g., "I think . . ." or "Other 
people treat me like . . .") and be asked to complete the stem. As with any projective 
technique, the client is given no reason to provide any specific response. The as- 
sumption is that some internal need, emotion, or thought is being expressed in the 
face of a vague, ambiguous stimulus (e.g., "I think dogs are cute" versus "I think men 
are horrid creatures," or "Other people treat me like a princess" versus "Other peo- 
ple treat me like I am invisible"). The Forer Structured Sentence Completion Test is a 
good example of this type of projective assessment and is reviewed in greater detail 
below. A story completion test presents the client with the start of a story and requires 
the client to finish the story. For example, the professional counselor may begin by 
saying, "A woman leans over to kiss a man on the cheek. The man suddenly pulls 
away and looks angry. Why?" The content of client responses is recorded verbatim 
and thematically analyzed. An example of a story completion task that also uses pictures
is the Rosenzweig Picture-Frustration Study (Rosenzweig, 1949), in which 24
cartoons depicting a potentially frustrating situation are presented to a client. Each 
cartoon has a situation written in one of the "thought bubbles," and the other bub- 
ble is blank. The client indicates a verbal reaction (orally or in writing) to each stim- 
ulus. Responses are scored in one of three ways: (1) evasion of frustration; (2) frus- 
tration directed at other people or objects; or (3) frustration directed at self. 






Think About It 8.3 Construct 5 to 10 incomplete sentences and "administer"
them to several associates. What themes or patterns emerged? Do
statements phrased in certain ways lead to more predictable results?
How could you use projective techniques in your practice as a professional
counselor?



Choice arrangement techniques make up a diverse category, the commonality 
being that clients are given several to numerous options to rank-order or select from. 
Young children are often given the choice of which toys or dolls to play with in therapy.
Again, the child is given no reason to choose any given puppet, doll, or other toy,
or to play with or tell stories using it in the particular manner he or she does. It is as- 
sumed that the child's selection and ensuing actions and verbalizations are the expres- 
sion of some inner motivation. Alternative choice arrangement projective techniques 
include arranging pictures or words along a like-dislike continuum or a
multiple-choice response format designed for a Rorschach-like inkblot test. Of course, when an
examiner uses a choice arrangement format, the examinee's potential range of choices 
becomes restricted. In some ways, this defeats the purpose of a projective technique, 
which is to allow clients maximum leeway to respond from the unconscious. 
Importantly, research supporting the use of choice techniques for assessment is very 
sparse when compared with that available for other types of projective assessment. 

Production-expression techniques require clients to actively participate in the 
assessment by creating some product that can be analyzed and interpreted to reveal 
facets of the client's personality. Commonly used techniques include drawings (e.g., 
House-Tree-Person, Human Figure Drawing, Kinetic Family Drawing, Kinetic School 
Drawing), painting or coloring, or a dramatic performance (e.g., psychodrama). 
Drawing techniques are by far the most commonly used assessment devices from 
this category. Importantly, how clients act and respond to verbal queries while engag- 
ing in this task is just as important as any characteristics of the final product, and 
professional counselors using these techniques are strongly encouraged to observe, 
and ask follow-up questions of, clients creating expression products. When using a 
drawing technique, such as the Human Figure Drawing, clients are usually given a 
blank sheet of paper and pencil (or pens, colored pencils, crayons, etc.) and asked to 
draw a picture of a person. Interpretation of these drawings varies widely, depending 
on the professional counselor's theoretical orientation, training, and focus. 

Some test manuals and textbooks offer specific guidance for interpreting draw- 
ing characteristics, or even specific objects within a drawing. For example, aggres- 
sion may be indicated by heavy, dark lines; low self-esteem may be indicated by a 
small drawing. Handler (1996) suggested that particular attention be paid to era- 
sures, placement of the figure on the paper, too much or too little detail, shading and 
heavy or pressured lines, among other things. Of critical importance is that examin- 
ers not give too much emphasis to any one sign. Also, the professional counselor 
should never rely solely on the drawn product for interpretive insights. It is excellent 
practice to query the client about a drawing in order to understand what the drawing
might represent to the client. The best use of drawing characteristics and behaviors
is for generating hypotheses to be tested out using more structured and systematic
methods. Figures 8.3 through 8.5 display examples of various projective drawing
techniques.








Figure 8.3 House-Tree-Person drawings by a self-conscious, perfectionistic
teenage girl








Figure 8.4 Kinetic Family Drawing by a 12-year-old boy with a fine-motor
Coordination Disorder and AD/HD-Predominantly Inattentive Type




Figure 8.5 Kinetic School Drawing by a 12-year-old boy with a fine-motor 
Coordination Disorder and AD/HD-Predominantly Inattentive Type 






Strengths and Weaknesses of Projective Techniques 



Projective techniques have a number of noteworthy positive points and have 
remained popular over the past half century (Bellak, 1992; Piotrowski & Zalewski, 
1993; Watkins, 1991). Some clinicians believe that projective techniques are great 
icebreakers and rapport builders when beginning an evaluation or counseling 
relationship with children or adolescents, because these techniques are generally 
perceived as nonthreatening, and clients need not worry about whether a particular 
answer was right or wrong. Clients generally are not limited in the number or type 
of responses they can make. This allows the unconscious processes maximum leeway 
in projecting inner needs and motivations onto the stimulus. Also, because clients 
are not generally familiar with the scoring and interpretive strategies of projective 
techniques, many clinicians believe responses to projective tests are more difficult to 
fake than for structured tests, although this is not necessarily the case (Masling, 
1960). 

Projective techniques may have valuable cross-cultural applications, especially 
when the stimulus involves inkblots, drawings, or brief verbal stems. Most projec- 
tives require no or very little reading ability, so they may be helpful in the assessment 
of young clients and clients with poor literacy skills. Likewise, because some projec- 
tive techniques require a minimum of verbal input and output, they may be helpful 
techniques for use with young clients, clients from diverse cultures, or clients with 
speech and language disorders. Finally, because projective techniques are based on 
psychoanalytic theory, complex, multidimensional themes may emerge and provide 
valuable insights into the client's personality. 

Projective assessment techniques also have numerous limitations. Projective 
techniques must be administered individually by highly educated and trained indi- 
viduals and therefore are expensive to administer, score, and interpret. Subjective 
scoring and interpretive procedures make results difficult to replicate. Interpretation 
is often the most subjective part of the process. Indeed, many projective devices ap- 
pear to allow wide-ranging judgments on the part of the examiner when scoring and 
interpreting a client's results. 

Subjectivity in scoring and interpretation inevitably leads to concerns over reli- 
ability and validity of scores. Indeed, projective techniques display poor psychomet- 
rics. Scorer reliability, test-retest, and internal consistency coefficients tend to be un- 
acceptably low. As stated earlier, low reliability leads to low score validity, and the 
research on projective score validity is, at best, inconclusive (Anastasi & Urbina, 
1997). 

Most projective tests have either absent or inadequate norms. When norms are 
provided, the samples are often described in vague terms. In addition, often the com- 
parison groups are not normal samples, but clinical populations, negating a valuable 
comparison group for the determination of potential pathology; that is, if a client's 
responses are compared with clinical patients and not "normal" individuals, how can 
a clinician decide whether the client's responses are normal? Still, projective tech- 
niques help to "flesh out" our understanding of clients in an open-ended manner 
that is often missing in objective personality inventories. 






Projective techniques have been shown to be susceptible to outside influences, 
such as examiner characteristics, examiner bias (i.e., theoretical orientation), or vari- 
ations in administration directions. In addition, the validity of the "projective hy- 
pothesis" itself has been called into question because responses may reflect state-de- 
pendent characteristics rather than enduring personality characteristics. This is a 
critical point, because the whole idea behind projective assessment is to access the 
unconscious in order to understand the client's psychic determinism. If the client's 
"present state of mind" is being measured rather than some enduring personality 
structure, the goal of accessing the unconscious processes of the client's personality 
is thwarted. The final limitation involves the difficulty (or impossibility) of actually 
scientifically studying Freud's psychoanalytic developmental theory, given its empha- 
sis on unconscious psychological processes. As psychoanalytic theory forms the basis 
of projective testing, this limitation is quite significant. 

As a final comment on projective techniques, Anastasi and Urbina (1997) sug- 
gested projective techniques are better used as clinical tools rather than as tests per 
se. Given the low standard of psychometric rigor, such a guarded approach is war- 
ranted. Projectives are quite helpful when used for hypothesis generation and for 
helping clients gain insight into unconscious needs and motivations, as well as aids 
for qualitative interviewing, but their technical limitations militate against use for
diagnostic purposes. 



SOME COMMONLY USED PROJECTIVE TECHNIQUES

Rorschach Inkblot Test



The Rorschach Inkblot Test (Rorschach, 1921/1998), originally developed by Hermann 
Rorschach in 1921, is the best-known and most used projective test. The test's pur- 
pose is to assess how a client perceives and organizes thoughts about the world. The 
test is a Level C instrument and is individually administered to clients ages 5 and 
older, in about 20 to 30 minutes (Hess, Zachar, & Kramer, 2001). It consists of 10 
plates of bilaterally symmetrical inkblots (Janda, 1998): 5 are black and white; 2 are 
black, white, and red; and the remaining 3 are comprised of pastel colors (Hess et al., 
2001). Clients are presented with the cards and asked what they think of the inkblot 
or what it might be. In the second part of administration, clients are asked to explain 
their original answers. Scoring and interpretation are frequently completed using a 
scoring system originally developed by John Exner in the 1970s called the 
Comprehensive System for Administering, Scoring, and Interpreting the Rorschach (Exner, 
2002). Exner's multifaceted system involves interpretation of three aspects of responses:
Location (W for the entire blot, D for a major portion of the blot, and Dd for
uncommon responses); Determinants (there are nearly two dozen having to do with
shape, activity of humans, chromatic features, etc.); and Content (there are 26
categories used to interpret the content of the story). A Structural Summary is composed
based on an interpretive rating scale developed by Exner (Janda, 1998). 

As with many projective tests, it is often hard to find concrete empirical data on 
the Rorschach. Subjectivity is a large part of interpretation, and there can be considerable
diversity in administration procedures depending on testing purpose and clinician
training. It has been noted that well-trained users of Exner's scoring system agree on 
the major variables over 88% of the time (Hess et al., 2001). Still, there is substan- 
tial debate over the interrater reliability of Exner's system. Exner purports that test- 
retest reliability estimates are at or above r = 0.70 at both 1-year and 3-year intervals. 
According to Hess et al. (2001), validity data of the Rorschach also yield many ques- 
tions and concerns. Various questions of subjectivity arise based on administration, 
scoring, and interpretation procedures. Still, even with the lack of standardization
and empirical data, the Rorschach used in conjunction with Exner's Comprehensive
System (2002) is a better personality test than most opponents will acknowledge
(Hess et al., 2001). Critics of the Rorschach point out that statistical prediction is usually
more accurate than clinical prediction (i.e., judgment), and the Rorschach relies pri- 
marily on clinical prediction to measure personality. Far more psychometric research 
needs to be done using the Rorschach, but it has the potential to generate meaning- 
ful personality data (Hess et al., 2001). 



Thematic Apperception Test (TAT) 



The Thematic Apperception Test (TAT) (Murray & Bellak, 1973) is used to measure 
various aspects of a client's personality. Clients are presented with 31 picture cards
and are asked to create stories based on the images. There is no time limit for this as- 
sessment, and it can be administered to children and adult clients. Specific scoring 
criteria are provided in the scoring protocol and assessment booklet. Many administrators
choose 8 to 12 cards to use with a client. Six elements are considered when
examining stories: (1) the hero; (2) the needs or motives and feelings; (3) presses or 
environmental forces; (4) outcomes; (5) recurring themes in the story; and (6) inter- 
ests and sentiments (Janda, 1998). 

According to Janda (1998), although several clinicians have determined new 
scoring criteria for the TAT, most adhere to Murray's original scoring format. Janda 
reported that this method can often be unstructured and biased, leading to inade- 
quate score reliability and validity. 



Children's Apperception Test-1991 Revision (CAT) 



The CAT (Bellak & Bellak, 1992) assesses personality by interpreting story responses
to presented picture stimuli. The CAT is administered to children ages 3-10 years in
about 15 to 20 minutes. The child is presented with stimulus cards that show animals
engaged in human relationship-oriented interactions. The client then gives perceptions,
interpretations, and responses, and must solve developmental problems
(Knoff, 1998). The 10 stimulus cards address the following: feeding problems; oral 
problems; sibling rivalry; attitudes toward parents; relationships to parents as (sexual) 
couples; jealousy toward same-gender parent figures; fantasies about aggression; ac- 
ceptance by the adult world; fear and loneliness at night; and toileting behavior and 
parents' responses to it. There are 10 variables that are used to analyze responses: 
Main Theme; Main Hero; Main Needs and Drives of the Hero; the child's
Conception of the Environment; how the child sees and reacts to the figures in the
cards; Significant Conflicts described; the Nature of the Child's Main Anxieties; the 
Child's Main Defenses; the Adequacy of the Superego as Manifested by Punishment 
for Crime; and the Integration of the Child's Ego (Knoff, 1998). The assessment 
comes with 10 additional cards that can supplement the CAT (Reinehr, 1998).
Specific scoring and interpretive instructions are included in the interpretive manual 
(Knoff, 1998). 

The authors state that there is no need for standardization or empirical data for 
a projective test like the CAT, and few specifics are provided in the manual (Bellak 
& Bellak, 1992). Due to the lack of statistical data, clinicians should be careful not 
to base any clinical diagnosis or intervention on this assessment (Knoff, 1998). 
Reinehr (1998) agreed that the argument that projective assessments need no empirical
data has no basis.



Roberts Apperception Test for Children-Second Edition 
(Roberts-2) 



The Roberts-2 (McArthur & Roberts, 1994) is a projective test designed to measure 
children's social perceptions. The test can be administered to children ages 6-15 
years in about 20 to 30 minutes. The child is presented with 16 different test pictures 
and is asked to tell a story about each one. Scoring criteria for each picture are pre- 
sented in the manual and based on the presence or absence of certain characteristics 
in the narrative. The three scales measured are Adaptive, Clinical, and Clinical 
Indicators (Cosden, 2001). There are seven main constructs on which scoring crite- 
ria are based, and each has several subconstructs. The seven main constructs are: 
theme overview, problem identification, outcome, available resources, emotion, res- 
olution, and unusual or atypical responses. According to the test's publisher, new 
standardization studies were conducted and conformed to U.S. population demo- 
graphics in terms of gender, ethnicity, and parental education, although specific in- 
formation about the sample is not provided (Cosden, 2001), so generalizability is 
questionable. 

Although minimal information is available online for the Roberts-2, the manual 
contends that validity for derived test scores is adequate. However, Waller (2001) as- 
serted the original version of the test relies too heavily on doctoral dissertations and 
findings are not published in refereed journals, making it difficult to evaluate score 
validity. A new version of the Roberts-2 (McArthur & Roberts, 2005) became avail- 
able in 2005. 



House-Tree-Person (H-T-P) Projective Drawing Technique 



The House-Tree-Person (H-T-P) Projective Drawing Technique (Buck, 1964) is a
widely used projective test that is easy to use and time-efficient (Western
Psychological Services, 2003b). It can be used for clients ages 3 and older (see Figure
8.6). The client draws three objects (a house, a tree, and a person) and then describes,
defines, and interprets the drawings. House-Tree-Person is often used as the
first test in an assessment for a counseling session, because drawing tends to reduce
tension. It is useful for assessing personality in people from different cultures, those
deprived of educational opportunities, and those developmentally delayed or
non-English-speaking; in addition, it is highly sensitive to the early presence of
psychopathology (Western Psychological Services, 2003b). Examiners must always be
careful to validate observations from projective techniques through other assessment
methods and not to overinterpret meanings of specific objects or designs drawn in a
picture.

Figure 8.6 House-Tree-Person drawings by a teenager with AD/HD and fine-motor
coordination difficulties



Kinetic Drawing System for Family and School (KDS) 



The Kinetic Drawing System for Family and School (KDS) (Knoff & Prout, 1985) is 
designed to individually assess the frequency of a child's difficulties in the home and 
school settings. The format allows the examiner to understand the overlap of behav- 
iors and attitudes in both settings as well as to assess the source of certain attitudes 
and behaviors. The KDS can be administered to clients ages 5-20 years. Clients are 
asked to draw separate pictures of both family and school situations. Examiners are 
asked to stress that each person in the picture should be doing something. There is 
no time limit for this task but most complete the task in 20 to 40 minutes. Pictures 
are assessed based on five categories: (1) actions of and between figures; (2) figure 
characteristics; (3) position and distance of figures, and barriers between them; (4) 
style; and (5) symbols (see Figures 8.7 and 8.8). 

In a review of the manual, Cundick (1989) concluded reliability and validity 
data are inadequate and that the studies provided are not related to the test protocol. 
Weinberg (1989) stated that if administrators are well trained and scoring criteria 
are clearly defined, good interrater reliability coefficients can be attained; however, 
test-retest reliability coefficients are low. Weinberg concluded that although this test 
is a wonderful icebreaker and rapport-building tool, one cannot recommend this as 
an interpretive assessment yielding reliable and valid scores. 



Forer Structured Sentence Completion Test (FSSCT) 



The Forer Structured Sentence Completion Test (FSSCT) (Forer, 1967) is a 100-item 
test used to determine a client's attitudes and views of the world by finding out information
about a client's relationships and dynamics, and the client's use of evasiveness,
individual differences, and defense mechanisms (Western Psychological
Services, 2003a). Separate forms are available for men, women, adolescent girls, and 
adolescent boys. Administration of the test takes about 15 to 20 minutes and re- 
quires a Level B qualification. A Checklist and Clinical Evaluation Form provides 
evaluation tools that help the examiner to group clients into one of four categories: 
(1) Interpersonal Figures; (2) Wishes; (3) Causes of Own (feelings and behaviors); 
and (4) Reactions (to others) (Benet, 2005). Reliability, validity, and normative information
are not given in the manual. Example prompts might include "My father
makes me feel ___"; "I like to talk to my friends about ___"; "Others often
think that I ___."







Figure 8.7 Kinetic Family Drawing by a nonclinical teenage girl 





Figure 8.8 Kinetic School Drawing by a nonclinical teenage girl 






SUMMARY/CONCLUSION 



This chapter has provided an introduction to the information that professional 
counselors need to engage in personality assessment. Both objective and projective 
personality assessment were addressed. Objective methods typically involve trait approaches,
and the five-factor model of Costa and McCrae currently enjoys popularity
among personality researchers. Numerous structured personality inventories are 
available for use by professional counselors, including the NEO PI-R, CPI, PAI, and 
MBTI. 

Projective assessments present clients with ambiguous stimuli, and professional 
counselors observe and assess how clients construct meaning and respond to these 
stimuli. Projective techniques generally yield lower score reliability and validity than 
objective personality measures. Projective techniques can be classified as association, 
picture-story, verbal completion, choice arrangement, and production-expression 
techniques. 



KEY TERMS 



association technique 

choice arrangement techniques 

drawing technique 

personality 

personality assessment 

picture-story construction techniques 



production-expression techniques 

projective assessment 

projective hypothesis 

traits 

verbal completion techniques 




CHAPTER 9

Behavioral Assessment



by Carl J. Sheperis, R. Anthony Doggett, Masanori Ota, 
Bradley T. Erford, and Carol Salisbury 



This chapter provides a general understanding of behavioral assessment procedures
for professional counselors. More specifically, the chapter provides a general
definition of behavioral assessment as well as specific guidelines for conducting
behavioral assessment; details the two kinds of behavioral assessment (direct
behavioral assessment and indirect behavioral assessment) and common techniques
used within these two assessment categories; and gives a brief overview of the
most commonly used behavioral assessment instruments.



WHAT IS BEHAVIORAL ASSESSMENT? 



When children talk out loud during a class or see others become aggressive and rush 
to fight, professional counselors may raise the following questions: Why does this 
behavior occur? How can the behavior be changed? Behavioral assessment is a use- 
ful methodology to clearly answer these questions. 

Behavioral assessment is generally defined as "the identification of meaningful 
response units and their controlling variables for the purposes of understanding and 
of altering behavior" (Nelson, 1985, p. 45). Because a behavior occurs through an 
interaction between an individual and the person's environment, professional coun- 
selors use behavioral assessment to evaluate a particular behavior and the context in 
which it occurs (e.g., stimuli or events affecting the behavior). Behavioral assessment, 
along with other traditional assessment approaches (e.g., intelligence tests, personal- 
ity tests), is widely used in various applied settings, such as schools, counseling cen- 
ters, and other clinical venues. 






Defining Behavior 



From a behavioral standpoint, all behaviors are seen as a direct result of external and 
environmental stimuli. Although behaviors can be indicators of internal difficulties, 
the professional counselor cannot readily measure or see those internal struggles. 
Thus a key concept in behavioral assessment is that the target behavior (i.e., the be- 
havior the client is trying to change) must be directly observable. For example, mil- 
lions of people struggle to lose weight each year, and new diets emerge on the best- 
seller list all the time. While the professional counselor may personally know what it 
is like to have an internal battle over whether to eat a certain dessert, it would be 
hard for a bystander to see or measure that internal struggle in a client. However, 
through behavioral assessment, the professional counselor can identify a certain behavior
that the client is trying to change (e.g., snacking on high-fat foods), measure
the number of times that the client snacks, the amount of food that is consumed, 
and the amount of weight that is gained or lost. The professional counselor can then 
develop an intervention that is clearly tied to the target behavior and accurately 
measure changes in the behavior. 

To obtain a clear picture of what the professional counselor and client are try- 
ing to accomplish, an operational definition of a target behavior is addressed at the 
beginning of behavioral assessment, using observable and measurable terms. A well- 
developed operational definition contains an objective, concrete, and quantitative 
description, with which anyone can clearly identify the observed behavior. In other 
words, an operational definition must pass the "stranger test" — that is, any behavior 
that one defines should be clearly understandable to a stranger. That stranger should 
be able to pick up the definition and be able to observe someone without difficulty. 
For example, it is not observable or measurable to state, "Sam continually snacks on 
inappropriate foods," because it is not clear what "inappropriate" and "continually" 
specifically mean in this situation. However, it is much clearer if an inappropriate 
food is defined as "any food item containing more than 10 grams of carbohydrates," 
or "any food item containing more than 5 grams of fat." A good operational defini- 
tion must also pass the "dead man test" — that is, the target behavior should not be 
something that only a dead man could do. If a professional counselor developed an 
intervention plan with the goal that Sam would not eat, that counselor would prob- 
ably lose his or her license or be sued. It is impossible to ask someone not to eat. The 
person would have to be dead to follow this guideline. In short, behavioral goals and 
objectives should be MOP&D: measurable, observable, positive, and doable. Thus 
an operational definition is crucial to minimize inferences during observation 
(Sattler, 2002). To obtain reliable and valid data, it is important to maintain the same 
operational definition throughout the assessment process. 
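
The snack example above can be sketched in code to show what "observable and measurable" means in practice. This is only an illustration: the two gram thresholds come from the text, but the text offers them as alternative definitions, so combining them with "or" is an assumption, and the names FoodItem, is_target_snack, and count_target_snacks, along with the sample foods and nutrient values, are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the text. The text presents them as alternative
# definitions; treating either one as sufficient is an assumption here.
CARB_LIMIT_G = 10
FAT_LIMIT_G = 5


@dataclass
class FoodItem:
    name: str
    carbs_g: float
    fat_g: float


def is_target_snack(item: FoodItem) -> bool:
    """Return True when the item meets the operational definition."""
    return item.carbs_g > CARB_LIMIT_G or item.fat_g > FAT_LIMIT_G


def count_target_snacks(log: list[FoodItem]) -> int:
    """Frequency count: how many logged items meet the definition."""
    return sum(is_target_snack(item) for item in log)


# Hypothetical one-day snack log with made-up nutrient values.
snack_log = [
    FoodItem("celery sticks", carbs_g=3, fat_g=0),
    FoodItem("candy bar", carbs_g=35, fat_g=12),
    FoodItem("cheese cubes", carbs_g=1, fat_g=9),
]
print(count_target_snacks(snack_log))  # 2
```

Because the definition is a fixed rule rather than a judgment call, any observer (the "stranger") applying it to the same log should arrive at the same count.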



Think About It 9.1 What behavior in your life would you like to 
change? How could this behavior be operationally defined? Using this defini- 
tion, what new behavioral goal could you set? 






Guidelines for Conducting Behavioral Assessment 



It should be noted that the professional counselor does not target personality traits 
or psychopathology through behavioral assessment, because these things cannot 
readily change through intervention. For example, a professional counselor can 
change the frequency with which a child displays tantrums (behavior) but cannot change
autism (a disorder), which some people might think causes the tantrums. Thus, 
through behavioral assessment, the professional counselor focuses on the function 
of particular behaviors that are within the client's voluntary control rather than a 
diagnosis. 

Behaviors often stem from interactions between an individual and the individ- 
ual's environment. Thus, instead of examining a behavior in isolation, the profes- 
sional counselor must consider environmental variables affecting the behavior (e.g., 
place, people, time, stimulus). Antecedents and consequences (events preceding and 
following a behavior, respectively) and the characteristics of behavior (e.g., function, 
magnitude, frequency, rate, duration, latency) are often measured in behavioral as- 
sessment. For example, a great deal of attention has been focused on school violence 
in recent years. On April 20, 1999, Eric Harris and Dylan Klebold killed a teacher
and 12 students, wounded 23 other people, and then killed themselves at
Columbine High School in Littleton, Colorado. While it is clear that both students 
were disturbed, it is important to understand the environmental variables and an- 
tecedents leading to this tragedy. Harris kept a journal that helps us to understand 
the environment's influence on his behavior. According to USA Today's website
(Killer's diary reveals plans, 2001), Harris's journal paints a picture of an isolated
teen who was angry about being rejected. In his journal, Harris wrote, "I hate you 
people for leaving me out of so many fun things. . . . You people had my phone #, 
and I asked and all, but no no no no no don't let the weird looking Eric kid come 
along." Because one can now look at some of the ways that rejection and isolation 
affected Harris's behavior, schools across the country have implemented both pre- 
ventive (e.g., peer counseling) and response measures (e.g., school safety plans). If 
we only look at Harris and Klebold as disturbed teens and ignore the environmen- 
tal factors leading to the tragedy, we would be unable to prevent future crises of this 
nature. 

In conducting behavioral assessment, it is also important to know that every be- 
havior has its own purpose or function. When behavioral assessment is used to iden- 
tify a function, it is called functional behavioral assessment (FBA). In accordance 
with the Individuals with Disabilities Education Act Amendments of 1997, FBAs 
and behavior plans are specifically required in schools for children who have a spe- 
cial education ruling and are subject to disciplinary action. 

Applied behavior analysis researchers have identified four main variables that
may maintain or reinforce the performance of target behaviors: (a) attention, (b) tan- 
gible, (c) escape, and (d) sensory stimulation (Alberto & Troutman, 2003; Iwata et 
al., 1994). It should be noted that even if the topographies (i.e., what a behavior 
looks like) of two behaviors are the same, the functions of the two behaviors might 
be different. For example, when a child screams more after a teacher says, "Be quiet
and look at me," the function may be attention from a teacher. However, escape may
be the function if a child often screams when difficult academic tasks are given dur- 
ing a class. Also, one behavior may have more than one function (e.g., the functions 
of the child's screaming may be both teacher attention and escape from difficult 
tasks). Thus, once a function is hypothesized in functional behavioral assessment, it 
should be experimentally verified through functional analysis using a single-subject 
research design. Functional analysis is an experimental manipulation of environmen- 
tal variables (e.g., antecedents, consequences) to establish a functional relationship 
between a behavior and environmental variables. Discussion of functional analysis
and single-subject design is beyond the scope of this chapter, so interested readers
are referred to Alberto and Troutman (2003) and Miltenberger (2004). 



METHODS OF BEHAVIORAL ASSESSMENT 



Behavioral assessment is divided into two categories: direct assessment and indirect
assessment. In direct assessment, the professional counselor assesses events occurring
here and now through direct observation and client self-monitoring. In indirect
assessment, the professional counselor assesses past events using behavioral interviews,
and self-report and informant-report behavioral checklists and rating scales.

Direct Assessment

Through direct observation, a professional counselor observes a client's behavior in 
a natural setting and records it using a recording sheet. For example, a professional 
school counselor may observe a child to assess how many times the child leaves the 
seat during a class or talks to friends during a physical education period or recess on 
the playground. Behaviors are often recorded using the following four methods: (1) 
narrative recording, (2) interval recording, (3) event recording, and (4) ratings 
recording (Sattler, 2002). This discussion is limited to the two most prominent 
methods: narrative and interval recording. 

Narrative recording 

In narrative recording (see Table 9.1), the professional counselor records what is ob- 
served anecdotally. The professional counselor may observe not only a behavior, but 
also antecedents and consequences. Such observation, called ABC narrative recording 
(for antecedent, behavior, and consequence), is used to identify relationships be- 
tween a behavior and environmental variables (Bijou, Peterson, & Ault, 1968). It 
can be useful to add an additional category to narrative recording: function. While 
it is important to know the antecedents, behaviors, and consequences, it is equally 
important to determine the functions of a behavior. 

Table 9.1 ABC narrative observation format

A   A wife asks her husband to help with the household chores.
B   Husband pouts (i.e., speaks in short sentences, complains about the task, moves
    slowly during the task).
C   Wife tells husband, "Forget it. I'll just do the chores."
F   Husband sought to escape task.

Interval recording

There are three primary methods of interval recording: (1) whole-interval recording;
(2) partial-interval recording; and (3) momentary time sampling. In each interval
recording method, the recording time is equally divided into intervals (e.g.,
10-second intervals), and an observer records whether a behavior occurs during each
interval. Specifically, in whole-interval recording, an observer marks each interval on
a recording sheet whenever a behavior occurs throughout the interval, whereas in
partial-interval recording, an observer marks each interval whenever a behavior occurs
at least once anytime in the interval. In momentary time sampling, an observer
marks each interval whenever a behavior occurs at the beginning or end of the
interval. It should be noted that the occurrence of a behavior may be underestimated
in whole-interval recording, whereas it may be overestimated in partial-interval
recording.
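
The three interval methods described above can be compared with a minimal sketch. It assumes the observation has been reduced to one true/false value per second, and it checks momentary time sampling at the end of each interval (the text allows the beginning or the end). The function name score_intervals and the sample timeline are hypothetical.

```python
def score_intervals(present, interval_len, method):
    """Score each interval as 1 (marked) or 0 (not marked).

    present      -- list of booleans, one per second of observation
    interval_len -- interval length in seconds
    method       -- "whole", "partial", or "momentary"
    """
    scores = []
    for start in range(0, len(present), interval_len):
        chunk = present[start:start + interval_len]
        if method == "whole":
            marked = all(chunk)      # behavior lasted the entire interval
        elif method == "partial":
            marked = any(chunk)      # behavior occurred at least once
        elif method == "momentary":
            marked = chunk[-1]       # look only at the interval's last moment
        else:
            raise ValueError(f"unknown method: {method}")
        scores.append(int(marked))
    return scores


# Hypothetical 30-second observation split into three 10-second intervals.
# The behavior is present for 14 of the 30 seconds (about 47% of the time).
present = [True] * 10 + [False] * 3 + [True] * 3 + [False] * 4 + [False] * 9 + [True]

print(score_intervals(present, 10, "whole"))      # [1, 0, 0]
print(score_intervals(present, 10, "partial"))    # [1, 1, 1]
print(score_intervals(present, 10, "momentary"))  # [1, 0, 1]
```

In this made-up timeline, whole-interval recording marks 1 of 3 intervals and partial-interval recording marks all 3, which illustrates the under- and overestimation noted above.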

Although direct observation provides clear descriptions of behavior, its
characteristics, and environmental variables, some cautions are necessary. First, an 
observer may be biased. For example, if a professional counselor is attending to more 
than one behavior simultaneously, the professional counselor may pay more atten- 
tion to some of the behaviors, but may miss others. Furthermore, because of habit- 
uation, the observer may unintentionally change the operational definition or crite- 
rion of a behavior (e.g., criterion frequency or duration), a factor called observer drift. 
To prevent observer drift, interobserver agreement should be checked (for each type 
of interobserver agreement and its calculation, see Kazdin, 1982). Also, an observer 
should receive periodic training to review the operational definition, criteria of a
behavior, and observation procedures.
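
As a rough illustration of checking interobserver agreement, the sketch below computes interval-by-interval percent agreement, which is one common index; the text defers to Kazdin (1982) for the full set of agreement types, so this is not presented as the method the authors intend. The function name percent_agreement and both observers' records are hypothetical.

```python
def percent_agreement(observer_a, observer_b):
    """Interval-by-interval agreement: agreements / total intervals x 100."""
    if len(observer_a) != len(observer_b):
        raise ValueError("both observers must score the same intervals")
    if not observer_a:
        raise ValueError("at least one interval is required")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100 * agreements / len(observer_a)


# Hypothetical records for 10 intervals (1 = behavior marked, 0 = not marked).
observer_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
observer_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

print(percent_agreement(observer_a, observer_b))  # 80.0
```

A falling agreement figure across sessions is one signal that an observer's working definition may have drifted.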

Second, clients may change a behavior if they know they are being observed, a 
factor called reactivity. For example, if children know they are being observed to de- 
termine the frequency of talking without permission during a class, they may try to 
remain quiet and follow the classroom rules. Clearly, in this case, an observer cannot 
obtain data truly reflecting the behavior (i.e., talking without permission). An ob- 
server may reduce reactivity by staying in the observation setting several times be- 
fore recording observation data so that people become habituated to the observer. 
With attention to the potential pitfalls associated with observation procedures, direct
observation can often provide a clear and complete picture of behavior in natural settings.
Table 9.2 provides an example of an interval recording observation with relevant
operational definitions.
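
The calculation at the bottom of Table 9.2 can be sketched as follows, reading the sheet as dividing the number of marked intervals by 60 for each target behavior and by 180 (3 behaviors x 60 intervals) for total disruptive behavior. The tallies are hypothetical and the function name percent_of_intervals is an illustration, not part of the recording sheet.

```python
TOTAL_INTERVALS = 60  # divisor used on the sample recording sheet


def percent_of_intervals(marked, total=TOTAL_INTERVALS):
    """Percentage of intervals in which a behavior was marked."""
    return 100 * marked / total


# Hypothetical tallies of marked intervals for the three target behaviors:
# OT = off-task, OS = out-of-seat, IV = inappropriate vocalizations.
counts = {"OT": 18, "OS": 6, "IV": 12}

per_behavior = {code: percent_of_intervals(n) for code, n in counts.items()}
total_disruptive = percent_of_intervals(
    sum(counts.values()), total=len(counts) * TOTAL_INTERVALS
)

print(per_behavior)      # {'OT': 30.0, 'OS': 10.0, 'IV': 20.0}
print(total_disruptive)  # 20.0
```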

Table 9.2 Sample interval recording sheet with relevant operational definitions

Sample interval recording sheet

Behavior       10 min.   20 min.   30 min.   40 min.   50 min.   60 min.
Antecedents
Targets
Consequences

Operational definitions for interval recording sheet

ANTECEDENTS

D (Demand): Instruction to complete educational work or an assignment given to complete ("Get to work," "Turn your books to
   page . . . ," teacher hands out a worksheet).
C (Command): Behavioral instruction ("Sit down," "Be quiet," "Go to your desk," "Stop talking," "Look at me").
T (Transition): Moving from one location to another in the classroom or school, switching from one assignment to another
   (walking from the classroom to the lunchroom, moving from a desk to the reading area, switching from a math assignment
   to a spelling assignment).

TARGET BEHAVIORS

OT (Off-task): Student's eyes are not directed toward the teacher for more than 5 seconds during a lecture, instruction, or
   assignment.
OS (Out-of-seat): Student's bottom breaks contact with the seat or floor for more than 5 seconds.
IV (Inappropriate vocalizations): Student talks to teacher or peers without permission, student argues with teacher or peers,
   student makes noises (whistling, howling, humming, clicking sounds).

CONSEQUENCES

E/A (Escape/avoidance): Student is allowed to refrain from working on or completing the assignment, teacher takes assignment
   away, teacher does not make student comply with (follow through on or complete) a command.

Teacher Attention

TP (Teacher positive attention): Smiles, praise statements, proximity following appropriate behavior, physical touch for appropriate
   behavior (pat on the shoulder, "Good job").
TN (Teacher negative attention): Frowns, reprimands, redirections, interruptions, proximity following problem behavior, physical
   touch for problem behavior ("Stop it!" "How many times have I told you to . . . ," tap on shoulder for talking without
   permission).

Peer Attention

PP (Peer positive attention): Smiles, praises, proximity, physical touch for appropriate behavior.
PN (Peer negative attention): Frowns, put-downs ("You're so . . . "), name calling ("dummy, butthead"), proximity following
   problem behavior, physical touch following problem behavior (pushing, hitting, kicking, touching).

Calculation of Performance of Behavior From Interval Recording Sheet

OT: ___ ÷ 60 × 100 = ___% of the intervals
OS: ___ ÷ 60 × 100 = ___% of the intervals
IV: ___ ÷ 60 × 100 = ___% of the intervals
Total Disruptive Behavior: ___ ÷ 180 × 100 = ___% of the intervals

Self-monitoring

Self-monitoring is a method by which clients can observe and record their own behavior.
Self-monitoring is an effective way to monitor infrequent behaviors (e.g.,
binge eating, self-injury) and internalizing problems (e.g., negative thoughts, anxiety,
fear), which are difficult for others to observe (Sattler, 2002). For example, a client
who has depression may record any negative thoughts (e.g., "Although I study hard,
I am not smart enough to pass this course") every 30 minutes for a certain number of
hours. There are two matters to consider for self-monitoring. First, training a client to
effectively monitor behavior is critical, because the client needs to identify a target
behavior precisely and record it appropriately (e.g., every 1 minute). Second, to increase
accuracy, it is effective for a professional counselor to monitor a client's behavior
simultaneously and subsequently compare data with the client's self-monitoring. Also,
periodic feedback regarding procedures and accuracy of self-monitoring may further
promote accuracy.



Indirect Assessment

The behavioral interview 

The purpose of a clinical interview is to assess a client's global problems and related 
history (e.g., family, medical, psychological, educational) for the purpose of arriving 
at a diagnosis (Gresham, 1984). In contrast, the purpose of a behavioral interview 
is to identify a target behavior; to analyze environmental variables affecting the be- 
havior; and to plan, implement, and evaluate an intervention. Thus a behavioral in- 
terview is a solution-focused interview that links assessment to intervention. A pro- 
fessional counselor may interview not only a client, but also significant others (e.g., 
parent, caregiver, spouse, employer, teacher, peer) to obtain multidimensional infor- 
mation about a client's problems from each individual's perspective. For example, a 
wife may report that her husband appears distracted and depressed at home, but 
peers may report that the man is upbeat and active at work. Further information 
from the client's children reveals that the parents have been arguing more over the 
last few months. While the root of the problem is not completely clear yet, it can be 
determined that the man's behavior is limited to one setting. Thus a professional 
counselor can now focus further assessment efforts around the marital relationship 
and design more effective interventions because of the multidimensional informa- 
tion derived from the interview. 



Because of their brevity, self-report and informant-report behavioral checklists and 
rating scales are commonly used methods of indirect assessment. In a self-report, a 
client may either respond to written questions or directly answer a professional coun- 
selor's questions regarding the nature of the client's concerns. However, in an inform- 
ant report, significant others provide their perspective of the client's problems. For 
example, using a self-report, a professional counselor may ask clients to rate the qual- 
ity of their relationships with immediate family members. Through an informant re- 
port, the professional counselor would ask significant others in the client's life to rate 
the client's relationships with immediate family members. While these questions are 
essentially the same, the results could be vastly different. Thus it is very useful to 
compare the results of a self-report and an informant report.

As in a behavioral interview, eliciting useful information in a self-report or 
informant report often depends on the professional counselor's skills of verbal
communication and strategic questions. For some clients, such as children or in- 
dividuals with disabilities, the informant report plays an especially important role 
in obtaining useful information on the client's problems. However, for reasons of 
confidentiality, the client's consent is necessary before obtaining an informant re- 
port. The professional counselor should be aware that responses on a self-report or 
an informant report might not reflect actual problems precisely, because the re- 
sponses represent human memories of past events. Intentionally or unintention- 
ally, some clients or significant others may over- or underreport the severity of the 
client's problems. 

Behavioral checklists and rating scales offer a more standardized means of indirect 
assessment and often have both self-report and informant-report versions available. 
Many of the typical checklists and rating scales have a Likert scale format (i.e., rate 
a behavior on a scale of 1 to 5) or some variation of this response style. For example, 
to the statement "I did not sleep last night," there may be three response choices, 
where 0 represents Not at All, 1 represents Somewhat True, and 2 represents Very
True. While direct observations and interviews provide reliable information, stan- 
dardized rating scales can provide normative information allowing professional 
counselors to compare results of an individual to the population for which the in- 
strument was developed. 

When using rating scales or checklists, professional counselors should be cau- 
tious of halo effects (e.g., tendency to rate a high-performing student as well-behaved 
regardless of actual behaviors observed), and central tendency error (e.g., tendency to 
respond with moderate or centrist descriptions rather than toward the extremes of a 
rating scale). For example, some people may respond more mildly or severely than 
their actual level (e.g., they may choose a number between 2 and 4 on a 5-point 
Likert scale). Clients may respond this way because they are embarrassed about cer- 
tain symptoms, have ulterior motives for representing themselves in a more positive 
light, do not really understand the questions being presented, do not have the self- 
awareness to respond accurately, or view the extreme rating choices as very extreme. 
Thus, as is the case with any aspect of the assessment process, it is important for the 
professional counselor to provide clear instructions, adequate details about the pur- 
pose of the assessment, and information about the instrument, and to answer any 
questions the client or informant may have. Despite the potential weaknesses with 
behavioral checklists and rating scales, they are easy, inexpensive, and not time con- 
suming. Also, some have been shown to reliably and validly screen or identify spe- 
cific areas of disorders. For example, the Child Behavior Checklist/6-18 (CBCL/6-18) 
(Achenbach & Rescorla, 2001) assesses the behavioral problems and adaptive functioning
of children ages 6 to 18 years. The CBCL/6-18 has 118 specific problem
items (and an additional 20 competence items). Each item is rated on a 3-point scale,
on which 2 represents Very True or Often True, 1 represents Somewhat or
Sometimes True, and 0 represents Not True (As Far As You Know). Normally a parent
or guardian can complete the CBCL/6-18 in approximately 15 minutes. Updated
information on the CBCL/6-18 and other Achenbach products is available on the 
website of the ASEBA Products (www.aseba.org/products/lorms.html). 




While professional counselors are strongly encouraged to follow the best prac- 
tices for assessment as outlined in this text, the fact remains that assessment can be 
a time-consuming process. The reality is that professional counselors are often re- 
stricted in the amount of time they can dedicate to assessment. Thus it is important 
to have various methods of gaining reliable and valid information about a client's 
presenting problems in a relatively short time. Self-report and informant-report be- 
havioral checklists and rating scales are practical assessment tools to identify problem 
behaviors and to obtain multidimensional information from clients and their signif- 
icant others. When selected thoughtfully, respondents to checklists and rating scales 
provide valuable, accurate, cost-effective, and time-effective insight into client be- 
haviors from the naturalistic settings. 



Think About It 9.2 Why would it be beneficial for a professional coun- 
selor to use both direct and indirect assessment approaches when evaluating 
a client? 



BEHAVIORAL RATING SCALES AND INVENTORIES 
USED IN COUNSELING 



The lines between clinical, behavioral, and personality assessments are quite blurred,
and the overlap in functions is sometimes pronounced. While there are innumerable assessment
tools available for use by professional counselors, the tests and inventories
that follow in this chapter are among the most commonly used for indirect assessment 
of behaviors. An overview of the format and psychometric properties of each instru- 
ment is provided. Hopefully, these reviews will help in the selection process of other 
instruments as well. It is important to note that only a few of the hundreds of avail- 
able rating scales are reviewed below, but the skills in understanding and using tests 
garnered from this text will help the reader evaluate, select, and use other instruments. 



Conners' Rating Scales-Revised (CRS-R)



The Conners' Rating Scales — Revised (CRS-R) (Conners, 1997) is a multi-informant 
inventory designed to assess psychopathology and problem behavior in children and 
adolescents ages 3 to 17 years. It can be completed by parents and teachers, and can 
also be self-reported by adolescents. The CRS-R is available in four primary formats 
based on length and respondent: (1) a short form (27 items) of the Conners' Parent 
Rating Scales — Revised (CPRS-R:S); (2) a short form (28 items) of the Conners' 
Teacher Rating Scale — Revised (CTRS-R:S); (3) a long form of 80 items for parents 
(CPRS-R:L); and (4) a long form of 59 items for teachers (CTRS-R:L). An adolescent
self-report form, the Conners-Wells Adolescent Self-Report Scale (CASS), is available in
long (CASS:L) and short (CASS:S) forms (Conners & Wells, 1997), and an adult
form, the Conners' Adult ADHD Rating Scales (CAARS) (Conners, Erhardt, &
Sparrow, 1999), is also available. Items measure such facets as Oppositional, Social 
Problems, Cognitive Problems/Inattention, Psychosomatic, Hyperactivity, Symptom 
Subscales, Anxious-Shy, ADHD Index, Perfectionism, and a Conners' Global Index 
(Conners, 1997). In addition, the long forms provide two DSM-IV subscales 
(Inattention, Hyperactive-Impulsive), scored either as a straight symptom count or in comparison
to norms. Sample items from the CPRS-R:S include "Argues with adults,"
"Irritable," and "Deliberately does things that annoy other people." The CRS-R can 
be completed using pencil and paper in 5 to 10 minutes for the short version and 10 
to 20 minutes for the long version. This inventory can be completed by computer, 
remotely, or over the telephone, and is available in English, Spanish, and French- 
Canadian languages. 

The normative sample for the CRS-R consisted of over 8,000 cases in a large 
database compiled from over 200 collection sites throughout North America 
(Conners, 1997). This inventory requires Level B instrument qualifications and is 
written at the 10th-grade reading level for the parent and teacher forms, and at the
6th-grade level for the long-form adolescent self-report (CASS:L). Subscale internal
consistency coefficients are satisfactory, ranging from 0.73 to 0.94 for the CPRS-R:L;
0.86 to 0.94 for the CPRS-R:S; 0.77 to 0.96 for the CTRS-R:L; 0.88 to 0.95 for the
CTRS-R:S; and 0.75 to 0.92 for the CASS:L (Conners, 1997). Raw scores are converted
into T scores and percentiles. The various versions of the CRS-R are helpful
for identifying AD/HD-type behaviors and tracking therapeutic progress
(Gianarris, Golden, & Greene, 2001; Townsend, Baylot, & Erford, 2006). Hand
scoring of the protocols is easy using pressure-sensitive carbonless paper, but com- 
puter scoring and mail or fax scoring are also available. Clinicians need to be cautious 
when using this inventory for African American clients, because this group was un- 
derrepresented in the parent sample. It is an excellent screening device for AD/HD 
and general childhood psychopathology. 



Attention Deficit Disorders Evaluation Scale-Third Edition 
(ADDES-3) 



The Attention Deficit Disorders Evaluation Scale-Third Edition (ADDES-3)
(McCarney & Arthaud, 2004a, 2004b) was designed to assess symptoms of AD/HD 
(inattentiveness, hyperactivity, impulsivity) in children and adolescents ages 4 to 18 
years. It is available in two versions: a Home Version of 46 items for parent report 
(ADDES-3-HV) and a School Version of 60 items for teacher report (ADDES-3-SV). 
Each version consists of two subscales: Inattentive and Hyperactive-Impulsive. A 
child's demonstration of a given behavior is rated on a 6-point scale: 0 = Not
Developmentally Appropriate for Age; 1 = Not Observed; 2 = One to Several Times
per Month; 3 = One to Several Times per Week; 4 = One to Several Times per Day;
5 = One to Several Times per Hour. Such a rating system allows for substantial specificity
in determining the frequency of display of a given behavior (Demaray &
Elting, 2003). The ADDES-3 is a Level B test and generally requires 15 to 20 min- 
utes to administer and score. Scoring can be accomplished by hand or computer. 




Raw scores can be converted to scaled scores (M = 10; SD = 3) and percentile ranks.
Lower scaled scores or percentile ranks indicate higher levels of inattentiveness or hy- 
peractivity of the client (Erford, 2006). 
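
The scaled-score metric described above (M = 10; SD = 3) is a linear rescaling of a score's distance from the norm-group mean. The sketch below illustrates the arithmetic only; it is not the publisher's scoring procedure, which relies on printed norm tables. The norm-group mean and standard deviation are made-up values, the function names are hypothetical, and the percentile step assumes normally distributed scores.

```python
from statistics import NormalDist


def to_standard_score(raw, norm_mean, norm_sd, scale_mean, scale_sd):
    """Linearly rescale a raw score onto a standard-score metric."""
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return scale_mean + scale_sd * z


def percentile_rank(score, scale_mean, scale_sd):
    """Percentile rank of a standard score, assuming a normal distribution."""
    return 100 * NormalDist(scale_mean, scale_sd).cdf(score)


# Hypothetical norm group: raw-score mean of 40 and SD of 8 (illustrative only).
scaled = to_standard_score(raw=32, norm_mean=40, norm_sd=8,
                           scale_mean=10, scale_sd=3)
print(scaled)                                  # 7.0
print(round(percentile_rank(scaled, 10, 3)))   # 16
```

A scaled score of 7 sits one standard deviation below the mean, at roughly the 16th percentile. The same function yields T scores when called with scale_mean=50 and scale_sd=10.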

The standardization sample generally conformed to the 2000 U.S. Census pop- 
ulation demographics. However, the School Version had a lower percentage of White 
participants (62.42%) than the national sample (71.89%), and both the School 
Version and the Home Version contained higher numbers of Black participants 
(24.64% and 15.13%, respectively) versus the national sample (12.14%). The 
ADDES-3-SV age category coefficient alphas for the Inattentive subscale ranged 
from r = 0.89 to r = 0.98 (median = 0.98); the Hyperactive-Impulsive subscale 
ranged from r = 0.89 to r = 0.99 (median = 0.98); and the overall quotients ranged 
from r = 0.98 to r = 0.99 (median = 0.99). The ADDES-3-HV coefficient alphas for 
the age categories of the Inattentive subscale ranged from r = 0.90 to r = 0.97 (me- 
dian = 0.96); the Hyperactive-Impulsive subscale ranged from r = 0.95 to r = 0.97 
(median = 0.96); and the overall quotient ranged from r = 0.96 to r = 0.98 (median 
= 0.98). However, it is not stated whether coefficients were derived from raw scores 
or standard scores. If raw scores served as the basis for reliability coefficients, the es- 
timates would be inflated (Erford, 2006). Therefore, further analysis using standard 
scores should be conducted. 

Criterion-related validity studies provided in the manual used the ADDES-2, 
which contained very similar items in most regards. Bussing, Schuhmann, and Belin 
(1998) found the ADDES-2 produced a significant number of false positives and 
false negatives and that the results for girls were more accurate than those for boys. 
Overall, the psychometric characteristics of the ADDES-3 appear adequate for 
screening symptoms of AD/HD. Ancillary publications have been developed, in- 
cluding The Parents' Guide to Attention Deficit Disorders — Second Edition (McCarney 
& Baker, 1995) and the Attention Deficit Disorders Intervention Manual — Second 
Edition (McCarney, 1994). Klecker (2001, p. 91) was quite critical of these supple- 
ments, however, stating that the materials were "too fragmented to be either read- 
able or helpful. The supplements would be more useful with age-specific scenarios 
and practical examples." 



Behavior Assessment System for Children (BASC) 



The Behavior Assessment System for Children (BASC) (Reynolds & Kamphaus, 1992;
1998) was designed to aid in the identification and diagnosis of emotional and
behavior disorders in children and adolescents ages 2.5 to 18 years. It is a multi-informant,
multi-assessment battery composed of five components: (1) Teacher
Rating Scales (TRS); (2) Parent Rating Scales (PRS); (3) Self-Report of Personality
(SRP); (4) Structured Developmental History (SDH); and (5) Student Observation 
System (SOS). Items on the TRS and PRS utilize a 4-point frequency rating rang- 
ing from Never to Almost Always. These components yield 4 composite scores 
(Internalizing Problems, Externalizing Problems, Adaptive Skills, and the 
Behavioral Symptoms Index) as well as 10 scale scores (Aggression, Hyperactivity,
Anxiety, Depression, Somatization, Attention Problems, Atypicality, Withdrawal, 
Adaptability, and Social Skills). For each component, administration and scoring 
range from 10 to 30 minutes. The standardization samples generally conformed to 
the U.S. population demographics. 

The internal consistency coefficients for the TRS composites are generally high, 
ranging from r = 0.88 to r = 0.95 for the younger preschool age group, and from 
r = 0.90 to r = 0.96 for the older preschool age group. Somewhat lower were the co- 
efficients of the scales for both the younger (r = 0.71-0.92) and the older preschool 
age groups (r = 0.78-0.90). Test-retest studies, with a maximum of 2 months be- 
tween administrations, yielded correlations ranging from r = 0.90 to r = 0.95 (com- 
posites) and from r = 0.82 to r = 0.95 (scales) (Erford, 2006). Validity evidence for 
the BASC is based on factor analysis of its theoretical model, correlations with sim- 
ilar tests, and correlation matrices between the TRS and PRS. The BASC psychome- 
tric characteristics are quite sound, and it appears a robust measure for screening 
emotional and behavior symptoms (Witt & Jones, 1998), but Erford (2006) urged 
the use of subscale results only for hypothesis generation and validation, not diagno- 
sis, due to lower technical adequacy. Sandoval (1998) also indicated that the stan- 
dardization sample overrepresented children from Catholic and university-affiliated 
schools. Finally, Wilder & Sudweeks (2003) indicated that a lack of specific psycho- 
metric data on culturally diverse subpopulations indicates the need for caution when 
assessing and making decisions about culturally diverse youth. 



Disruptive Behavior Rating Scale (DBRS)



The Disruptive Behavior Rating Scale (DBRS) (Erford, 1993) was designed to pro- 
vide quick, meaningful information regarding disruptive behaviors displayed by chil- 
dren ages 5-10 years. It assesses symptoms associated with distractibility, impulsive- 
hyperactivity, oppositional behavior, and antisocial conduct. The DBRS can be used 
as a preliminary screening tool, as part of a medical, psychological, or psychoeduca- 
tional evaluation, to target specific behaviors, or as a pretest-posttest measure of 
intervention effectiveness (Erford, 1993). It is available in two versions (teacher and 
parent), and separate norms are provided for teachers, mothers, and fathers. To 
eliminate cross-respondent confounds, each version of the DBRS contains 50 items 
with identical wording. All items are answered based on a 4-point frequency scale: 
0 = Rarely/Hardly Ever; 1 = Occasionally; 2 = Frequently; and 3 = Most of the
Time. The DBRS generally requires 5 to 10 minutes for administration and is easily
scored by hand or by computer (McKechnie, 2006). Raw scores are transformed into 
T scores, percentile ranks, and three interpretive ranges: Abnormal (T > 66); 
Borderline (60 < T < 66); and Normal (T < 60). The standardization sample under- 
represented minorities, rural residents, and individuals whose parents had lower lev- 
els of education (Erford, 1993). 
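
The three interpretive ranges above amount to a simple threshold rule on the T score. The following is a minimal sketch of that rule, not the DBRS scoring software. The text states the cutoffs with strict inequalities, so how a T score of exactly 60 or 66 is classified is not specified here; assigning both to Borderline is an assumption made only for illustration, and the function name is hypothetical.

```python
def dbrs_range(t_score):
    """Map a T score to an interpretive range.

    Cutoffs follow the text: Abnormal (T > 66), Normal (T < 60).
    Treating exactly 60 and exactly 66 as Borderline is an assumption.
    """
    if t_score > 66:
        return "Abnormal"
    if t_score < 60:
        return "Normal"
    return "Borderline"


for t in (55, 63, 74):
    print(t, dbrs_range(t))
```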

Cronbach's alpha reliability coefficients for the DBRS subscales were well above 
the minimum acceptable level (r > 0.80, discussed in Chapter 3) for the Distractible 
(r = 0.92-0.95; median = 0.92); Oppositional (r = 0.86-0.96; median = 0.88); and 
Impulsive-Hyperactivity (r = 0.88-0.96; median = 0.92) subscales. However, the
Antisocial Conduct subscale coefficients were substantially lower (r = 0.67-0.77; 
median = 0.73), most likely because it contains only four heterogeneous items. 
Similar results were found for 30-day test-retest studies. The DBRS's content, con- 
struct, and criterion-related validity when compared to factors in other tests were 
moderate to high (Erford, 1996, 1997a, 1998; McKechnie, 2006). Table 9.3 pro- 
vides a sample of output from the DBRS computerized scoring and interpretation 
system. 
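
To see why a short subscale made up of heterogeneous items tends to produce a lower coefficient alpha, it can help to compute alpha directly. The sketch below uses the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), on two tiny made-up data sets rated 0 to 3. The ratings are invented for illustration and are not DBRS data.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item ratings."""
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # one tuple of ratings per item
    totals = [sum(person) for person in responses]
    item_variance_sum = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))


# Made-up ratings: items that rise and fall together across respondents.
homogeneous = [[0, 0, 1], [1, 1, 1], [2, 2, 2], [3, 3, 2]]
# Made-up ratings: items that vary largely independently of one another.
heterogeneous = [[0, 2, 1], [1, 0, 2], [2, 3, 1], [3, 1, 3]]

print(round(cronbach_alpha(homogeneous), 2))    # 0.93
print(round(cronbach_alpha(heterogeneous), 2))
```

When items move together, the variance of the total score is much larger than the sum of the item variances, and alpha approaches 1. When items are unrelated, the two quantities are similar and alpha falls toward 0.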



Coping Inventory for Stressful Situations (CISS) 



The Coping Inventory for Stressful Situations (CISS) is a 48-item self-report inven- 
tory used to assess three major coping styles: (1) task-oriented, (2) emotion- 
oriented, and (3) avoidance-oriented. Each coping style is assessed through 16 
items. The CISS is based on the multidimensional interaction model of stress, anxiety,
and coping developed by Endler, one of the authors of the CISS. According to Endler (1997),
task-oriented coping contains efforts such as problem solving and situation chang- 
ing, whereas emotion-oriented coping contains self-oriented responses such as emo- 
tional reactions, self-preoccupation, and fantasizing. Avoidance-oriented coping 
contains activities or cognitive changes to avoid stressful situations (for details of 
the multidimensional interaction model and the three coping styles, see Endler, 
1997). There are two versions of the CISS: an Adolescent version (ages 13-18) and 
an Adult version (ages 18 and older). Paper-and-pencil record forms called 
"QuikScore™" are available. A 21-item brief format for adults (CISS: Situation 
Specific Coping [CISS:SSC]) is also available to assess coping style in situations in- 
volving social evaluation and interpersonal conflicts (Multi-Health Systems, Inc., 
2003). Current cost information and online ordering information are available on 
the website of Multi-Health Systems (2003). 

Each item of the CISS is formatted on a Likert scale ranging from 1 (Not at All) 
to 5 (Very Much). The CISS takes approximately 10 minutes to complete and has a 
Level A qualification for administration and interpretation. An examiner scores the 
CISS using a scoring grid and obtains a percentile rank and T score using a profile 
sheet on the back side of the scoring grid. Provided scales are Task, Emotion, and 
Avoidance. Avoidance consists of two subscales: Distraction and Social Diversion. 

Norms are provided for adults and adolescents (Tirre, 2003). For adults, separate
male and female norms are provided for both the general population and psychiatric
patients. For adolescents, separate norms are provided for individuals ages
13-15 years and 16-18 years. Separate college student norms are also available.
Endler (1997) found sufficient internal consistency and test-retest reliability for the
CISS. Endler (1997) also found the scores of the CISS to be valid. Through an examination
of construct validity, Endler discovered that some CISS scales were significantly
correlated with related measures, such as the Beck Depression Inventory (BDI)
and the Eysenck Personality Inventory (EPI).

Professional counselors interested in using the CISS are encouraged to explore 
Endler's multidimensional interaction model prior to use. Endler (1997) insisted on 
the necessity of examining not only the interaction between person and situation
variables, but also the interaction within person variables (e.g., cognitive style, biological
variables) and situation variables (e.g., stressful events, physical environments),
given that "stress, anxiety, and coping all involve complex processes and all
interact with one another" (Endler, 1997, p. 149).



Table 9.3 Computerized DBRS report for a 7-year-old boy named Billy

Respondent's name: Mrs. Jones, his teacher

Summary Statistics and Critical Analysis Tables

Scale                      Raw score   SEM   T score; Range   %ile Rank; Range   Range of significance
Distractible               21          4     67; 63-71        96; 91-98          Borderline to Abnormal
Oppositional               5           6     55; 49-61        69; 47-87          Normal to Borderline
Impulsive-Hyperactive      21          5     74; 69-79        99; 97-99.81       Abnormal
Antisocial conduct (Aux)   1           13    55; 42-68        69; 21-96          Normal to Abnormal

Critical Item Analysis

Scale                      Item   Statement
Distractible               8      Doesn't seem to remember what is said.
                           22     Has difficulty following simple instructions.
                           31     Does not finish activities undertaken.
Oppositional               None
Impulsive-Hyperactive      3      Calls out unexpectedly.
                           6      Fidgety.
                           10     Finds it hard to await turn in group situations.
                           13     Restless, squirmy.
                           21     Interrupts.
                           34     Has difficulty sitting still.
                           42     Finds it hard to play quietly.
Antisocial conduct (Aux)   None

Interpretation

The Disruptive Behavior Rating Scale-Teacher Version (DBRS-T) is a 50-item inventory of common childhood behaviors
associated with distractible, impulsive-hyperactive, oppositional, and antisocial behavior. Mrs. Jones's responses to the DBRS-T
indicate that Billy is observed to perform in the Borderline to Abnormal range of distractible behavior. Billy is more distractible
than approximately 96% of boys his age. Billy is having particular difficulty remembering what is said, following simple
instructions, and finishing activities undertaken.

Billy is observed as performing in the Abnormal range of impulsive-hyperactive behavior. Billy is more impulsive-hyperactive
than approximately 99% of boys his age. Billy displays a significant inclination toward calling out unexpectedly,
fidgeting, not awaiting turns in group activities, restless squirming, interrupting, difficulty in sitting still, and difficulty in playing
quietly.

Additionally, Billy performs in the Normal to Borderline range of oppositional behavior. Billy is more oppositional than
approximately 69% of boys his age. No critical items were determined for this factor.

Finally, Billy is rated to perform in the Normal to Abnormal range of antisocial conduct. Billy is more antisocial than
approximately 69% of boys his age. No critical items were determined for this factor.

A diagnosis of Attention-Deficit Hyperactivity Disorder (AD/HD) should be considered. Validation of these findings
through multiple methods of evaluation and multiple informants is recommended.
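
The T-score ranges reported in Table 9.3 equal each obtained T score plus or minus its standard error of measurement (SEM), for example 67 plus or minus 4 gives 63-71. The sketch below reproduces that arithmetic and also approximates the percentile ranks from the normal curve. The percentile step is an assumption made for illustration: the actual report presumably draws on norm tables, so some tabled values may differ by a point from this approximation. The function names are hypothetical.

```python
from statistics import NormalDist

T_METRIC = NormalDist(mu=50, sigma=10)  # T scores have a mean of 50 and an SD of 10


def t_band(t_score, sem):
    """Band of one SEM on either side of the obtained T score."""
    return t_score - sem, t_score + sem


def t_to_percentile(t_score):
    """Approximate percentile rank of a T score under a normal-curve assumption."""
    return 100 * T_METRIC.cdf(t_score)


# Obtained T scores and SEMs copied from Table 9.3.
report = {
    "Distractible": (67, 4),
    "Oppositional": (55, 6),
    "Impulsive-Hyperactive": (74, 5),
    "Antisocial conduct (Aux)": (55, 13),
}

for scale, (t, sem) in report.items():
    low, high = t_band(t, sem)
    print(f"{scale}: T = {t} ({low}-{high}), percentile about {t_to_percentile(t):.0f}")
```

Labeling each end of a band with its interpretive range gives the "Range of significance" column. The wide band for Antisocial conduct (42-68) reflects that subscale's larger SEM, which is why its result spans Normal to Abnormal.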



SUMMARY/CONCLUSION



Professional counselors should remember that the referral question should always
drive the assessment process. All too often, assessment reports are driven by a
one-size-fits-all approach. It is important to gather data from multiple methods and multiple
informants to evaluate how the identified individual differs from other individuals
in the population (nomothetic comparisons) and to identify specific targets for
remediation or therapy. Professional counselors should use a combination of behavioral
interviews, rating scales, inventories, and direct observations to obtain a comprehensive
picture of the client and the specific referral concerns. Doing so not only
provides appropriate services but constitutes best practices for ethical and legal obligations
of service provision in the area of assessment.



KEY TERMS



behavioral assessment
behavioral interview
direct assessment
functional behavioral assessment (FBA)
indirect assessment
interval recording
narrative recording
operational definition
self-monitoring







CHAPTER 10



Assessment of Intelligence 

by Bradley T. Erford, Lauren Klein, and Kathleen McNinch 



Intelligence is an important human characteristic with robust applications to the 
areas of academic achievement, career development, and psychopathology. There 
is no commonly accepted definition of intelligence, and numerous models have 
been offered to explain and measure this construct. This chapter explores these mod- 
els and reviews many of the individual and group-administered tests designed to 
measure intelligence. In addition, important societal and educational issues and im- 
plications are discussed. 



WHAT IS INTELLIGENCE? 



"She's really smart." "He's about as bright as a burned-out light bulb." "She should 
aspire to raise her IQ to room temperature." "He's brilliant, simply brilliant!" At 
some time, most people have overheard (or perhaps made) a judgment about their 
own or someone else's probable level of intelligence. For more than a century, theo- 
rists and test developers have attempted to define and operationalize "intelligence." 
In 1921, 17 experts responded to an invitation by the editor of the Journal of 
Educational Psychology to define and describe their perspectives on intelligence. In 
1986, Sternberg and Detterman similarly consulted leading experts in the field. The 
result in both cases: The experts revealed great diversity and little commonality in 
their conceptions of what intelligence entails. Charles Spearman (1927, p. 14), a fa- 
mous theoretician and researcher in the field of intelligence, pessimistically con- 
cluded, "In truth, intelligence has become ... a word with so many meanings that 
finally it has none." 






Various definitions of intelligence emphasized at least one of the following com- 
ponents (Sax, 1997): (1) origin — whether intelligence is inherited, learned, or both; 
(2) structure — its traits, facets, or components; and (3) function — its purpose, usually 
to aid in adjustment or survival. In a broad sense, intelligence is a human-contrived 
construct used to explain one's (genetic and/or learned) abilities to reason through 
and solve problems or dilemmas of importance to human adaptation. And as if 
defining intelligence isn't challenging enough, measuring it is even harder! The pre- 
mier challenge confronting researchers and test developers in the field of intellectual 
assessment is to operationally define the construct of intelligence from often-diver- 
gent theoretical perspectives. Therefore, nearly all tests of intelligence available for 
use today measure some conception of cognitive capability, but each does so from a 
somewhat different perspective. 

The term intelligence testing is virtually synonymous with the terms cognitive 
ability testing and mental ability testing. However, the term aptitude, while overlap- 
ping in many ways with intelligence, is a concept that implies a more specialized use 
of intellectual, perceptual, and motor abilities — usually with vocational or educa- 
tional applications. The area of aptitude assessment will be covered in further detail 
in Chapter 11. Intelligence testing is undertaken to estimate a client's ability to comprehend
and express verbal information; to solve problems through verbal or nonverbal
means (i.e., spatial, figural, visual); to learn and remember information (i.e.,
short-term, long-term); and to assess information processing efficiency. In short, intelligence
is a useful and robust concept with widespread clinical applications.

While professional counselors may not frequently be the professional adminis- 
tering a given intelligence test, it is essential that professional counselors understand 
the nature of intelligence, the practical features of intelligence tests, and how these 
tests are used for clinical and educational decision making and for treatment and re- 
medial planning. For example, professional school counselors and other educational 
personnel use intelligence tests to help determine a student's eligibility for special ed- 
ucation services under the Individuals With Disabilities Education Improvement Act 
(IDEIA), and often for educational accommodations under Section 504 of the U.S. 
Rehabilitation Act of 1973. Mental health and community counselors use intelli- 
gence test information to establish effective treatment plans and to advocate on be- 
half of clients with special needs. Career and professional school counselors use in- 
telligence test information to help students and clients with educational planning 
and career choices. Intelligence test results are helpful decision-making tools appli- 
cable across a wide gamut of life decisions. 



Think About It 10.1 How could intelligence testing be beneficial in
working with students?



Unfortunately, there is no widespread consensus over the definition of intelligence. 
Various researchers and test developers have conceived of very diverse theories of, 
and perspectives on, intelligence. Indeed, one could support the assertion that each
intelligence test published and available today has a somewhat different theoretical 
underpinning. The differences are frequently slight, at other times vast. But keep in 
mind, all purport to measure this concept referred to as "intelligence." 



NATURE AND THEORIES OF INTELLIGENCE 



For more than a century, numerous researchers and test developers have attempted 
to define the construct of intelligence. While there is great diversity in these concep- 
tions of intelligence, typically intelligence tests measure, to a greater extent, verbal 
abilities and, to a lesser extent, abstract visual reasoning and quantitative skills. There 
is also general agreement that speed and efficiency of problem-solving capacities are 
characteristic of individuals with higher levels of intelligence (Jensen, 1985). 
Snyderman and Rothman (1986) surveyed 661 testing authorities, virtually all of 
whom agreed that intelligence involves, at a minimum, capacity to acquire knowl- 
edge, abstract reasoning, and general problem-solving capabilities. Some (e.g., 
Gardner, 1983) even integrate personality variables into their definition. What follows
is a brief exploration of some conceptualizations, theories, and models of intelligence
developed over the past century. Note how the construct of intelligence has
at times evolved from simpler to more complex explanations, while at other times 
divergent pathways have led to new theoretical models and orientations. 



Historical Conceptualizations of Intelligence 



In the late 19th century, Sir Francis Galton and James McKeen Cattell believed in 
the importance of sensory acuities and capabilities as indications of intellectual 
prowess, because all information about the external world (and thus all potential 
learning) entered through the senses. To their way of thinking, the more highly de- 
veloped and attuned one's senses, the more intelligent one could become. While 
plausible on its surface, such a perspective fails to account for thinking or reasoning 
processes. In 1890, Cattell coined the term mental test, giving rise to the field of 
study now known as intellectual assessment. Unfortunately, from early on, other re- 
searchers (Wissler, 1901) demonstrated that the type of "intelligence" Cattell and 
others were proposing had little relationship to academic performance, failing to ex- 
plain why some students, particularly at the university level, do better or more poorly 
than others. Interestingly, Wissler's results later were criticized for using a sample 
with a restricted range of ability — a flaw that suppresses the magnitude of a correla- 
tion coefficient — as is discussed on the companion website. 
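
The suppressing effect of a restricted range on a correlation coefficient can be illustrated with a small simulation. The sketch below does not use Wissler's data. The sample size, the assumed population correlation of 0.60, and the rule of keeping only the top 20% of scorers are hypothetical values chosen purely for illustration, and the function names are likewise invented for this example.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_range_restriction(n=5000, true_r=0.6, top_fraction=0.2, seed=1):
    """Return (r in the full sample, r among the top scorers only).

    All parameter values are illustrative assumptions, not historical data.
    """
    rng = random.Random(seed)
    # Standardized "test" scores for an unselected population.
    test = [rng.gauss(0, 1) for _ in range(n)]
    # "Grades" share true_r of their standard deviation with the test scores.
    noise_sd = (1 - true_r ** 2) ** 0.5
    grades = [true_r * t + rng.gauss(0, noise_sd) for t in test]
    r_full = pearson_r(test, grades)

    # Keep only the highest scorers, as in a selective university sample.
    cutoff = sorted(test)[int(n * (1 - top_fraction))]
    kept = [(t, g) for t, g in zip(test, grades) if t >= cutoff]
    r_restricted = pearson_r([t for t, _ in kept], [g for _, g in kept])
    return r_full, r_restricted


if __name__ == "__main__":
    r_full, r_restricted = simulate_range_restriction()
    print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

Under these assumptions the correlation computed on the selected group comes out noticeably smaller than the correlation in the full group, even though the underlying relationship between the two variables is unchanged.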

From the early reliance on sensory processing, definitions of intelligence evolved 
with a heavier focus on internal thinking and reasoning processes. At the same time, 
however, the concept of intelligence was also discussed primarily as a general, unidi- 
mensional construct. 

Alfred Binet 

Alfred Binet defined intelligence as the "tendency to take and maintain a definite
direction; the capacity to make adaptations for the purpose of attaining a desired end"
(as cited in Terman, 1916a, p. 45). Binet and Henri (1895a, 1895b, 1895c) studied
facets of human intelligence that were far more complex and less easily measured 
than the simple sensory functions observed by Galton, including tasks of reasoning, 
comprehension, memory, judgment, and abstraction (Varon, 1936). Binet believed 
distinct thinking abilities were integrated into a general ability that was called on 
when solving problems. Thus, when one is solving a problem such as, "What should 
you do if your boat begins to sink in the middle of a large lake?" Binet believed that 
it was difficult to sort out the influence of, say, practical experience, memory, rea- 
soning, and verbal facility in the construction of an acceptable answer. This prelim- 
inary research led to the development of the first functional individual intelligence 
test by Binet and Simon (1905). 

David Wechsler 

Wechsler (1955, p. 7) once wrote: 

Intelligence, operationally defined, is the aggregate or global capacity of the indi- 
vidual to act purposively, to think rationally, and to deal effectively with his en- 
vironment. It is aggregate or global because it is composed of elements or abili- 
ties which, though not entirely independent, are qualitatively differentiable. . . . 
The only way we can evaluate it quantitatively is by the measurement of the var- 
ious aspects of these abilities. 

In 1939, Wechsler developed a test to measure the intelligence of individual 
adults. His test was composed of a collection of subtests adapted from the Army 
Alpha and Beta tests from World War I. His verbal subtests were modeled after items
from the Army Alpha, and his performance subtests were modeled after items from the
Army Beta. Combining the scores from the verbal and performance subtests yielded
a full-scale intelligence estimate that Wechsler believed to be a good representation of g.
However, the development of the original test and its various revised editions was
driven more by clinical practice and implications than by theoretical considerations.

Wechsler clearly acknowledged a general factor (g) composed of multiple com- 
ponents, and his intelligence tests, which will be discussed later in this chapter, have 
become the most commonly used in history. However, it is important to remember 
that while Wechsler stressed the essential role of cognitive capabilities in intellectual 
capabilities, he also recognized that a comprehensive understanding of intelligence 
involved noncognitive capacities, including "capabilities more of the nature of con- 
native, affective, or personality traits . . . such as drive, persistence, and goal aware- 
ness . . . [and] ... an individual's potential to perceive and respond to social, moral, 
and aesthetic values" (Wechsler, 1975, p. 136). 

Piaget's developmental model 

Swiss developmental psychologist Jean Piaget has made important theoretical contri- 
butions to the understanding of childhood intelligence (1954, 1971). Piaget believed 
that the function of intelligence was to help humans to adapt to the environment. 
As individuals become more intelligent, they progress through more advanced levels 
of symbolic representation. Eventually, physical trial and error is replaced by mental
trial and error. To Piaget, learning was a consequence of an individual's interacting
with the environment and encountering dilemmas that required mastery through a 
reorganization of thought. These organized structures were called schemata. Infants 
are born with some schemata (i.e., sucking, grasping) and learn about the environ- 
ment by coordinating these schemata to take in new information (Cohen & 
Swerdlik, 1999). Thus infants may grasp objects and place them into their mouths 
to more fully appreciate the object. Eventually, schemata of greater and greater com- 
plexity develop, departing from sole reliance on the physical realm and leading to 
cognitive transformations. As the individual interacts with the environment, existing 
schemata are constantly being refined, and new schemata are formed. 

Piaget proposed two methods by which humans organize their cognitive struc- 
tures and adapt to new contexts. Assimilation is the process by which individuals 
make sense of new information in terms of a structure or process that already exists. 
For example, small children generally know what a dog is and know a dog when they 
encounter it or see a picture of it. Every time they see a dog and recognize it as a dog, 
they are assimilating this information — making sense of it in terms of an existing 
structure. New information is related to old structures. Accommodation is the 
process by which individuals make sense of new information by changing the exist- 
ing structure or process, thus creating an adapted schema. For example, eventually 
children recognize that there are different types of dogs (i.e., golden retrievers, bea- 
gles, miniature poodles) and restructure the previously existing category (e.g., they're 
all dogs) into new, more meaningful categories designated by using diverse dog breed 
names. Thus new information is reorganized into new ways of thinking. 

Piaget was also instrumental in developing the now common assumption that 
there are qualitative differences in the way children think at various ages. His theory 
proposes four stages of development. The sensorimotor stage occurs during the first
2 years of life, and cognitions of the infant and toddler are basically limited to
the sensory processes in the immediate environment (i.e., touch, taste, smell, sight, 
hearing). The preoperational stage generally occurs between ages 2 and 7 years and 
evolves from the child's emerging ability to reason symbolically (i.e., to use words to 
symbolize objects). The concrete operational stage involves the beginnings of logical, 
systematic thinking and generally develops between ages 7 and 12 years. Concepts 
such as conservation and reversibility of operations are important, but problem solv- 
ing is still predominated by direct, immediate experiences. Piaget's highest level of 
reasoning was called the formal operational stage. Generally emerging around age 12 
years, this period is marked by problem-solving strategies that rely on increasing levels
of abstract, systematic, hypothetical thinking. Individuals can evaluate their own
thought processes (metacognition) and more easily see how several variables relate
and interact and how they can be used to learn and predict. Piaget's theory has been
very influential in education, shaping curricular activities, materials, and programs.



Think About It 10.2 How would being aware of Piaget's stages of development
be useful when working with children?






[Figure 10.1 Spearman's two-factor theory of intelligence: a general intelligence factor (g)
at the center, related to specific factors such as verbal ability (s1), quantitative ability (s2),
mechanical skills (s3), and visual-spatial reasoning (s4).]



Spearman's general-factor theory (g) 

A British statistician and psychologist named Charles Spearman, the innovator of a 
useful statistical technique now known as factor analysis, proposed a theory of intel- 
ligence that is referred to as both a "two-factor theory" and a "general factor theory" 
of intelligence. His theory proposed that a general factor (g) stands at the center of 
one's cognitive capacity, and that (perhaps numerous) specific factors (s1, s2, s3, . . . sn)
are related to the general factor and help explain nuances and specialized
characteristics observed in individuals. Spearman noticed that all measures of intelligence
were positively correlated with academic performance, leading him to think that a 
common construct (the general ability factor [g]) underlay these measures and cre- 
ated the positive associations. Figure 10.1 provides a pictorial representation of 
Spearman's theory. 

Spearman also noticed that as he began to aggregate (i.e., add together) scores 
obtained on the simple sensory tasks and the reasoning and comprehension tasks 
commonly associated with intelligence at that time, the measures correlated in the 
0.30s with academic performance (Francher, 1985), substantially enhancing the pre- 
dictive usefulness of these tasks. Spearman became convinced that all measures of in- 
telligence were simply facets related to the general intelligence factor (g). Thus two 
tests measuring different facets of intelligence would overlap to some extent, depending
on the strength of their relationship to g. He reasoned that if all intelligence tests
measured only general mental ability, the correlations between these tests would
approach r = 1.00. However, because these correlations were significantly less than
r = 1.00, he assumed that a diverse set of specific factor elements (s1, s2, etc.) were what
prevented the perfect correlations. "Spearman referred to g as the total mental energy
available to a person while the s factors were the engines through which this energy
was applied" (Janda, 1998, p. 209). Some cognitive tasks required more general
ability (g) than others, but all cognitive tasks required at least some. Spearman's
two-factor theory of intelligence was an important advance but was far from universally
accepted.
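
Spearman's reasoning about imperfect correlations can be sketched numerically. The simulation below is an illustration only, not Spearman's own procedure. It assumes that each standardized test score is a weighted sum of a shared general factor and a test-specific factor, with hypothetical g loadings of 0.8 and 0.7. Under that assumption the expected correlation between two tests is the product of their g loadings, so it reaches 1.00 only when neither test carries any specific component.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def correlation_between_tests(loading_a, loading_b, n=5000, seed=7):
    """Simulate two tests that share g but have independent specific factors.

    loading_a and loading_b are hypothetical g loadings between 0 and 1.
    """
    rng = random.Random(seed)
    test_a, test_b = [], []
    for _ in range(n):
        g = rng.gauss(0, 1)    # general factor, shared by both tests
        s_a = rng.gauss(0, 1)  # specific factor for test A only
        s_b = rng.gauss(0, 1)  # specific factor for test B only
        # Weights are chosen so that each test score has unit variance.
        test_a.append(loading_a * g + (1 - loading_a ** 2) ** 0.5 * s_a)
        test_b.append(loading_b * g + (1 - loading_b ** 2) ** 0.5 * s_b)
    return pearson_r(test_a, test_b)


if __name__ == "__main__":
    r = correlation_between_tests(0.8, 0.7)
    print(f"simulated r = {r:.2f}; product of loadings = {0.8 * 0.7:.2f}")
```

In this sketch, the more a test depends on g, the more strongly it correlates with other tests, which mirrors the point that some cognitive tasks require more general ability than others.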






Multiple-Factor Models 



Multiple-factor theories propose that one's intellectual makeup is composed of 
many components that are more or less independent of each other. For example, 
while most people have normal or average verbal and visual-spatial reasoning abili- 
ties, others may be weak in both areas, and still others may be strong in both areas. 
Notice that, so far, this is in keeping with Spearman's general-factor theory. However, 
many people are normal or strong in verbal reasoning, but weak in visual-spatial rea- 
soning, and vice versa. The intellectual structure of these individuals is not explained 
by a single, general factor but is better explained by a theory that suggests that these 
two factors are independent and should vary according to individual cognitive 
strengths and weaknesses. Of course, the more factors that are included in the the- 
ory, the more complex the scenarios can become. 

Thurstone's Primary Mental Abilities 

An American psychologist, Louis L. Thurstone, from the University of Chicago, pro- 
posed that a collection of mostly independent primary abilities underlay intelligence, 
rather than the global general factor and multitude of specific factors proposed by 
Spearman. Interestingly, one of the things we know today about factor analysis that 
wasn't widely known 75 years ago is that the number of factors derived is in large 
part due to the number and diversity of the input (i.e., items, subtests, tests). Using 
the statistical technique multiple-factor analysis, Thurstone analyzed responses of 
more than 200 college students to 56 ability tests and derived 13 mental factors. He 
eventually settled on seven primary mental abilities, described in Table 10.1. It is 
important to understand that even Thurstone admitted that these factors were not
totally independent, and that any given intelligence test could measure one, several,
or even all of these dimensions. In fact, Thurstone developed the Primary Mental
Abilities intelligence test in 1938 to do just that. Unfortunately, Thurstone's own test
showed that several of the factors were highly correlated (e.g., the Verbal and
Reasoning factors correlated nearly r = 0.60), calling into question the independence
of these components of intelligence. Of course, critics of multiple-factor models were
quick to explain this observation by using Spearman's general-factor model. Perhaps
the most damaging contradiction of Thurstone's model is the inclusion of a total-scale
score for the Primary Mental Abilities test, an admission, although perhaps inadvertent,
that a general global factor has some interpretable meaning or predictive usefulness.

Table 10.1 Thurstone's seven primary mental abilities

Ability
    Description

Verbal Comprehension (V)
    Assesses understanding and expression of ideas using language. (V) is measured by tasks
    involving vocabulary, analogies, and reading comprehension.

Number (N)
    Assesses ability to solve numeric problems using basic math processes. (N) is measured by
    tasks involving rapid, accurate computation of simple math problems, story problems, and
    math calculation.

Word Fluency (W)
    Assesses fluency of speech and writing. (W) is measured by tasks such as anagrams and word
    naming (e.g., words ending in -ing).

Spatial (S)
    Assesses ability to visualize patterns and rotate objects in space. (S) is measured by tasks
    involving three-dimensional visualization, matrices, and block designs.

Reasoning (R)
    Assesses inductive thinking and problem solving. (R) is measured by tasks involving logic,
    discerning a rule of operation or pattern, and number sequence patterns.

Memory (M)
    Assesses rote memorization of information. (M) is measured by tasks involving recall of
    sentences, letters, digits, words, etc.

Perceptual Speed (P)
    Assesses ability to quickly note and discriminate visual details. (P) is measured by tasks
    involving identification of similarities and differences in pictures or geometric objects.

Table 10.2 Factors of the Horn-Cattell model

Designation  Name
    Description

Gf  Fluid intelligence
    Nonverbal reasoning, novel circumstances

Gc  Crystallized intelligence
    General knowledge, verbal comprehension and reasoning

Gq  Quantitative ability
    Understanding and problem solving using mathematical concepts and symbols

Gv  Visual processing
    Receiving and making decisions using visual and spatial stimuli

Ga  Auditory processing
    Receiving and making decisions using auditory stimuli

Gs  Processing speed
    Ability to maintain attention and make quick, accurate decisions

Gsm  Short-term memory
    Ability to maintain and use information over a short time period (seconds to minutes)

Glr  Long-term retrieval
    Ability to encode and store information for retrieval and use over a long time period
    (hours to years)

Horn-Cattell Gc/Gf model

Raymond Cattell (1943, 1963, 1971, 1979) proposed that intellectual abilities could 
be divided into two broad categories or second-order factors. Fluid abilities (Gf) were
primarily inherited, perceptual capabilities thought to be mostly free of potential so- 
ciocultural bias. Tests measuring visualization, nonverbal, and spatial reasoning capa- 
bilities are direct assessments of fluid ability. Crystallized abilities (Gc) were primarily 
learned, acquired knowledge and skills that were socioculturally laden and heavily in- 
fluenced by formal and informal educational experiences. Tests measuring vocabulary, 
general information, verbal abstract reasoning, and social comprehension directly as- 
sess crystallized ability. Importantly, Cattell proposed that fluid and crystallized abil- 
ities are significantly correlated, especially among those who share a common cultural 
and educational background. Thus no pretense of factor independence was offered. 

In 1966, Cattell and John Horn became the major proponents of this model, 
and the model was expanded by Horn and his colleagues in subsequent years to add 
on additional factors derived through rational and factor analytic studies of multiple 
test batteries. Currently, the Horn-Cattell model espouses eight components (see
Table 10.2), many of which have more or less provided the theoretical underpinnings
of the Stanford-Binet Intelligence Scales, now in its fifth edition (SB-5) (Roid,
2003), and to a greater extent, the Woodcock-Johnson Tests of Cognitive Abilities–
Third Edition (WJ-III COG) (Woodcock, Mather, & McGrew, 2001).

[Figure 10.2 Guilford's structure-of-intellect model: a three-dimensional box with axes
labeled Contents, Products, and Operations.]



Guilford's Structure-of-Intellect Model



Guilford (1967, 1988; Guilford & Hoepfner, 1971) also used factor analysis to discern
a model of intellect but arrived at quite different conclusions than Spearman or
Vernon about the existence of g, and he rejected Thurstone's argument of the existence
of a number of independent primary mental abilities. Instead, Guilford proposed
a theory in which 3 dimensions gave rise to approximately 180 unique specific
factors (see Figure 10.2), as expressed within a 6 x 5 x 6 boxlike matrix. The first
dimension, mental operations, indicates what an individual does and includes 6
components: cognition, memory recording, memory retention, divergent production,
convergent production, and evaluation. The second dimension, contents, indicates
the materials upon which the individual performs various operations and includes 5
components: visual, auditory, symbolic, semantic, and behavioral. The final dimension,
products, indicates the format into which individuals store and process information
and includes 6 facets: units, classes, relations, systems, transformations, and
implications. Each of the resulting 180 cells may contain a specific factor or a
combination of specific factors, but each factor can be described in terms of its 3
components. Guilford's model has had little impact on the standardized measurement of
intelligence, but nonetheless is a helpful model for understanding intelligence,
particularly as applied to education.

[Figure 10.3 Hierarchical ability model proposed by Vernon and Humphreys: general
intelligence (g) at the top; second-order factors Verbal:Educational (v:ed) and Practical
(k:m); major facets/factors Verbal, Quantitative, Mechanical, and Spatial; specific
facets/factors below; and the Humphreys modification for nonassigned specific facets/factors.]
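
The arithmetic behind the 180 cells of Guilford's structure-of-intellect model can be checked by enumerating every combination of the three dimensions. The short sketch below is only an illustration of the 6 x 5 x 6 layout. The component names are taken from the text, while the variable and function names are invented for this example.

```python
from itertools import product

# Component names as listed in the text; the identifiers are illustrative.
OPERATIONS = [
    "cognition", "memory recording", "memory retention",
    "divergent production", "convergent production", "evaluation",
]
CONTENTS = ["visual", "auditory", "symbolic", "semantic", "behavioral"]
PRODUCTS = [
    "units", "classes", "relations",
    "systems", "transformations", "implications",
]


def guilford_cells():
    """Return every (operation, content, product) combination as a tuple."""
    return list(product(OPERATIONS, CONTENTS, PRODUCTS))


if __name__ == "__main__":
    cells = guilford_cells()
    print(len(cells))  # 6 * 5 * 6 combinations
    print(cells[0])    # the first cell in the enumeration
```

Each tuple names one cell of the boxlike matrix, which is what it means to say that every specific factor can be described in terms of its 3 components.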



Hierarchical Models 



Vernon (1960, 1965) suggested a model of intelligence that in some ways is a com- 
promise between the divergent theories proposed by Spearman and Thurstone. 
Vernon agreed that g underlay all facets of intelligence but noticed that correlations
among certain clusters of various types of intelligence tests or subtests were too high
to conclude that g was the only factor accounting for the relationship. He proposed
that two second-order factors comprised g, namely Verbal:Educational (v:ed) and
Practical (k:m) aptitudes. From these second-order factors, various skill areas branch
off, which may be broken down into even lower-level facets (see Figure 10.3). For
example, the Verbal:Educational factor may be assessed using tests measuring verbal
comprehension and
quantitative skill. Verbal comprehension skills may be further delineated and assessed 
by tests measuring vocabulary development, social comprehension, general informa- 
tion, and verbal abstract reasoning. These latter tests are more similar to the s factors 
proposed by Spearman or the individual cells proposed by Guilford. 




Other hierarchical models, such as the one proposed by Humphreys (1962, 
1970), argued for more flexibility in accounting for or assigning specific factors to 
higher-level factors. For example, it can be argued that in testing one's ability to solve 
analogies, it is helpful to use spatial, verbal, and numerical cues, each of which is rep- 
resented by specific factors. While the practical and theoretical applications of hier- 
archical models have allowed them to grow in popularity (Anastasi & Urbina, 1997), 
a primary limitation remains the lack of empirical validation of the model (Sax, 
1997). 

Sternberg's Triarchic Theory: An Information Processing Approach 

Sternberg (1988), using an information processing perspective, described a triarchic 
model, so named because it was composed of three aspects (subtheoretical compo- 
nents) of intelligence: componential (the person's internal world), experiential (the 
person's external world and adaptation to novelty), and contextual (the person's exter- 
nal world and environmental adaptation or creation). This theory arose from 
Sternberg's (1986, p. 33) belief that intelligence involved "purposive adaptation to, 
shaping of, and selection of real-world environments relevant to one's life." Sternberg 
stated that available tests of intelligence failed to measure the complex processes pro- 
posed in his theory. Sternberg's primary criticism of currently available intelligence 
tests is that they measure primarily memory and analytical reasoning skills that are 
useful in predicting school performance, predominantly because they are contextu- 
alized to school and learning problems, are short, and have a single correct answer. 
He believes these tests have little usefulness in predicting "real-world" performances 
people encounter in the world of work; what some call practical intelligence. 

In the componential subtheory, Sternberg identified three facets as being critical 
to the efficiency with which individuals process information. Metacomponents allow 
people to plan purposeful activities, self-monitor the implementation of these plans, 
and self-evaluate the effectiveness of the implementation. These are higher-level cog- 
nitive processes, sometimes called executive functioning, that help explain why some 
very bright and talented people accomplish a lot and others accomplish very little. 
According to Sternberg, the very intelligent person focuses on important tasks and 
issues — what some refer to as the "big picture" — plans them out, and accomplishes 
them. Less intelligent people focus on issues and situations that are less important — 
what some call the "little picture." Performance components allow individuals to 
process diverse information with varying degrees of efficiency by using mental skills 
such as information retrieval, encoding, or comparing. Knowledge acquisition in- 
volves an individual's capacity to select information relevant to a given problem con- 
text and then to compare and combine it with other relevant information, leading 
to insights, connections, and, eventually, new learning. Obviously, the more efficient 
one is at making relevant connections and gaining necessary insights, the greater 
one's capacity for learning (i.e., intelligence). 

The experiential subtheory views intelligence as an interplay of experience and in- 
formation processing. Thus, experienced individuals often appear more intelligent but 
only because they have encountered a problem in the past and recall how to resolve it
appropriately. According to Sternberg, novel situations present a level playing field to
determine adaptability and problem solving, because such circumstances favor those 
who process information more quickly and efficiently. In this way, Sternberg valued 
"automaticity," the ability to quickly learn information, processes, and procedures, 
thus freeing up the resources necessary for adaptation to novel situations. 

Finally, Sternberg's contextual subtheory involves adaptability in the external 
world, the context for practical, pragmatic decision making that allows humans to 
shape, adapt, and select environments in which to thrive. For example, we have all 
known individuals who did not do well in school but had a knack for adapting to 
new situations (contexts) and who do quite well for themselves. These individuals 
read and adapt to the environmental context. 

In 1994, Sternberg refined his theory by altering his terminology to include the 
terms memory-analytic, synthetic-creative, and practical-contextual abilities. Sternberg 
viewed memory-analytic functions as commonplace in education and science today, 
where people construct defined and delimited problems with predictable and "cor- 
rect" solutions. Synthetic-creative problems are those that are not entrenched in 
common assumptions, such as when an illogical assumption is given and the exam- 
inee is required to follow the assumption to its inevitable conclusion. Such out-of- 
the-box thinking requires flexible cognitive and reasoning processes that are difficult 
to teach, but which are nonetheless critical to creative problem solving.
Practical-contextual abilities, also termed tacit knowledge or practical intelligence, were defined
as "action-oriented knowledge, acquired without direct help from others, that allows 
individuals to achieve goals they personally value" (Sternberg, Wagner, Williams, & 
Horvath, 1995, p. 916). Practical-contextual tasks help explain why some individu- 
als who score low on traditional tests of intelligence are able to solve sometimes com- 
plex everyday situations with more ease than their "more intelligent" counterparts. 
As an application of Sternberg's theory, Table 10.3 contains the types of items de- 
rived from a triarchic model. 

Fundamental to Sternberg's theory is that intelligence is not set; it is malleable 
and continually developing. Moreover, the display of an individual's intelligence can 
vary from one context to another; that is, people may be absolutely brilliant when in 
their "element" (e.g., the board room or chemistry lab), but substantially less so when
not (e.g., the kitchen or nursery).



Gardner's Multiple Intelligences 



Howard Gardner (1983, 1993) rejected the existence of g and identified eight dis- 
tinct intelligences that aid in an individual's adaptation to the environment. He de- 
fined intelligence as the ability "to resolve general problems or difficulties as they are 
encountered" (Gardner, 1983, p. 60) and identified the following eight intelligences: 
(1) verbal-linguistic, (2) logical-mathematical, (3) spatial, (4) musical, (5) bodily- 
kinesthetic, (6) interpersonal, (7) intrapersonal, and (8) naturalist (see Table 10.4). 
Gardner criticized current tests of intelligence for being primarily measures of verbal, 
spatial, and logical reasoning while ignoring other abilities that are, in some ways, so 



Assessment of Intelligence 331 



Table 10.3 Item types derived from Sternberg's triarchic model 



Item type 



Description 



Componential: Verbal 



Componential: Quantitative 



Componential: Figural 



Assesses a student's verbal ability when learning from relevant contexts, such as when a 

word is used in the context of a sentence and a student is asked to infer the word's meaning 

from context. 

Assesses numerically based inductive reasoning abilities by extrapolating from sequences of 

numbers. For example: When given the following sequence of numbers: 2, 4, 8, 16, ? : 

the student would choose 32 from a list of possible answers. 

Assesses inductive reasoning abilities through figure classifications and analogies. For 

example: 



O 



B. 





(b) 




(c) 




Coping With Novelty: Verbal 



Coping With Novelty: 
Quantitative 



Assesses the ability to think in relatively novel ways using hypothetical thinking or novel 
verbal analogies requiring counterfactual reasoning. For example: Assume snowflakes are 
made of sand. Which solution is now correct, given the assumption? Water is to drop as 
snow is to: (a) storm, (b) beach, (c) grain, (d) ice. 

Assesses quantitative coping with novelty skills by using number matrix items, but with an 
element of novelty. Usually, items involve symbols used in place of certain numbers and 
require the examinee to make a number substitution. For example: 



12 



Coping with novelty: Figural 



(a) 14, (b) 4, (c) 17, (d) 8. 

Assesses a student's ability to complete a pictorial series in a "newly mapped domain," (not 
the domain in which the student has constructed or inferred the rule). For example, 



A. A 




□ 



B. 



□ 



continued 



332 Chapter 10 



Table 10.3 continued 



Item type 



Description 



(a) 




( c, D 



(d) 




Automatization: Verbal 



Automatization: Quantitative 



Automatization: Figural 



Assesses rapid decisions of a verbal nature. For example, are the following letters from the 

"same" category (both vowels, both consonants) or "different" categories (vowel or 

consonant): "b, n" (same); "e, m" (different); "u, o" (same); "g, i" (different). 

Assesses rapid decisions of a quantitative nature. For example, are the following numbers 

from the "same" category (both odd, both even) or "different" categories (odd or even): "2, 

4" (same); "9, 6" (different); "7, 3" (same); "8, 5" (different). 

Assesses rapid decisions of a figural nature. For example, do the following figures have the 

"same" or "different" numbers of sides? 




c. 



Practical: Verbal 



Practical: Quantitative 



Practical: Figural 



Assesses practical, everyday problem-solving abilities requiring verbal inferential reasoning. 
For example: The sign at Bill's Market reads, "The lowest meat prices in town." If the ad is 
for real, which of the following is most likely true? 

(a) Bill's Market charges more than Sam's. 

(b) No other market charges less than Bill's. 

(c) Bill is a successful businessman. 

(d) Bill's is the busiest market in town. 

Assesses practical, everyday problem-solving abilities requiring quantitative reasoning. For 
example: Given a recipe for making two dozen cookies and an inventory of ingredients 
tin rent ly in the house, the examinee may be asked, "How many dozen cookies could be 
baked without having to go to the store for more supplies?" 

Assesses practical, everyday problem-solving abilities requiring figural reasoning. For 
example: A student may be shown a town map and be asked to chart the shortest route 
from one place in the town to another. 



much more important in adapting to the environment and solving real-world prob- 
lems. For example, intelligence tests rarely identify outstanding musical, athletic, or 
intrinsic motivation potential. Gardner's relatively independent intelligences were 



Assessment of Intelligence 333 



Table 10.4 Howard Gardner's multiple intelligences 



Intelligence: Description

Linguistic: The ability to use language to express ideas and understand others. Linguistic intelligence is displayed by lawyers, teachers, orators, writers, and linguists.

Logical-Mathematical: The ability to understand underlying causal systems, inductive and deductive logic, scientific reasoning, numerical reasoning, and numerical operations. Logical-mathematical intelligence is displayed by mathematicians, logicians, scientists, and engineers.

Spatial: The ability to understand, visualize, and manipulate mental images, graphic representations, or objects in space. Spatial intelligence is displayed by sculptors, painters, surgeons, architects, and navigators.

Musical: The ability to think musically and rhythmically by hearing, remembering, and manipulating patterns. Musical intelligence is displayed by musicians of any kind.

Bodily-Kinesthetic: The ability to use one's body to solve complex motor problems through awareness and control of motor functions. Bodily-kinesthetic intelligence is displayed by athletes, dancers, actors, and seamstresses.

Interpersonal: The ability to understand and work with other people, read their verbal and nonverbal communication, be sensitive to the feelings of others, and solve problems of an interpersonal nature. Interpersonal intelligence is displayed by professional counselors, salespeople, managers, politicians, and just about anyone else who has to deal with people problems.

Intrapersonal: The ability to understand oneself; what one can do, can't do, self-motivations, propensities, and aversions. Intrapersonal intelligence involves metacognition, self-awareness, and abstract thinking. It relies on self-awareness and is important in virtually any endeavor.

Naturalist: The ability to discriminate among and classify objects. Naturalist intelligence is displayed by farmers, botanists, hunters, and chefs.



identified through a process that involved several criteria, including occurrence 
across cultures, the effects of localized brain damage, and the distinct history of 
exceptional ability. 

While Gardner does not dispute the importance of genetics, he clearly points 
out that intelligence stems from an interaction between heredity and environment. 
For example, consider a case in which two children of equal musical talent are born 
into two separate families. The first family values musical talent and expends great 
time and effort to cultivate Johnny's burgeoning skills. The second family not only 
doesn't value musical talent, but actively punishes its expression whenever possible, 
frequently telling the child, "Stop playing with violins and cellos, Jimmy. You'll have 
no need of them in your career as a professional counselor!" Certainly, the odds of 
developing substantial musical intelligence are in Johnny's favor. Gardner's theory is 
thought-provoking and has received much attention in classrooms and schools 
around the United States. Unfortunately, there are numerous problems when trying 
to measure several of the intelligences, and the empirical support behind the theory 
is less than robust. 






Some Final Thoughts on the (Practical) Nature of Intelligence 



Richard Herrnstein and Charles Murray (1994), in their very controversial book The 
Bell Curve: Intelligence and Class Structure in American Life, categorized the theories 
proposed by Spearman, Binet, and their contemporaries as classicist. The common 
thread to classicist models was the adherence to a unifying factor, g, at the center of 
intellectual being. Another broad category proposed by Herrnstein and Murray was 
the revisionist models. Revisionist theories proposed that there was indeed a unifying 
factor, g, at the center of cognitive structure, but g was composed of several second- 
ary factors (i.e., verbal reasoning, nonverbal reasoning, working memory, processing 
speed), each of which contributes to one's total cognitive makeup. Furthermore, re- 
visionist models assert that individual clients can have strengths and weaknesses in 
each of these processing categories. In the end, it is the combination of these 
strengths, weaknesses, or normal capacities that makes up one's total cognitive 
function (g). However, various patterns or combinations of cognitive skills, while perhaps 
resulting in the same estimate of overall intelligence (g), may lead to very different 
results in terms of how problems are solved. For example, when required to write an 
extensive report about some social phenomenon, a client with excellent verbal rea- 
soning skills and poor nonverbal reasoning skills may be able to excel at the task, 
while a client with the identical overall IQ (g), but with low verbal reasoning skills 
and outstanding nonverbal reasoning skills, may struggle mightily. Some well-known 
revisionist models include those of Vernon, Horn-Cattell, and Guilford. To this day, 
psychometricians and statisticians continue to debate whether intelligence can be 
meaningfully represented by a single global score (the classicist position) or is global 
with multidimensional refinements (the revisionist position). 
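
The point that identical overall estimates can hide very different profiles can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two clients and their scores are invented, the four areas simply reuse the secondary factors named above, and a plain average stands in for g. Published intelligence tests derive composite scores from norm tables, not from a simple mean.

```python
from statistics import mean

# Hypothetical index scores (mean 100, SD 15 scale) for two imaginary clients.
# The scores are invented for illustration and come from no real test.
client_a = {"verbal_reasoning": 125, "nonverbal_reasoning": 85,
            "working_memory": 100, "processing_speed": 98}
client_b = {"verbal_reasoning": 85, "nonverbal_reasoning": 125,
            "working_memory": 98, "processing_speed": 100}


def overall_estimate(profile):
    """Simplified stand-in for g: the plain average of the index scores."""
    return mean(profile.values())


def strengths_and_weaknesses(profile, band=10):
    """Label each area relative to the client's own average.

    The 10-point band is an arbitrary cutoff chosen for this sketch.
    """
    center = overall_estimate(profile)
    labels = {}
    for area, score in profile.items():
        if score >= center + band:
            labels[area] = "strength"
        elif score <= center - band:
            labels[area] = "weakness"
        else:
            labels[area] = "typical"
    return labels


for name, profile in (("Client A", client_a), ("Client B", client_b)):
    print(name, overall_estimate(profile), strengths_and_weaknesses(profile))
```

Both imaginary clients receive the same overall estimate, yet Client A's profile flags verbal reasoning as a strength and nonverbal reasoning as a weakness, while Client B's profile is the reverse. This is the pattern the revisionist position argues a single global score cannot convey.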

Herrnstein and Murray (1994) referred to a third movement within the field of 
intellectual assessment as the radicals, and pointed to Gardner's theory of multiple 
intelligences as a prime example. Gardner rejects the existence of g and lauds the 
independence of the intelligences he has identified. While the theory is appealing in 
its own right and widely used in education, little empirical support exists for the 
independent nature of the identified intelligences. 

While development, classification, and description of intelligence tests is cer-