Assessment for Counselors
Bradley T. Erford
Loyola College in Maryland
BROOKS/COLE
CENGAGE Learning
Australia • Brazil • Japan • Korea • Mexico • Singapore • Spain • United Kingdom • United States
Dedication
This effort is dedicated to The One: the Giver of energy, passion, and
understanding; who makes life worth living and endeavors worth
pursuing and accomplishing; the Teacher of love and forgiveness.
Assessment for Counselors
Bradley T. Erford
Publisher: Barry Fetterolf
Senior Editor: Mary Falcon
Editorial Assistant: Evangeline Bermas
Senior Project Editor: Kimberly Gavrilles
Art and Design Manager: Gary Crespo
Composition Buyer: Chuck Dutton
Associate Manufacturing Buyer:
Brian Pieragostini
Director of Sales and Marketing:
Heather Murray
Cover image © Mark Stephen/
theispot.com
© 2007 Brooks/Cole, Cengage Learning
ALL RIGHTS RESERVED. No part of this work covered by the copyright
herein may be reproduced, transmitted, stored, or used in any form or by
any means graphic, electronic, or mechanical, including but not limited to
photocopying, recording, scanning, digitizing, taping, Web distribution,
information networks, or information storage and retrieval systems,
except as permitted under Section 107 or 108 of the 1976 United States
Copyright Act, without the prior written permission of the publisher.
For product information and technology assistance, contact us at
Cengage Learning Customer & Sales Support, 1-800-354-9706.
For permission to use material from this text or product, submit
all requests online at www.cengage.com/permissions.
Further permissions questions can be emailed to
permissionrequest@cengage.com.
Library of Congress Control Number: 2006923762
ISBN-13: 978-0-618-49291-6
ISBN-10: 0-618-49291-7
Brooks/Cole
20 Davis Drive
Belmont, CA 94002-3098
USA
Cengage Learning is a leading provider of customized learning solutions
with office locations around the globe, including Singapore, the United
Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at:
www.cengage.com/global.
Cengage Learning products are represented in Canada by Nelson
Education, Ltd.
To learn more about Brooks/Cole, visit www.cengage.com/brookscole.
Purchase any of our products at your local college store or at our preferred
online store www.cengagebrain.com.
Printed in the United States of America
5 6 7 8 9 10 13 12 11
CONTENTS
PART I
Preface xiii
Acknowledgments xiv
About the Authors xv
Chapter 1 Basic Assessment Concepts Bradley T. Erford 1
Assessment and Counseling 1
What Is Assessment? 2
The Purpose of Assessment 5
How Is Assessment Used in Counseling? 8
Assessment Competence and Professional Counselors 9
Training Standards for Professional Counselors 10
Professional Counselor Organizations and Assessment 10
Assessment Training Standards 12
Assessment Terms and Concepts 21
Standardized (Formal) and Nonstandardized (Informal) Tests 21
Norm-Referenced and Criterion-Referenced Tests 22
Individual and Group Tests and Inventories 23
Objective and Subjective Tests 23
Speed and Power Tests 23
Verbal and Nonverbal Tests 24
Cognitive and Affective Tests 26
Maximum and Typical Performance Measurement 27
Behavioral Observations 28
Basals, Starting Points, and Ceilings 28
Reliability 32
Validity 33
Formative Versus Summative Evaluation 34
Pencil-and-Paper Tests and Performance (Authentic) Assessment 34
Portfolio Assessment 36
Environmental Assessment 38
Computer-Managed, Assisted, and Adapted Assessment 38
Summary/Conclusion 42
Key Terms 42
Chapter 2 Foundations of Assessment:
Historical, Legal, Ethical, and Diversity Perspectives
Bradley T. Erford, Cheryl Moore-Thomas, and Lynn Linde 45
The History of Assessment 45
Ancient Times 48
Measurement in the Laboratory 49
Modern Clinical Applications of Assessment: Decision Making
and Determination of Individual Differences 50
Public and Professional Concerns About Assessment 62
Decisions About People's Lives Should Not Be Made on the Basis
of a Single High-Stakes Test Score 64
Tests Are Biased and Unfair to Minorities and Women 64
Tests Create Anxiety and Stress 65
Tests Label and Categorize 65
Test Developers Dictate What Students Must Know or Learn 66
"Teaching to the Test" Inflates Scores 67
Multiple-Choice Questions Punish Intelligent, Creative Thinkers;
Trivialize the Complexities of the Learning Process;
and Reward Good Guessers 67
Learning From Past Mistakes and Criticisms 68
Ethics and Assessment 69
Ethical Decision Making 78
Legal Issues in Assessment 80
The Family Educational Rights and Privacy Act of 1974 (FERPA)
and Related Legislation 81
Minimal Competency Assessment and the No Child Left Behind Act
of 2001 83
The Individuals With Disabilities Education Improvement Act
of 2004 (IDEIA) and Related Legislation 84
The Health Insurance Portability and Accountability Act
of 1996 (HIPAA) 86
Guidelines of the Equal Employment Opportunity Commission
(EEOC) 87
The Americans With Disabilities Act of 1990 (ADA) 88
Court Decisions Related to Diversity in Assessment 88
Diversity Issues in Assessment 90
Understanding Diversity 90
Standards for Multicultural Assessment 91
Diversity Factors Involved in Assessment 91
Bias in Assessment 94
Content Bias 94
Internal Structure Bias 95
Predictive Bias 95
Interpreting Test Scores With Caution 95
Ensuring Fairness in Assessment 96
Summary/Conclusion 97
Key Terms 97
Chapter 3 Reliability Dimiter Dimitrov 99
What Is Reliability? 99
The Classical Model of Reliability 101
True Score 101
The Classical Definition of Reliability 102
Standard Error of Measurement (SEM) 102
Types of Reliability 105
Internal Consistency 105
Test-Retest Reliability 108
Alternate Forms Reliability (Equivalent Forms Reliability) 109
Reliability of Criterion-Referenced Tests 110
Interscorer and Interrater Reliability 113
The Importance of Reliability 114
Reliability in Validation 114
Attenuation 114
Reliability of Composite Scores 116
Reliability of Sum of Scores 116
Reliability of Difference Scores 118
Reliability of Weighted Sums 119
Summary/Conclusion 120
Key Terms 121
Chapter 4 Validity Alan Basham and Bradley T. Erford 123
Validity Defined 123
Face Validity 124
Content-Related Validity 125
Criterion-Related Validity 126
Standard Error of Estimate 128
Construct Validity 131
The Interaction of Reliability and Validity 133
Validity and Testing Practice 133
The Application of Validity: Decision Making
Using Test Scores 134
Decision Making Using a Single Score 134
Decision Making Using Multiple Tests 141
Summary/Conclusion 157
Key Terms 157
Chapter 5 Selecting, Administering, Scoring, and Interpreting
Assessment Instruments and Techniques
R. Anthony Doggett, Carl J. Sheperis, Susan Eaves,
Michael D. Mong, and Bradley T. Erford 159
Test Selection 159
Test Administration 160
Administrator Requirements 160
Examinee Preparation 162
Environmental Concerns 163
Testing Procedures 163
Factors Affecting Test Scores 164
Test Scoring 165
Professional Standards in Testing 166
Norm-Referenced Interpretation 168
Developmental Equivalents 168
Scores of Relative Standing 170
Percentile Ranks 172
Applying Standard Error of Measurement (SEM) to Test Scores 173
Criterion-Referenced Interpretation 180
Single-Skill Scores 180
Multiple-Skill Scores 180
Sources of Information About Tests 181
Published Resources 182
PRO-ED 183
Publisher Catalogs 184
Professional Journals and Textbooks 184
Electronic Resources 184
Common Errors 185
Summary/Conclusion 187
Key Terms 188
Chapter 6 How Tests Are Constructed
Carl J. Sheperis, Carey Davis, and R. Anthony Doggett 189
Purpose of the Test 190
Examinees 191
Goals and Theory 191
Norm Referenced or Criterion Referenced 191
Objectives 192
Scaling 192
Approaches to Test Construction 194
A Test Development Example 194
Observables 196
Defining Observables 197
An Example of Observables 198
Item Generation 198
Allocating Proportionate Numbers of Items 199
Selecting an Item Format 199
Descriptions of Item Formats 199
An Example of Item Generation 201
Technical Analyses 201
Item Difficulty 202
Item Discrimination 203
Norms 204
Summary/Conclusion 204
Key Terms 206
PART II
Chapter 7 Clinical Assessment Bradley T. Erford, Carol Salisbury,
Kathleen McNinch, Carl J. Sheperis, R. Anthony Doggett,
and Masanori Ota 207
What Is Clinical Assessment? 207
Cautions Within Clinical Assessment 209
Clinical Judgment Versus Statistical Models 213
Clinical Interviewing 214
Three Types of Interviews: Unstructured, Semi-Structured,
and Structured 214
The Intake Interview 216
Mental Status Exam 217
Strengths and Limitations of Interviewing 219
Counseling, Diagnosis, and the DSM-IV-TR 221
Using the DSM-IV-TR: Multiaxial Diagnosis 223
Axis I Disorders — Clinical Disorders and Other Conditions That May Be
a Focus of Clinical Attention 226
Axis II Disorders — Personality Disorders and Mental Retardation 229
Axis III — Current Medical Conditions 229
Axis IV — Psychosocial and Environmental Problems 230
Axis V — Global Assessment of Functioning (GAF) 230
Diagnostic Decision Making Using the DSM-IV-TR 231
Using Clinical Inventories and Tests in Counseling 234
Information Sources for Clinical and Personality Assessment 234
How Clinical and Personality Test Content Is Developed 235
Some Commonly Used Clinical Assessment Inventories 237
Minnesota Multiphasic Personality Inventory — Second Edition
(MMPI-2) 237
Minnesota Multiphasic Personality Inventory-Adolescent (MMPI-A) 241
Millon Clinical Multiaxial Inventory — III (MCMI-III) 246
Millon Adolescent Clinical Inventory (MACI) 253
Achenbach System of Empirically Based Assessment (ASEBA) 254
Personality Inventory for Children — Second Edition (PIC-2) 257
Devereux Scales of Mental Disorders (DSMD) 258
Children's Depression Inventory (CDI) 258
Reynolds Adolescent Depression Scale — Second Edition (RADS-2) 259
Symptom Checklist-90-Revised (SCL-90-R) 260
Beck Depression Inventory — Second Edition (BDI-II) 260
Beck Anxiety Inventory (BAI) 261
Beck Scale for Suicide Ideation (BSSI) 262
Substance Abuse Subtle Screening Inventory — 3 (SASSI-3) 263
Eating Disorder Inventory — 3 (EDI-3) 264
Summary/Conclusion 265
Key Terms 265
Chapter 8 Personality Assessment
Bradley T. Erford, Kathleen McNinch, and Carol Salisbury 267
What Is Personality? 267
The Purpose of Personality Assessment 268
Trait Approaches to Personality Assessment 269
Strengths and Limitations of the Trait Approach 271
Some Commonly Used Structured Personality
Assessment Inventories 273
Revised NEO Personality Inventory (NEO-PI-R) 273
16 Personality Factors (16PF) Questionnaire 275
Myers-Briggs Type Indicator — Form M (MBTI) 279
Millon Index of Personality Styles Revised (MIPS Revised) 281
Personality Assessment Inventory (PAI) 281
California Psychological Inventory (CPI) 282
Jackson Personality Inventory — Revised (JPI-R) 283
Piers-Harris Children's Self Concept Scale — Second Edition
(Piers-Harris-2) 286
Coopersmith Self Esteem Inventories 287
Tennessee Self Concept Scale-Second Edition (TSCS-2) 287
Projective Approaches to Assessment 288
Strengths and Weaknesses of Projective Techniques 295
Some Commonly Used Projective Techniques 296
Rorschach Inkblot Test 296
Thematic Apperception Test (TAT) 297
Children's Apperception Test-1991 Revision (CAT) 297
Roberts Apperception Test for Children-Second Edition (Roberts-2) 298
House-Tree-Person (H-T-P) Projective Drawing Technique 298
Kinetic Drawing System for Family and School (KDS) 300
Forer Structured Sentence Completion Test (FSSCT) 300
Summary/Conclusion 302
Key Terms 302
Chapter 9 Behavioral Assessment Carl J. Sheperis, R. Anthony Doggett,
Masanori Ota, Bradley T. Erford, and Carol Salisbury 303
What Is Behavioral Assessment? 303
Defining Behavior 304
Guidelines for Conducting Behavioral Assessment 305
Methods of Behavioral Assessment 306
Direct Assessment 306
Indirect Assessment 309
Behavioral Rating Scales and Inventories Used in Counseling 311
Conners' Rating Scales-Revised (CRS-R) 311
Attention Deficit Disorders Evaluation Scale-Third Edition (ADDES-3) 312
Behavior Assessment System for Children (BASC) 313
Disruptive Behavior Rating Scale (DBRS) 314
Coping Inventory for Stressful Situations (CISS) 315
Summary/Conclusion 317
Key Terms 317
Chapter 10 Assessment of Intelligence
Bradley T. Erford, Lauren Klein, and Kathleen McNinch 319
What Is Intelligence? 319
Nature and Theories of Intelligence 321
Historical Conceptualizations of Intelligence 321
Multiple-Factor Models 325
Guilford's Structure-of-Intellect Model 327
Hierarchical Models 328
Sternberg's Triarchic Theory: An Information Processing Approach 329
Gardner's Multiple Intelligences 330
Some Final Thoughts on the (Practical) Nature of Intelligence 334
Commonly Used Tests of Intelligence 335
Group-Administered Tests of Intelligence and School Ability 335
Individual Screening Tests of Intelligence 338
Individual Diagnostic Tests of Intelligence 340
Assessing Mental Retardation 350
Assessing Giftedness 352
Summary/Conclusion 354
Key Terms 354
Chapter 11 Assessment of Other Aptitudes
Bradley T. Erford and Kathleen McNinch 357
Aptitude Tests Designed for Admission Decisions 358
Commonly Used Admission Tests 359
Tests of General and Specific Aptitude 366
Multiaptitude Batteries 366
Measures of Special Abilities 375
Summary/Conclusion 384
Key Terms 384
Chapter 12 Assessment of Achievement
Bradley T. Erford and Kathleen Hall 385
Why Assess Achievement? 385
Uses of Achievement Tests in Counseling 387
Achievement Testing and Individuals With Special Needs 388
The Individuals With Disabilities Education Improvement Act
(IDEIA) 388
Section 504 of the U.S. Rehabilitation Act of 1973 392
Categorizing Achievement Tests 393
Group-Administered Multi-Skill Achievement Test Batteries 395
Individual Achievement Multi-Skill Test Batteries 406
Individual and Group-Administered Single-Skill Achievement Tests
for Reading 416
Individual and Group-Administered Single-Skill Achievement Tests
for Mathematics 422
Individual and Group-Administered Single-Skill Achievement Tests
for Written Expression 424
Tests of English Language Proficiency 429
Summary/Conclusion 432
Key Terms 432
Chapter 13 Assessment in Career Counseling
Deborah Newsome, Bradley T. Erford, and Kathleen McNinch 435
Purposes of Career Assessment 435
Assessing Interests 437
Tests Measuring Interests 440
Other Interest and Skill Inventories 454
Assessing Values and Life Role Salience 456
Commonly Used Tests Assessing Values and Life Role Salience 457
Other Measures of Career Values and Life Role Salience 458
Assessing Career Development and Career Maturity 460
Tests Used to Assess Career Development and Career Maturity 461
Summary/Conclusion 463
Key Terms 463
Chapter 14 Assessing Couples and Families
Debbie W. Newsome, Jon-Michael Brasfield, and Catherine Flemming 465
Purposes of Couple and Family Counseling 465
Rationale for Family Assessment 466
What Is Assessed? 467
Methods of Assessment 470
Formalized Assessment Instruments 470
Assessment of Couples 471
Other Instruments Used in Assessing Couples 481
Assessment of Families 482
Other Measures of Family Assessment 489
Qualitative Assessment of Family Relationships 490
Characteristics of Qualitative Assessment 490
Qualitative Assessment Methods 491
Mapping Activities 493
Sculpting Activities 498
Other Qualitative Methods 500
Summary/Conclusion 501
Key Terms 501
Appendix Responsibilities of Users of Standardized Tests
(RUST) (3rd Edition) Association for Assessment
in Counseling (AAC) 502
References 509
Name Index 554
Subject Index 560
PREFACE
Assessment is counseling and counseling is assessment! The evolving profession of
counseling has entered the age of accountability, regardless of specialization or prac-
tice venue. Managed care and school reform have become important forces driving
decision making in contemporary society. Given this context, the more a profes-
sional counselor knows about formal and informal assessment procedures, the more
informed, effective, and efficient the professional counselor's treatment of clients and
students can be.
A second driving force comes from within the counseling profession itself. After
many years of identity exploration and discussion, the counseling profession has
agreed to a basic core of education and training standards that all professional coun-
selors should meet. This book is designed to address the core curricular assessment
requirements of the Council for Accreditation of Counseling and Related
Educational Programs (CACREP), thereby providing state-of-the-art information
on assessment and tests that professional counselors need to know. But what makes
Assessment for Counselors different from other books is that it is written by profes-
sional counselors for professional counselors.
The first half of Assessment for Counselors provides important general informa-
tion about assessment, including basic concepts, historical developments, ethical and
legal implications, diversity issues, reliability, validity, test construction, and the se-
lection, administration, scoring, and interpretation of assessment instruments. The
second half of this book provides in-depth explorations of the major areas of assess-
ment that professional counselors either provide or of which they must be aware.
Embedded within these domains of counseling specialty, this text includes reviews of
more than 100 commonly used tests in the areas of clinical, personality, behavioral,
intelligence, aptitude, achievement, career, and couples and family assessment. In
short, Assessment for Counselors is the most comprehensive introductory assessment
text ever written specifically for professional counselors.
ACKNOWLEDGMENTS
The editor would like to thank Kami McNinch, Lauren Klein, Katie Hall, and
Megan Earl for their tireless assistance in the preparation of the original manuscript.
All of the contributing authors are to be commended for lending their expertise in
the various topical areas or on the various tests reviewed in this volume. As always,
Barry Fetterolf, publisher, and Mary Falcon, senior editor of Lahaska Press, have
been wonderfully responsive and supportive. Finally, special thanks go to three out-
side accuracy reviewers who carefully scrutinized the entire manuscript and whose
comments led to substantive improvement in the final product: Gerald Chandler,
University of Central Oklahoma; Darcy Haag Granello, The Ohio State University;
and Joshua C. Watson, Mississippi State University, Meridian.
ABOUT THE AUTHORS
THE EDITOR
Bradley T. Erford, Ph.D., is director of the School Counseling Program and a pro-
fessor in the Education Department at Loyola College in Maryland. He is the recip-
ient of the American Counseling Association's (ACA) Professional Development
Award, ACA Research Award, and the ACA Carl Perkins Government Relations
Award, and is an ACA Fellow. He has received the Association for Counselor
Education and Supervision's Robert O. Stripling Award for Excellence in Standards,
the Association for Assessment in Counseling and Education/Measurement and
Evaluation in Counseling and Development Research Award, and the Maryland
Association for Counseling and Development's Maryland Counselor of the Year,
Professional Development, Counselor Visibility, and Counselor Advocacy Awards.
His research specialization is primarily in development and technical analysis of psy-
choeducational tests and has resulted in the publication of numerous books, journal
articles, book chapters, and psychoeducational tests.
He is past chair of the American Counseling Association-Southern (U.S.)
Region; past president of the Association for Assessment in Counseling and
Education; past president of the Maryland Association for Counseling and
Development; past president of the Maryland Association for Counselor Education
and Supervision; past president of the Maryland Association for Mental Health
Counselors; and president of the Maryland Association for Measurement and
Evaluation. Dr. Erford is the past chair of ACA's Task Force on High Stakes Testing;
past chair of ACA's Task Force on Standards for Test Users; past chair of ACA's Public
Awareness and Support Committee; and past chair of ACA's Interprofessional
Committee. Dr. Erford is a licensed clinical professional counselor, licensed profes-
sional counselor, nationally certified counselor, licensed psychologist, and licensed
school psychologist. He teaches courses primarily in the areas of assessment, human
development, school counseling, and stress management.
THE CONTRIBUTING AUTHORS
Alan Basham, M.A., is a counselor educator at Eastern Washington University,
where he teaches (among other subjects) advanced appraisal for CACREP programs
in school counseling and mental health counseling. He is past president of the
Association for Spiritual, Ethical and Religious Values in Counseling and of the
Washington Counseling Association. He drafted ACA's Code of Leadership and
served on the task forces that wrote ACA's position papers on test user qualifications
and high-stakes testing. He also provides leadership and teamwork training for
Washington State's Critical Incident Management teams. He lives near, and often
roams with his dog Chinook through, the woods surrounding the Spokane River.
Jon-Michael Brasfield, M.A., NCC, is a recent graduate of Wake Forest
University's counseling program. He is a professional school counselor at R.J.
Reynolds High School in Winston-Salem, North Carolina. Jon plans to pursue fur-
ther training in educational research methods and statistics in the near future.
Carey Davis is obtaining her educational specialist degree in school psychology
from Mississippi State University. Her areas of interest include academic assessment
and intervention and group contingencies.
Dimiter Dimitrov has a Ph.D. degree in mathematics education from the
University of Sofia, Bulgaria and a Ph.D. degree in educational psychology from
Southern Illinois University, Carbondale. Currently, he is an associate professor of
educational measurement and statistics in the Graduate School of Education at
George Mason University, Fairfax, Virginia. He is also editor of the professional jour-
nal Measurement and Evaluation in Counseling and Development. Dr. Dimitrov's areas
of expertise and teaching experience include classical and modern measurement the-
ory, generalizability theory, and quantitative research methods. His recent research
interests focus on validations of cognitive operations and processes using tools of
item response theory and structural equation modeling, and on latent trait model-
ing for measurement of change.
R. Anthony Doggett, Ph.D., is an assistant professor in the school psychology
program at Mississippi State University. Dr. Doggett received his doctorate in school
psychology from the University of Southern Mississippi. He completed a predoctoral
internship and a postdoctoral fellowship in behavioral pediatrics at the Munroe-
Meyer Institute for Genetics and Rehabilitation in Omaha, Nebraska. His profes-
sional interests include applied behavior analysis, functional behavioral assessment,
behavioral consultation, parent training, instructional interventions, and behavioral
pediatrics.
Susan H. Eaves is a doctoral student in counselor education at Mississippi State
University. Her research interests center around Borderline Personality Disorder,
Conduct Disorder, and marital infidelity. She holds national certification and is a li-
censed professional counselor.
Catherine Flemming, M.A., NCC, is the director of Lay Ministry at Centenary
United Methodist Church in Winston-Salem, North Carolina. As part of her church
ministry, she places members in service opportunities appropriate for their gifts and
interests. In addition, she provides individual, marital, premarital, and group coun-
seling. She is a trained PREPARE/ENRICH administrator.
Kathleen Hall completed her master's degree in the School Counseling Program
of the Education Department at Loyola College in Maryland. She is currently a pro-
fessional school counselor in Florida.
Lauren Klein completed her master's degree in the School Counseling Program
of the Education Department at Loyola College in Maryland. She is currently a high
school counselor in Harford County Public Schools, Maryland.
Lynn Linde is an assistant professor of education and the director of Clinical
Programs in the School Counseling Program at Loyola College in Maryland. She re-
ceived a master's degree in school counseling and a doctorate in counseling from
George Washington University. Dr. Linde was previously chief of the Student
Services and Alternative Programs Branch at the Maryland State Department of
Education, the Maryland State specialist for school counseling, a local school system
counseling supervisor, a middle and high school counselor, and a special education
teacher. She has made numerous presentations on ethics and legal issues for coun-
selors, and public policy and legislation over the span of her career. Dr. Linde is the
recipient of the ACA Carl Perkins Award, the Association for Counselor Education
and Supervision's Program Supervisor Award, and the Southern Association for
Counselor Education and Supervision's Program Supervisor Award, as well as nu-
merous awards from the Maryland Association for Counseling and Development
and from the state of Maryland for her work in student services and youth suicide
prevention.
Kathleen McNinch completed her master's degree in the School Counseling
Program of the Education Department at Loyola College in Maryland. She is cur-
rently a high school counselor in Howard County Public Schools, Maryland.
Michael D. Mong received a B.S. degree in psychology from Louisiana State
University and is currently a Ph.D. student in school psychology at Mississippi State
University. His research interests include language acquisition, behavior disorders,
standardized versus nonstandardized testing procedures, and selective mutism. He is
currently employed as a behavioral specialist with Head Start programs and is primarily
responsible for student observations and assessments of both academic and behavioral
concerns.
Cheryl Moore-Thomas received her Ph.D. degree in counselor education from
the University of Maryland. She is a national certified counselor. Currently, Dr.
Moore-Thomas is an assistant professor of education in the school counseling pro-
gram at Loyola College in Maryland. Over her professional career, she has published
and presented in the areas of multicultural counseling competence, racial identity
development of children and adolescents, and accountability in school counseling
programs.
Deborah Newsome, Ph.D., LPC, NCC, is an assistant professor of counseling
at Wake Forest University, North Carolina, where she teaches courses in career coun-
seling, appraisal procedures, and statistics and supervises master's degree students in
their field experiences. In addition to teaching and supervising, Dr. Newsome coun-
sels children, adolescents, and families at a nonprofit mental health organization in
Winston-Salem, North Carolina.
Masanori Ota is a graduate student pursuing an educational specialist degree in
school psychology at Mississippi State University and is from Tokyo, Japan. Her re-
search interests are functional behavioral assessment, functional behavioral analysis,
and behavioral consultation in schools.
Carol Salisbury is a doctoral student in the Pastoral Counseling Department at
Loyola College in Maryland. Her research interests include exploring the positive as-
pects of anger as a recuperative and useful emotion.
Carl J. Sheperis, Ph.D., NCC, LPC, is an assistant professor in the Department
of Counseling, Educational Psychology, and Special Education at Mississippi State
University. Dr. Sheperis's areas of specialization include assessment and treatment of
behavioral disorders and psychopathology. He is co-owner of Behavioral Research,
Assessment, and Training Services LLC, a psychological corporation primarily serv-
ing Head Start organizations.
CHAPTER 1
Basic Assessment Concepts
by Bradley T. Erford
This initial chapter provides a whirlwind tour through the critical terminology,
purposes, and standards related to assessment. Assessment is sometimes
viewed as having a language all its own, so professional counselors are well ad-
vised to learn this language in order to communicate with other professionals, and
to advocate for, and make decisions in the best interests of, the clients and students
they serve.
ASSESSMENT AND COUNSELING
Welcome to the world of counseling: a world of wonder, mystery, and fulfillment; a
world where highly trained professional counselors attempt to understand and help
people encountering trauma and challenges or adjusting to life circumstances; a
world of clients and students (i.e., clients served by professional school counselors or
college counselors) trying to get back on track. By nature, human beings are complex
creatures made up of unique genetic structures and even more unique personal and
psychosocial experiences. In the clinical sense, these factors combine to create clients
and students who think, feel, and behave in individualistic ways — so individualistic
that no clinician, no matter how skilled, can ever predict the client's actions with
100% accuracy. In this sense, people are somewhat like puzzles: some simpler to
understand and solve than others, but all with pieces that never quite seem to fit, or
are even missing. Nevertheless, the more professional counselors know about a client
or student, the better they can understand and predict how the individual will react
under certain circumstances.
This is what assessment is all about. It is integral to the counseling process; the
professional counselor is always assessing. When a professional counselor first meets
a student or client, the process of assessment for understanding begins. This process
may be informal, formal, or somewhere in between; it may be structured, unstruc-
tured, or somewhere in between. The point is, assessment begins from the moment
the professional counselor meets the student or client: Data are collected, impres-
sions are formed, and pieces of the puzzle are collected, analyzed, and fitted.
Assessment continues as the professional counselor helps the student or client to se-
lect therapeutic objectives and treatments. Assessment culminates in an evaluation of
treatment outcomes to determine therapeutic success, or to obtain feedback indicat-
ing that other treatment methods are needed. Assessment is counseling, and coun-
seling is assessment. Indeed, assessment is integral to every stage in the counseling
process (Whiston, 2005).
We emphasize the interrelationship of assessment and counseling on the very
first pages of this book because students new to the profession often show little ex-
citement for a course in measurement or assessment. Unfortunately, counselor-edu-
cators who teach counseling assessment sometimes report that counseling students
rate it low (close to research and statistics courses) on the "exciting courses scale." So
please make the connection between assessment and counseling early in the course
and your career: Assessment is the quickest way to understand students and clients.
The better one understands clients or students, the better and faster one will be able
to help them. Assessment saves the client time, money, and (most importantly) so-
cial and emotional pain. The more efficient a professional counselor becomes in
knowing a student or client, the more effective and respected the counselor will
become.
The purpose of this book is to help professional counselors to understand the
most efficient and effective means for discovering, analyzing, and fitting the puzzle
pieces together to understand and help students and clients. The reader will no
doubt discover that some of the methods described are faster, more effective, technically
superior, and personally more appealing than others. There is wonderful
diversity in how the puzzle pieces can be acquired and configured. Indeed, many cli-
nicians assessing the same client through different methods may arrive at varying
conclusions because of personal perspectives. Thus, in many ways this course, at its
core, is about who you will become as a professional counselor. How will you discern
the pieces of your developing professional identity, your strengths and weaknesses?
How will you cope with the challenging coursework and its applications to clinical
settings? What cognitive abilities, behavioral patterns, and personality dispositions
will become barriers? Which will provide the resiliency needed to succeed? Let the as-
sessment begin!
WHAT IS ASSESSMENT?
For all intents and purposes, and especially from a professional point of view, the
terms assessment and appraisal are synonymous. In this book, we use the term assessment
(or psychological assessment) consistently. Assessment was defined in
Basic Assessment Concepts 3
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999, p. 3)
as "a process that integrates test information with information from other sources
(e.g., information from the individual's social, educational, employment, or psycho-
logical history)." Note that the preceding definition distinguishes assessment from test,
instrument, or inventory in that assessment includes testing as only part of its process.
Many authoritative sources differ slightly in their definitions of what constitutes a psychological
test. An often-cited definition of a psychological test is that provided by
Anastasi and Urbina (1997, p. 4): "an objective and standardized measure of a sample
of behavior." Assessment integrates tests in a way that helps a professional counselor
to better understand clients and make decisions in their best interests.
Often overlooked, but implicit in the foregoing definition of a psychological test,
is the word measure. Measure implies that a quantity of some construct or concept
will be determined: how much anxiety, intelligence, math skill, introversion, suici-
dal ideation, alcohol use, artistic interest, antisocial tendency, etc. The purpose of an
assessment is to give the professional counselor valuable information regarding "how
much" of a given characteristic the student or client possesses. Knowing how much
helps to predict client behaviors, strengths, and weaknesses, thus facilitating impor-
tant treatment or life decisions.
Second, assessments measure a sample of behavior. Behavior is what humans do,
whether the "doing" be overt physical acts, emotional or affective displays, or cogni-
tions that are conveyed to others. Sampling is key to understanding any psycholog-
ical phenomenon. If a professional school counselor observes a student's activity level
during different activities (e.g., physical education class, independent in-seat class
work time, lecture presentation, lunchtime), these different samples of behavior will
often lead to different observable data and subsequent conclusions. Likewise, in a
clinical setting, professional counselors usually see clients only under fairly specific
conditions (i.e., in an office), again leading to a specific sample of behavior. Samples
of behavior assessed under various conditions are critical to understanding the stu-
dent or client. These measures and observations allow professional counselors to
make inferences about how clients will behave or perform under normal and unusual
circumstances. Such inferences are indispensable to the client's insight and self-un-
derstanding, as well as to the insight of the professional counselor charged with the
responsibility of helping the client to develop goals and an effective treatment plan.
When assessing a sample of behavior, it is important that the sample faithfully
represent the total domain of behavior under study. For example, when assessing
single-plus-single-digit addition without regrouping (e.g., 4 + 3, not 8 + 7), the test
developer needs to determine how many problems of this type are required to assess a
student's mastery of the behavioral domain — that is, how many of the 57 possible
single-plus-single-digit-addition-without-regrouping problems would a child need
to successfully perform before the examiner could have confidence the student had
mastered this type of addition? One? Two? Five? All 57? Efficient sampling of behav-
ior is crucial to effective assessment.
Sometimes the professional counselor is also interested in the perspectives of
others (e.g., teacher, parent, spouse) who have observed a sample of the client's behaviors
under various conditions. These more indirect methods help professional counselors
to gain insights into student or client behavior in other environments not
easily accessed by the clinician. The common factor here is that the data collection,
analysis, and judgment of professional counselors are influenced by tangible obser-
vations of behavioral samples. But what if two professional counselors observe the
same sample of behavior only to reach different conclusions?
As a final piece of the definition of a psychological test, the terms standardized
and objective are meant to work hand in hand to address counselor judgment as a
potential source of error. Standardization refers to the systematic collection and
analysis of data. Cronbach (1984) provided a comprehensive definition of standard-
ization when he referred to a standardized test as one in which exact devices, mate-
rials, verbal (or nonverbal) prompts, and scoring procedures have been fixed so that
scores collected at various places and times and by different examiners are fully
equivalent. Objective tests have scoring or observation criteria structured to such an
extent that different examiners (e.g., trained judges, interviewers) have a very high
likelihood of independently agreeing on a client's score on a given sample of per-
formance behavior. To be sure, psychological assessments have varying degrees of
standardization and objectivity. For example, on a multiple-choice test of written ex-
pression for a 5th-grade student, different examiners may easily agree that answer
choice b is correct, but when asked to determine the maturity of written expression
in this student's essay, less agreement is likely, because the scoring of essays often in-
volves more subjective (less objective) scoring criteria. Of course, the more standard-
ized the written-expression assessment procedures and the more objective the scor-
ing procedures, the greater is the likelihood of examiner agreement.
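One simple way to make the idea of examiner agreement concrete is to compute the proportion of items on which two examiners assign the same score. The sketch below is illustrative only: the function name percent_agreement and all of the ratings are hypothetical, and simple percent agreement is just one of several possible agreement indices, not a procedure prescribed in this chapter.

```python
def percent_agreement(scores_a, scores_b):
    """Return the proportion of items on which two examiners gave the same score."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("Both examiners must score the same nonempty set of items.")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Hypothetical multiple-choice scoring: both examiners apply the same answer key.
multiple_choice_a = ["b", "a", "d", "c", "b"]
multiple_choice_b = ["b", "a", "d", "c", "b"]

# Hypothetical essay ratings on a 1-4 maturity scale: judgment enters the scoring.
essay_a = [3, 2, 4, 3, 1]
essay_b = [3, 3, 4, 2, 1]

print(percent_agreement(multiple_choice_a, multiple_choice_b))  # full agreement
print(percent_agreement(essay_a, essay_b))  # agreement on 3 of 5 essays
```

In this made-up data the objectively scored items yield complete agreement, while the essay ratings agree on only three of five essays, which mirrors the contrast drawn above between more and less objective scoring criteria.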
Test developers strive to develop high-quality, accurate standardized and objec-
tive tests (samples of behavior), and professional counselors strive to administer these
instruments according to standardized procedures and to score each according to ob-
jective criteria. Sounds like a perfect way to collect information about, and under-
stand, a client, right? Unfortunately, even the best standardized and objective psy-
chological assessments can lead to inaccurate conclusions. For example, the Reynolds
Adolescent Depression Scale — Second Edition (RADS-2) (Reynolds, 2002), which will
be discussed in Chapter 7, is incredibly easy to administer and score using the stan-
dardized procedures, and is very objective. However, if students or clients do not
want a professional counselor to think they are depressed, they need only to "fake
good" on their test responses, and the test score will not indicate significant levels of
depression. Thus, an unsuspecting professional counselor may not reach the appro-
priate conclusion and may therefore not develop the most effective treatment plan
for the client.
Test developers and assessment specialists have developed countermeasures to
help detect dishonesty and inaccurate responses — for example, some clinical, per-
sonality, and behavioral inventories include validity scales. Also, professional coun-
selors are trained to understand that all clients present information from their own
point of view, and thus the counselor will seek validation of client perceptions from
various sources of information (e.g., tests, inventories, rating scales, observations,
interviews, questionnaires) and respondents (e.g., spouses, parents, teachers, peers) as
possible and appropriate. These issues involve the reliability and validity of scores
and the decisions based on those scores and will be addressed throughout the re-
mainder of this book. But prior to entering that realm, one must understand the
multiple purposes for which professional counselors use assessment.
The Purpose of Assessment
At least four purposes of assessment have been identified in the extant literature
(Erford, 2006; Gregory, 1999; Sattler, 2001): screening, diagnosis, treatment plan-
ning and goal identification, and progress evaluation.
Screening
Screening is a quick procedure, usually involving a single measure, done for the pur-
pose of determining whether deeper diagnostic assessment is necessary or warranted.
A screening process is by no means comprehensive, and the instruments used for this
purpose are sometimes held to lower standards of psychometric accuracy, although
this is not always a desirable practice. Accuracy in screening is just as critical as ac-
curacy in diagnosis because both procedures, done correctly, save students and clients
emotional pain, time, and money. In all instances, professional counselors strive to
use procedures that will maximize accurate decisions and minimize inaccurate deci-
sions. For example, when conducting a screening procedure for depression, a profes-
sional counselor will frequently use a self-report inventory of depression with a pre-
determined cutoff to determine clinical significance. A client scoring above that
cutoff score would be referred for further (diagnostic) assessment. Or, when a pro-
fessional school counselor conducts a screening to determine which students are at
risk for reading difficulties, students scoring below the predetermined level (perhaps
< 25th percentile) will subsequently be referred for deeper-level assessments to fur-
ther diagnose any reading difficulties and develop an effective treatment plan.
Screening is an efficient first step in an assessment process because not every student
or client needs diagnostic assessment. Diagnostic assessment tends to be more ex-
pensive and more time consuming than screening and requires a greater level of skill
to conduct, but there is a worthwhile trade-off in terms of efficiency and accuracy.
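The cutoff logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the student names and percentile ranks are invented, the function name refer_for_diagnostic is hypothetical, and the 25th-percentile cutoff simply echoes the "perhaps" example in the text rather than any recommended standard.

```python
def refer_for_diagnostic(percentile_ranks, cutoff=25):
    """Return the names of students whose percentile rank falls below the cutoff."""
    return [name for name, rank in percentile_ranks.items() if rank < cutoff]


# Hypothetical reading-screening results (percentile ranks).
reading_screening = {
    "Student A": 62,
    "Student B": 18,
    "Student C": 25,
    "Student D": 9,
}

referred = refer_for_diagnostic(reading_screening)
print(referred)  # students flagged for deeper-level assessment
```

Note that a student scoring exactly at the cutoff (Student C here) is not referred, because the rule in this sketch is "below the 25th percentile"; where to draw that line is a professional judgment, not something the code decides.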
Anastasi and Urbina (1997) referred to accurate identification decisions (some-
times called hits) as true positives (clients who have a condition are identified by the
screening test as having the condition) and true negatives (clients who do not have
the condition are identified by the screening test as not having the condition).
Inaccurate decisions (sometimes called misses) were referred to as false positives
(clients who do not really have the condition are identified as having it) and false neg-
atives (clients who really do have the condition are not identified as having it). (A
graphic of these concepts can be found in Figure 4.2.) In screening procedures, pro-
fessional counselors are most concerned with maximizing hits and minimizing
misses, particularly false negatives, because these clients have the problem of concern
but do not receive further diagnostic assessment to address the problem. They "slip
through the cracks."
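The four decision outcomes can be tallied mechanically once each client's true status and screening result are known. The following sketch assumes hypothetical data and hypothetical function names (classify_decision, tally_decisions); it is meant only to illustrate the definitions of hits and misses given above.

```python
from collections import Counter


def classify_decision(has_condition, flagged_by_screen):
    """Label one screening decision using the four outcome categories."""
    if flagged_by_screen:
        return "true positive" if has_condition else "false positive"
    return "false negative" if has_condition else "true negative"


def tally_decisions(cases):
    """Count outcomes for a list of (has_condition, flagged_by_screen) pairs."""
    return Counter(classify_decision(has, flagged) for has, flagged in cases)


# Hypothetical clients: (actually has the condition, identified by the screening test).
cases = [
    (True, True),
    (True, False),
    (False, False),
    (False, True),
    (True, True),
    (False, False),
]

counts = tally_decisions(cases)
hits = counts["true positive"] + counts["true negative"]
misses = counts["false positive"] + counts["false negative"]
print(counts)
print("hits:", hits, "misses:", misses)
```

In this invented sample, the single false negative is the client who has the condition but would not be referred for diagnostic assessment, which is the outcome screening procedures are most concerned with minimizing.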
Diagnosis
Diagnosis entails "a detailed analysis of an individual's strengths and weaknesses,
with the general goal of arriving at a classification decision" (Erford, 2006, p. 2).
Diagnosis always involves more than one measure and often includes a battery of
tests. Such a battery is usually composed of a series of tests that are integrated to yield
specific information or identification decisions. For example, the Wechsler Intelligence
Scale for Children — Fourth Edition (WISC-IV) (Wechsler, 2001a) and the Woodcock-
Johnson: Tests of Achievement — Third Edition (WJ-III ACH) (Woodcock, Mather, &
McGrew, 2001) are frequently used in conjunction to determine the existence and
extent of learning disabilities in school-aged children. In some cases, diagnostic as-
sessment can be used to enhance normal development, as when a client presents for
career counseling and the professional counselor wants to assess the individual's in-
terests, competencies, values, and interpersonal strengths and weaknesses to help the
person to arrive at an acceptable career goal, educational plan, or vocational strategy.
Similarly, in premarital counseling, which is currently becoming more popular, mar-
riage and family counselors use diagnostic assessments to aid in leading couples to in-
terpersonal and intrapersonal insights that will strengthen the bonds of the relation-
ship and help the couple to predict and navigate the challenges of marriage and
family life.
In general, diagnosis in counseling can be construed as trying to understand
what is happening with a client, what the problem is, what causes or maintains the
problem, and what strengths the client may harness to overcome the problem.
However, in clinical contexts, diagnostic assessment has classification or diagnosis as
its goal. This process generally requires the use of a classification system, and most
professional counselors in clinical practice use the Diagnostic and Statistical Manual
of Mental Disorders — Fourth Edition — Text Revision (DSM-IV-TR) (APA, 2000). The
DSM-IV-TR provides clinicians from all mental health professions (e.g., professional
counselors, psychiatrists, psychologists, social workers) with a standardized set of cri-
teria upon which to base a diagnosis (i.e., a clinical description) of a client's present-
ing condition. Such a system facilitates accurate, reliable decisions and helps to in-
form the professional counselor of appropriate treatment strategies. The DSM is to
mental health practitioners what the International Classification of Diseases (ICD) is
to physicians and to mental health workers in most other countries that do not use
the DSM. However, there is disagreement in the counseling profession regarding the
helpfulness of diagnosis to clients, as it frequently results in labeling of a client that
may lead to a plethora of unintended and undesirable consequences (see Sattler,
2001).
Treatment Planning and Goal Identification
Helping clients and students is what counseling is all about. Assessment helps clients
and students to understand where they are and where they want to go, a key facet of
developing a client's goals and objectives for counseling. A counseling process that
does not have well-defined and measurable goals has no focus or direction, nor does
it allow the client and professional counselor to know when the goals of counseling
have been achieved. Thus, a primary purpose of assessment in counseling is to help
establish counseling goals, often through a combination of assessment methods, in-
cluding interviewing and standardized testing.
In addition, the information garnered from an initial assessment can be help-
ful in planning a client's treatment. Frequently, student or client strengths, weak-
nesses, challenges, and resiliency factors and resources are confirmed or better un-
derstood through assessment procedures. "Treatment planning must flow logically
from assessment results, fit the given environmental context of the client, and be
individualized to mesh with the client's strengths and weaknesses" (Erford, 2006,
p. 3). After the client and professional counselor agree on the goals and objectives
to be pursued through counseling, the counselor must consider the most effective
treatment options to obtain the desired outcomes. Thus a primary focus of the ini-
tial assessment is to uncover student or client strengths and resources in order to
plan for the most effective treatment. Of course, counseling would be incredibly
simplified if specific test scores or client responses directly implied specific treat-
ments or interventions. Unfortunately, the complexity of client problems rarely
leads to such simplistic remedies. Important sources of information to help pro-
fessional counselors with treatment planning are the outcomes research literature
found in professional journals and compendiums of this research (e.g., Sexton,
Whiston, Bleuer, & Walz, 1997; Whiston, 2003a). As a final note, treatment plan-
ning usually gets easier with experience and, to some, may be more akin to art than
science. In some employment settings, professional counselors often approach
treatment of client problems from a theoretical paradigm that they are proficient
in or comfortable with. When it comes to treatment planning, assessment often
informs the professional counselor's practice.
Progress Evaluation
Once goals for counseling have been agreed on and treatment has begun, it is a pro-
fessional counselor's responsibility to ensure that the treatment is helpful to a client
(and, even more important, not harmful). This process is referred to as progress eval-
uation or outcomes evaluation and, unfortunately, is frequently minimized in, or elim-
inated from, a treatment regimen. Failure to periodically evaluate treatment progress
is unethical and unprofessional, not to mention inefficient. If a treatment is having
no positive effects and a professional counselor is not assessing its impact, the client
is wasting time and money while continuing to experience the discomfort and emo-
tional pain that brought the client to counseling. Tests and inventories can be very
helpful aids in assessing treatment outcomes.
The first step in evaluating progress is to establish a baseline measure of the stu-
dent's or client's condition. This evaluation is generally done during an intake inter-
view and initial assessment but can also be done at the time a counseling goal is es-
tablished. Progress evaluation can be done formally or informally, subjectively or
objectively. For example, an informal, subjective method would be to ask clients to
rate their own feelings of anxiety (disorganization, depression, distractibility, etc.) on
a scale from 0 to 10, with 0 being the total absence of anxiety and 10 being intense
anxiety. If the client self-rates as a 9, this score becomes a baseline for comparison in
future similar assessments, perhaps conducted at the beginning of each session over
the following weeks. A more formal, objective method might involve a test such as
the Beck Anxiety Inventory (BAI) (Beck, 1993). The client's initial score would serve
as the baseline, and the counselor would periodically readminister the BAI to assess
whether the client's anxious symptoms have declined. Furthermore, given the client's
baseline score, it is possible to establish a goal of a certain score on the BAI as a tar-
get to determine when the anxiety has subsided to a substantial enough degree that
termination of counseling can be considered.
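The baseline-and-target logic just described can be sketched as a small routine. Everything in it is an assumption made for illustration: the function name progress_report, the target score of 3, and the session-by-session self-ratings are hypothetical, and the sketch assumes a measure on which lower scores indicate less anxiety.

```python
def progress_report(baseline, target, later_scores):
    """Compare each readministration with the baseline and the agreed target.

    Assumes lower scores mean fewer symptoms (as with an anxiety self-rating).
    """
    report = []
    for session, score in enumerate(later_scores, start=1):
        report.append({
            "session": session,
            "score": score,
            "change_from_baseline": score - baseline,
            "target_reached": score <= target,
        })
    return report


# Hypothetical client: baseline self-rating of 9 on the 0-to-10 scale, target of 3.
baseline_rating = 9
target_rating = 3
weekly_ratings = [8, 6, 5, 3]

report = progress_report(baseline_rating, target_rating, weekly_ratings)
for entry in report:
    print(entry)
```

A record like this does not decide when counseling should end; it only gives the client and counselor a shared, concrete point of comparison when termination is being considered.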
The four purposes reviewed above provide a framework for the general use of
assessment, but assessment is best applied to the practical aspects of counseling when
fully integrated into the counseling process. The next section presents this fully in-
tegrated model.
HOW IS ASSESSMENT USED IN COUNSELING?
As mentioned previously, assessment is counseling, and counseling is assessment.
Assessment is totally integrated into the counseling process. Whiston (2005) re-
ported that most counseling processes delineate at least the following four steps:
(1) assessing client problems, (2) conceptualizing and defining client problems,
(3) selecting and implementing effective treatments, and (4) evaluating counsel-
ing effectiveness.
In the first stage, professional counselors engage in screening and diagnostic assess-
ment procedures to understand student or client concerns, issues, and problems. It is par-
ticularly important that professional counselors conduct a comprehensive interview
and administer appropriate tests and inventories to assess for broad functioning in
the interest of "leaving no stone unturned." Incomplete assessments lead to incom-
plete and ineffective treatment plans. It is best practice to ask these broad questions
and conduct formalized assessments in the beginning of counseling rather than not
ask, thus risking an underestimation of the scope of a problem or missing it alto-
gether. The type of formal assessment used is often dependent on the nature of the
setting and on the training and experience of the professional counselor. Elmore,
Ekstrom, Diamond, and Whittaker (1993) reported that nearly three-quarters of the
professional counselors surveyed indicated that assessments and tests were either im-
portant or very important in their work setting. Predictably, the work of professional
school counselors most frequently involved contact with achievement, intelligence,
aptitude, and career or vocational measures (Elmore et al.; Giordano & Schweibert,
1997), while the work of community and mental health counselors most frequently
involved contact with clinical diagnostic, personality, intelligence, and vocational in-
ventories (Bubenzer, Zimpfer, & Mahrle, 1990; Frauenhoffer, Ross, Gfeller,
Searight, & Piotrowski, 1998).
During the second stage of the counseling process, conceptualizing and defining
problems, incomplete information will again limit a professional counselor's effective-
ness (Mohr, 1995). Professional counselors must continuously assess their under-
standing of client concerns during the process of constructing a working definition
of a client's problem. Counselors at this point must reciprocally rule in and rule out
diagnostic categorizations and determine the frequency and severity of client con-
cerns. Again, attention to comprehensiveness and detail at this stage will lead to a
more effective treatment outcome.
Treatment selection and implementation rely on an analysis of the results of assessments
conducted during the first two stages of the counseling process. Again, the
professional counselor questions the comprehensiveness of previous assessments and
conducts additional assessment as required. Most importantly, process evaluation begins
at this time; it is the duty of the professional counselor to continuously assess the
impact of the treatment strategies implemented. In evaluation parlance, this is re-
ferred to as formative assessment and allows for midcourse adjustments in treatment
implementation to provide the most effective treatment possible. Formative assess-
ment helps determine whether or not progress is being made toward treatment goals.
Finally, during the fourth stage of counseling, evaluation, determinations must
be made regarding the overall effectiveness of treatment — a process that evaluation
specialists refer to as summative evaluation or outcomes assessment. One of the reasons
a baseline measurement is so highly recommended in counseling is that it provides
a starting point for treatment and evaluation. Evaluation at the end of counseling
provides another point of comparison that allows professional counselors to demonstrate
to clients, students, and other stakeholders (e.g., employers, parents, insurance
companies) that substantive, measurable gains have been noted, counseling goals
have been met, and counseling services have been effective.
By now the meaning of the statement "assessment is counseling, and counseling
is assessment" should be amply clear. Indeed, there was a time, during the 1930s and
1940s, when assessment and counseling were viewed as synonymous (Hood &
Johnson, 2002). Assessment is an essential, integrated part of an effective counseling
process.
ASSESSMENT COMPETENCE
AND PROFESSIONAL COUNSELORS
Professional counselors have a professional responsibility to become competent in
the effective use of assessment procedures. A number of professional associations,
scholars, and accreditation organizations have taken the lead in specifying what pro-
fessional counselors need to know and be able to do in order to demonstrate assess-
ment competence, while others have focused on the question of why assessment
competence is intrinsic to effective counseling. This section explores the why, while
the section that follows focuses on the what (i.e., the training standards for profes-
sional counselors).
Whiston (2005) provided six reasons why professional counselors must become
proficient in the use of assessment procedures. Assessment proficiency is a profes-
sional expectation. The American Counseling Association's Code of Ethics (ACA,
2005a) dedicated an entire section to an explanation of ethical uses of tests, and the
Council for Accreditation of Counseling and Related Educational Programs
(CACREP), an organization that accredits university counselor education programs,
dedicated one of its eight core curricular areas to the study of assessment. As a result,
the public expects professional counselors to be proficient in the use and interpreta-
tion of tests. In fact, the use of formalized assessment can frequently lead to a per-
ception of enhanced credibility on the part of clients (Goodyear, 1990; Sexton et al.,
1997). Efficient identification of problems usually results from the competent use of
tests (Anastasi & Urbina, 1997; Duckworth, 1990), and this efficiency is normally
increased when professional counselors use multimethod assessment batteries (Meyer
et al., 2001) rather than general interviewing procedures. Likewise, multimethod
and multirespondent assessment methods usually help professional counselors un-
cover diverse, even unique, client information (Meyer et al.) and even lead to client or
student insight and learning (Campbell, 2000; Sax, 1997). In addition, assessment
helps identify strengths and weaknesses of clients and students, and professional coun-
selors use this information to facilitate decision making (Drummond, 2000; Sax).
Frequently, clients who "see" objective testing results documenting their interper-
sonal and intrapersonal strengths and weaknesses develop the motivation to make
life decisions and to adjust their life course accordingly. Insightful realizations and
details of conversations that occur during the course of counseling are sometimes
forgotten or minimized as time goes on. Assessment results provide a concrete, visual
record that can be referred to time and again to bring the counseling back on course
and to show measurable progress. Now that we have addressed the why of assessment
in counseling, let us turn our attention to the "what."
Training Standards for Professional Counselors
The Council for Accreditation of Counseling and Related Educational Programs
(CACREP) is the national organization, affiliated with the American Counseling
Association, that accredits universities with counseling programs meeting rigorous
professional and curricular standards. CACREP offers accreditation for master's-level
specialty counseling programs in the areas of career counseling; college counseling;
community counseling; marital, couple, and family counseling and therapy; mental
health counseling; school counseling; student affairs counseling; and doctoral pro-
grams in counselor education and supervision. The specific standard addressing the
curricular requirements for assessment is Section II.K.7, found in Table 1.1. The
reader will note that these standards align very well with the content of this book.
Professional Counseling Organizations and Assessment
Numerous professional counseling organizations and licensing or certification
boards exist to promote best practices and develop policies and procedures that ad-
vocate for client or student needs and protect the public from harm. The American
Counseling Association (www.counseling.org) serves as the parent or umbrella or-
ganization for all professional counselors and various professional counselor special-
ties in the United States. In this context, counseling specialties (called divisions
within ACA's structure) are defined as counselor practitioner entities that have a
guild or occupational presence in the counseling profession and job market.
Table 1.1 Assessment curriculum standard from section II.K.7
of the CACREP 2001 Accreditation Manual
7. ASSESSMENT — studies that provide an understanding of individual and group
approaches to assessment and evaluation, including all of the following:
a. historical perspectives concerning the nature and meaning of assessment;
b. basic concepts of standardized and nonstandardized testing and other assessment
techniques including norm-referenced and criterion-referenced assessment,
environmental assessment, performance assessment, individual and group test and
inventory methods, behavioral observations, and computer-managed and computer-
assisted methods;
c. statistical concepts, including scales of measurement, measures of central tendency,
indices of variability, shapes and types of distributions, and correlations;
d. reliability (i.e., theory of measurement error, models of reliability, and the use of
reliability information);
e. validity (i.e., evidence of validity, types of validity, and the relationship between
reliability and validity);
f. age, gender, sexual orientation, ethnicity, language, disability, culture, spirituality, and
other factors related to the assessment and evaluation of individuals, groups, and
specific populations;
g. strategies for selecting, administering, and interpreting assessment and evaluation
instruments and techniques in counseling;
h. an understanding of general principles and methods of case conceptualization,
assessment, and/or diagnoses of mental and emotional status; and ethical and legal
considerations.
The following are among the current 19 ACA divisions (specialty areas) with special
interests in the professional practice of assessment:
■ American College Counseling Association (ACCA; www.collegecounseling.org)
■ American Mental Health Counselors Association (AMHCA; www.amhca.org)
■ American Rehabilitation Counseling Association (ARCA; www.arcaweb.org)
■ American School Counselor Association (ASCA; www.schoolcounselor.org)
■ Association for Assessment in Counseling and Education (AACE;
http://aace.ncat.edu)
■ Association for Counselor Education and Supervision (ACES;
www.acesonline.net)
■ International Association of Addiction and Offender Counselors (IAAOC;
www.iaaoc.org)
■ International Association of Marriage and Family Counselors (IAMFC;
www.iamfc.com)
■ National Career Development Association (NCDA; www.ncda.org)
All of these organizations' websites, mailing addresses, and phone number con-
tacts can be located through ACA's main website, www.counseling.org.
Think About It 1.1 Visit the ACA website at www.counseling.org or
link to any of the websites individually listed above. Which professional or-
ganizations offer services and products helpful to your development as a pro-
fessional counselor? Which are you interested in joining?
Another major influence in the counseling world is the American Psychological
Association (APA; www.apa.org). APA serves as an umbrella organization for many
other divisions dedicated to serving the public and the agenda of practitioner psy-
chologists, some of whom are referred to as counseling psychologists. APA divisions
serving specialties similar to ACA divisions include:
■ Division 17 — Society of Counseling Psychology (www.div17.org)
■ Division 22 — Rehabilitation Psychology (www.apa.org/divisions/div22)
■ Division 28 — Psychopharmacology and Substance Abuse (www.apa.org
/divisions/div28)
■ Division 29 — Psychotherapy (www.divisionofpsychotherapy.org)
■ Division 42 — Psychologists in Independent Practice (www.division42.org)
■ Division 43 — Family Psychology (www.apa.org/divisions/div43)
■ Division 50 — Addictions (www.apa.org/divisions/div50)
A number of additional national associations exist that are not affiliated with
ACA or APA, but which have substantial counselor and therapist memberships and
legislative agendas, including:
■ American Association for Marriage and Family Therapy (AAMFT; www
.aamft.org)
■ Association for Addiction Professionals (NAADAC; www.naadac.org)
■ National Association of Social Workers (NASW; www.NASWDC.org)
Finally, all states have licensing boards that regulate the practice of psychology
and/or counseling within their borders. Because laws and regulations vary substan-
tially from state to state, necessary qualifications and what professional counselors
can do when practicing within these states also vary. Add to this the turf wars be-
tween psychologist and professional counselor licensing boards and professional as-
sociations that flare up in various states around the country, and the whole issue of
which assessments and tests professional counselors can administer and interpret,
where, and when can become quite confusing. It is unlikely that this situation will
change anytime soon. It is incumbent upon professional counselors to stay abreast of
practice developments within their state.
Assessment Training Standards
The area of psychological assessment is perhaps among the most contentious and
hard-fought battlegrounds in counseling. As this book goes to press, battles between
psychologists and professional counselors over the right to use psychological tests in
clinical practice are being fought in California, Indiana, Illinois, Louisiana, and
Maryland. Organizations, including the ACA, AACE, Association of Test Publishers
(ATP; www.testpublishers.org), and Fair Access Coalition on Testing (FACT;
www.fairaccess.org), are leading a national effort to allow qualified psychologists and
counselors access to psychological tests in clinical practice. An ongoing stumbling
block to access has been forging agreement on the term qualified. ACA recently de-
veloped a position statement on test user qualifications with the goal that the docu-
ment would serve as a consensus-building device (see Box 1.1).
Box 1.1 ACA Policy Statement on Test User Qualifications
Standards for Qualifications of Test Users
American Counseling Association
The professional qualifications essential to the use of tests in counseling arise
from a synthesis of knowledge, skills, and ethics. While some professional
groups are seeking to control and restrict the use of psychological tests,* the
American Counseling Association believes firmly that one's right to use tests
in counseling practice is directly related to competence. This competence is
achieved through education, training, and experience in the field of testing.
Thus, professional counselors with a master's degree or higher and appropri-
ate coursework in appraisal/assessment, supervision, and experience are qual-
ified to use objective tests. With additional training and experience, profes-
sional counselors are also able to administer projective tests, individual
intelligence tests, and clinical diagnostic tests. This training may occur in
graduate school, in post-grad professional development instruction, or in su-
pervised training in use of the test. Professional counselors are qualified to
use tests and assessments in counseling practice to the degree that they pos-
sess the appropriate knowledge and skills, including the following areas:
1. Skill in practice and knowledge of theory relevant to the testing context
and type of counseling specialty.
Assessment and testing must be integrated into the context of the theory and
knowledge of a specialty area, not as a separate act, role, or entity. In addi-
tion, professional counselors should be skilled in treatment practice with the
population being served.
2. A thorough understanding of testing theory, techniques of test construc-
tion, and test reliability and validity.
Included in this knowledge base are methods of item selection, theories of
human nature that underlie a given test, reliability, and validity. Knowledge
of reliability includes, at a minimum: methods by which it is determined,
such as domain sampling, test-retest, parallel forms, split-half, and inter-item
consistency; the strengths and limitations of each of these methods; the stan-
dard error of measurement, which indicates how accurately a person's test
score reflects their true score of the trait being measured; and true score the-
ory, which defines a test score as an estimate of what is true. Knowledge of
validity includes, at a minimum: types of validity, including content, crite-
rion-related (both predictive and concurrent), and construct; methods of as-
sessing each type of validity, including the use of correlation; and the mean-
ing and significance of standard error of estimate.
*For the purpose of this document, terms such as inventory, instrument, measure, and scale are en-
compassed by the terms test or assessment.
3. A working knowledge of sampling techniques, norms, and descriptive,
correlational, and predictive statistics.
Important topics in sampling include sample size, sampling techniques, and
the relationship between sampling and test accuracy. A working knowledge of
descriptive statistics includes, at a minimum: probability theory; measures of
central tendency; multi-modal and skewed distributions; measures of variabil-
ity, including variance and standard deviation; and standard scores, including
deviation IQ's, z-scores, T scores, percentile ranks, stanines/stens, normal
curve equivalents, grade- and age-equivalents. Knowledge of correlation and
prediction includes, at a minimum: the principle of least squares; the direc-
tion and magnitude of relationship between two sets of scores; deriving a re-
gression equation; the relationship between regression and correlation; and
the most common procedures and formulas used to calculate correlations.
4. Ability to review, select, and administer tests appropriate for clients or
students and the context of the counseling practice.
Professional counselors using tests should be able to describe the purpose
and use of different types of tests, including the most widely used tests for
their setting and purposes. Professional counselors use their understanding of
sampling, norms, test construction, validity, and reliability to accurately as-
sess the strengths, limitations, and appropriate applications of a test for the
clients being served. Professional counselors using tests also should be aware
of the potential for error when relying on computer printouts of test inter-
pretation. For accuracy of interpretation, technological resources must be
augmented by a counselor's firsthand knowledge of the client and the test-
taking context.
5. Skill in administration of tests and interpretation of test scores.
Competent test users implement appropriate and standardized administra-
tion procedures. This requirement enables professional counselors to provide
consultation and training to others who assist with test administration and
scoring. In addition to standardized procedures, test users provide testing en-
vironments that are comfortable and free of distraction. Skilled interpreta-
tion requires a strong working knowledge of the theory underlying the test, the
test's purpose, statistical meaning of test scores, and norms used in test con-
struction. Skilled interpretation also requires an understanding of the simi-
larities and differences between the client or student and the norm samples
used in test construction. Finally, it is essential that clear and accurate com-
munication of test score meaning in oral or written form to clients, students,
or appropriate others be provided.
6. Knowledge of the impact of diversity on testing accuracy, including age,
gender, ethnicity, race, disability, and linguistic differences.
Professional counselors using tests should be committed to fairness in every
aspect of testing. Information gained and decisions made about the client or
student are valid only to the degree that the test accurately and fairly assesses
the client's or student's characteristics. Test selection and interpretation are
done with an awareness of the degree to which items may be culturally bi-
ased or the norming sample not reflective or inclusive of the client's or stu-
dent's diversity. Test users understand that age and physical disability differ-
ences may impact the client's ability to perceive and respond to test items.
Test scores are interpreted in light of the cultural, ethnic, disability, or lin-
guistic factors that may impact an individual's score. These include visual,
auditory, and mobility disabilities that may require appropriate accommoda-
tion in test administration and scoring. Test users understand that certain
types of norms and test score interpretation may be inappropriate, depend-
ing on the nature and purpose of the testing.
7. Knowledge and skill in the professionally responsible use of assessment
and evaluation practice.
Professional counselors who use tests act in accordance with the ACA's Code
of Ethics and Standards of Practice (2005a); Responsibilities of Users of
Standardized Tests — Third Edition (RUST-3) (AACE, 2003a); Code of Fair
Testing Practices in Education (JCTP, 2002); Rights and Responsibilities of Test
Takers: Guidelines and Expectations (JCTP, 2000); and Standards for
Educational and Psychological Testing (AERA/APA/NCME, 1999). In addi-
tion, professional school counselors act in accordance with the American
School Counselor Association's (ASCA's) Ethical Standards for School
Counselors (ASCA, 1992). Test users should understand the legal and ethical
principles and practices regarding test security, using copyrighted materials,
and unsupervised use of assessment instruments that are not intended for
self-administration. When using and supervising the use of tests, qualified
test users demonstrate an acute understanding of the paramount importance
of the well-being of clients and the confidentiality of test scores. Test users
seek on-going educational and training opportunities to maintain compe-
tence and acquire new skills in assessment and evaluation.
References
American Counseling Association. (2005a). Code of Ethics and Standards of
Practice. Alexandria, VA: Author.
American Educational Research Association, American Psychological
Association, National Council on Measurement in Education. (1999).
Standards for Educational and Psychological Testing. Washington, DC:
American Educational Research Association.
American School Counselor Association. (1992). Ethical Standards for School
Counselors. Alexandria, VA: Author.
Association for Assessment in Counseling. (2003a). Responsibilities of Users of
Standardized Tests (RUST). Alexandria, VA: Author.
Joint Committee on Testing Practices. (2000). Rights and Responsibilities of
Test Takers: Guidelines and Expectations. Washington, DC: Author.
Joint Committee on Testing Practices. (2002). Code of Fair Testing Practices
in Education. Washington, DC: Author.
Note: Reprinted with permission from the American Counseling Association. No further reproduc-
tion authorized without written permission from the American Counseling Association.
Note: Approved by the American Counseling Association (ACA) Governing Council in March 2003,
Anaheim, CA. The Standards for Test Use Task Force was an ad hoc committee of the American
Counseling Association. The following counseling and education assessment professionals con-
tributed to the drafting of this document: Dr. Bradley T. Erford (Chair), Mr. Alan Basham, Dr. Janet
Wall, Dr. Craig S. Cashwell, and Dr. Gerald Juhnke.
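The standard scores named in item 3 of Box 1.1 are all transformations of the z-score. The sketch below, written in Python, shows one way these conversions could be computed. The function names are hypothetical and chosen only for illustration. The deviation IQ is assumed to use a mean of 100 and a standard deviation of 15, and the percentile rank assumes that scores are normally distributed in the norm group, which will not hold for every instrument.

```python
import math


def z_score(raw, mean, sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (raw - mean) / sd


def t_score(z):
    """T score: mean of 50, standard deviation of 10."""
    return 50 + 10 * z


def deviation_iq(z, sd=15):
    """Deviation IQ: mean of 100; SD of 15 is assumed here."""
    return 100 + sd * z


def percentile_rank(z):
    """Percentage of a normal distribution falling at or below z."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


def stanine(z):
    """Stanine: mean of 5, SD of 2, rounded and limited to the range 1-9."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


def normal_curve_equivalent(z):
    """NCE: mean of 50, SD of 21.06."""
    return 50 + 21.06 * z


# Hypothetical client: raw score of 60 on a test with norm mean 50, SD 10.
z = z_score(60, 50, 10)
print(z, t_score(z), deviation_iq(z), round(percentile_rank(z), 1), stanine(z))
```

Because every score in the example is derived from the same z-score, a counselor can see that a T score of 60 and a deviation IQ of 115 describe the same relative standing in the norm group.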
The Association for Assessment in Counseling and Education's (AACE)
Responsibilities of Users of Standardized Tests — Third Edition (RUST-3) (AACE,
2003a) statement is one of the most important documents speaking to standards for
test users. The RUST-3 statement addresses the issues of test user qualifications, tech-
nical knowledge, test selection, test administration, test scoring, interpreting test re-
sults, and communicating test results.
AACE is a division of ACA and has been collaborating with the practitioner
divisions of ACA (i.e., divisions that serve employment groups, such as school,
mental health, substance abuse, and marriage and family counselors) to develop
training standards for each specialty area. The goal of this initiative is to standard-
ize the assessment training within various counseling specialty areas so that all pro-
fessional counselors emerging from a counselor education program will have the
knowledge, skill, and training to use psychological tests relevant to their clinical
practice. The documents shown in Exhibits 1.a and 1.b, obtained from the
AACE/International Association of Addiction and Offender Counselors
(IAAOC) and the AACE/American School Counselor Association (ASCA),
contain current assessment training standards for the specialty areas of substance
abuse counseling and school counseling. Assessment standards for mental health
counselors, career counselors, and marriage and family counselors are still under
ASSOCIATION FOR ASSESSMENT
IN COUNSELING AND EDUCATION
Standards for Assessment in Substance Abuse Counseling
These training standards provide a description of the knowledge and skills needed by substance abuse counselors in
the areas of assessment and evaluation. Because effectiveness in assessment and evaluation is critical to effective
counseling, these training standards are important for substance abuse counselor education and practice. Consistent with
existing Council for Accreditation of Counseling and Related Educational Programs (CACREP) standards for preparing
counselors, they focus on standards for individual counselors and the content of counselor education programs. The
standards, which represent aspirations for competent professional practice, can be used by counselor and assessment
educators as a guide in the development and evaluation of substance abuse counselor preparation programs, workshops,
in-services, and other continuing education opportunities. They may also be used by substance abuse counselors to
evaluate their own professional development and continuing education needs.
During training, substance abuse counselors should meet each of the following assessment standards and have the
specific skills listed under each standard.
Standard I. Substance abuse counselors are able to assess the effects and withdrawal symptoms of
commonly abused drugs. Substance abuse counselors can:
1. Assess for and recognize acute intoxication syndromes for commonly abused chemicals (i.e., alcohol, benzodiaz-
epines, marijuana, cocaine).
2. Assess for and recognize withdrawal complications (i.e., seizures, delirium tremens, hallucinations).
3. Assess for and recognize the effects of cross-addiction and dual addiction disorders.
4. Assess for and recognize symptoms of inhalant use (e.g. the smell of fuel on clothes, red eyes, runny nose,
cough).
Standard II. Substance abuse counselors can assess the broad spectrum of concomitant disorders. Substance
abuse counselors can:
1. Assess for other addictive disorders (i.e., gambling, food, sex).
2. Determine if a psychological disorder (i.e., anxiety, depression, panic, Post Traumatic Stress Disorder) was present
prior to, or the result of, clients' substance use.
3. Assess for Attention-Deficit/Hyperactivity Disorder (AD/HD).
4. Assess for suicidal or homicidal ideation.
5. Assess for the presence or possibility of domestic violence.
6. Use and interpret the results of adult and adolescent intelligence instruments.
Standard III. Substance abuse counselors are skilled in evaluating the technical quality and appropriateness
of testing instruments. Substance abuse counselors can:
1. Identify acceptable reliability levels for instruments.
2. Identify appropriate types of validity for commonly-used instruments.
3. Evaluate the procedures used to validate commonly-used instruments.
4. Locate testing instruments and information about instruments for special populations (e.g. visually impaired,
nonreaders).
5. Use computerized assessment instruments.
6. Articulate the limitations of commonly-used instruments within the substance abuse counseling field.
Standard IV. Substance abuse counselors are knowledgeable regarding qualitative assessment procedures
including structured and semi-structured clinical interviews. Substance abuse counselors:
1. Are familiar with the advantages and disadvantages of structured and semi-structured clinical interviews.
2. Are familiar with qualitative assessment procedures (e.g. role playing, life line assessments, direct and indirect
observations).
3. Understand the advantages and disadvantages of qualitative assessment procedures.
4. Understand the concepts of continuous assessment and wraparound services.
Exhibit 1.a Standards for Assessment in Substance Abuse Counseling
Source: Reprinted by permission of the Association for Assessment in Counseling/American Counseling Association.
Standard V. Substance abuse counselors employ multiple methods when assessing clients and monitoring
the efficacy of treatment. Substance abuse counselors:
1. Use paper and pencil or computerized instruments and structured interviews, as appropriate.
2. Whenever possible, consult with and interview family, friends, and other corroborating sources of information,
while always obtaining written consent to gather information from sources other than the client.
3. Monitor client progress throughout the counseling process.
Standard VI. Substance abuse counselors are skilled in interpreting assessment results with clients.
Substance abuse counselors can:
1. Interpret assessment results in a helpful manner that emphasizes clients' strengths as well as possible problem
areas.
2. Explain to clients the steps that are necessary to share testing results with others (e.g. informed consent).
Standard VII. Substance abuse counselors are skilled in using assessment results to develop and evaluate
effective treatment interventions. Substance abuse counselors can:
1. Accurately score, analyze, and interpret the results of testing.
2. Create specific treatment plans based upon the results of testing.
Standard VIII. Substance abuse counselors are aware of the need for professional development within the
assessment area. Substance abuse counselors:
1. Participate in training needed to keep abreast of new assessment instruments, procedures, and issues.
2. Keep up to date with advancements in the field of assessment by reading the appropriate professional journals,
test manuals, and reports.
3. Join professional associations that provide relevant assessment and substance abuse information.
Standard IX. Substance abuse counselors are aware of the appropriate use of assessment instruments in
research. Substance abuse counselors use assessment instruments:
1. To determine the efficacy of their interventions.
2. Appropriate for the intended population/clients.
3. In accordance with the American Counseling Association's Ethical Standards, Code of Fair Testing Practices,
Standards for Educational and Psychological Testing, Responsibilities of Users of Standardized Tests, and Test
Takers' Rights and Responsibilities.
Standard X. Counselor educators and supervisors of substance abuse counselors-in-training are able
to effectively train counselors in the area of substance abuse assessment. Counselor educators and
supervisors:
1. Keep current with scholarship related to how to teach counselors-in-training how to best use assessment
instruments in their work with clients.
2. Are knowledgeable in the selection, use, evaluation, and interpretation of assessment instruments.
Definitions of Terms
Assessment: active collection of information about individuals, populations, or treatment programs.
Instruments: standardized or nonstandardized tests, interviews, rating scales, inventories, or checklists used by mental health counselors
to better understand the client; the client's past history; the client's current social, employment, physical or interpersonal
environment; the client's intellectual functioning; the client's personality; or the client's presenting concerns.
Standards: minimal levels of skill, knowledge, or training.
Structured clinical interviews: clinical interviews with individuals, couples, families, or groups in which the mental health counselor asks
questions precisely as directed by the instrument's author(s). Questions are posed in the order defined by the authors, and
responses are recorded according to specific directions.
Unstructured clinical interviews: clinical interview in which the mental health counselor is free to pursue related lines of inquiry to gain
needed or pertinent information.
Source: Reprinted with permission from the Association for Assessment in Counseling and Education. No further reproduction authorized without
written permission from the Association for Assessment in Counseling and Education.
Note: These standards were developed as a joint effort between the Association for Assessment in Counseling and Education (AACE) and the
International Association of Addictions and Offender Counselors (IAAOC). The joint committee included Dr. Bradley T. Erford (Chair), Dr. Gerald
Juhnke, Dr. Russell Curtis, Mr. Joe Jordan, and Dr. Kenneth Coll.
COMPETENCIES IN ASSESSMENT AND EVALUATION FOR SCHOOL COUNSELORS
Approved by the American School Counselor Association
on September 21, 1998,
and by the Association for Assessment in Counseling
on September 10, 1998*
The purpose of these competencies is to provide a description of the knowledge and skills that school counselors need in the areas
of assessment and evaluation. Because effectiveness in assessment and evaluation is critical to effective counseling, these competencies
are important for school counselor education and practice. Although consistent with existing Council for Accreditation of Counseling and
Related Educational Programs (CACREP) and National Association of State Directors of Teacher Education and Certification (NASDTEC)
standards for preparing counselors, they focus on competencies of individual counselors rather than content of counselor education
programs.
The competencies can be used by counselor and assessment educators as a guide in the development and evaluation of school
counselor preparation programs, workshops, inservice, and other continuing education opportunities. They may also be used by school
counselors to evaluate their own professional development and continuing education needs.
School counselors should meet each of the nine numbered competencies and have the specific skills listed under each competency.
Competency 1. School counselors are skilled in choosing assessment strategies.
a. They can describe the nature and use of different types of formal and informal assessments, including questionnaires, checklists,
interviews, inventories, tests, observations, surveys, and performance assessments, and work with individuals skilled in clinical
assessment.
b. They can specify the types of information most readily obtained from different assessment approaches.
c. They are familiar with resources for critically evaluating each type of assessment and can use them in choosing appropriate
assessment strategies.
d. They are able to advise and assist others (e.g., a school district) in choosing appropriate assessment strategies.
Competency 2. School counselors can identify, access, and evaluate the most commonly used assessment instruments.
a. They know which assessment instruments are most commonly used in school settings to assess intelligence,
aptitude, achievement, personality, work values, and interests, including computer-assisted versions and other
alternate formats.
b. They know the dimensions along which assessment instruments should be evaluated, including purpose, validity,
utility, norms, reliability and measurement error, score reporting method, and consequences of use.
c. They can obtain and evaluate information about the quality of those assessment instruments.
Competency 3. School counselors are skilled in the techniques of administration and methods of scoring assessment
instruments.
a. They can implement appropriate administration procedures, including administration using computers.
b. They can standardize administration of assessments when interpretation is in relation to external norms.
c. They can modify administration of assessments to accommodate individual differences consistent with publisher
recommendations and current statements of professional practice.
d. They can provide consultation, information, and training to others who assist with administration and scoring.
e. They know when it is necessary to obtain informed consent from parents or guardians before administering an
assessment.
Competency 4. School counselors are skilled in interpreting and reporting assessment results.
a. They can explain scores that are commonly reported, such as percentile ranks, standard scores, and grade
equivalents. They can interpret a confidence interval for an individual score based on a standard error of
measurement.
b. They can evaluate the appropriateness of a norm group when interpreting the scores of an individual or a group.
c. They are skilled in communicating assessment information to others, including teachers, administrators, students,
parents, and the community. They are aware of the rights students and parents have to know assessment results
and decisions made as a consequence of any assessment.
d. They can evaluate their own strengths and limitations in the use of assessment instruments and in assessing
students with disabilities or linguistic or cultural differences. They know how to identify professionals with
appropriate training and experience for consultation.
e. They know the legal and ethical principles about confidentiality and disclosure of assessment information and
recognize the need to abide by district policy on retention and use of assessment information.
Source: Reprinted with permission from the Association for Assessment in Counseling and Education. No further reproduction authorized without written
permission from the Association for Assessment in Counseling and Education.
*A joint committee of the American School Counselor Association (ASCA) and the Association for Assessment in Counseling (AAC) was appointed by the
respective presidents in 1993 with the charge to draft a statement about school counselor preparation in assessment and evaluation. Committee
members were Ruth Ekstrom (AAC), Patricia Elmore (AAC, Chair, 1997-1999), Daren Hutchinson (ASCA), Marjorie Mastie (AAC), Kathy O'Rourke (ASCA),
William Schafer (AAC, Chair, 1993-1997), Thomas Trotter (ASCA), and Barbara Webster (ASCA).
Exhibit 1.b Competencies in Assessment and Evaluation for School Counselors
Competency 5. School counselors are skilled in using assessment results in decision making.
a. They recognize the limitations of using a single score in making an educational decision and know how to obtain multiple
sources of information to improve such decisions.
b. They can evaluate their own expertise for making decisions based on assessment results. They also can evaluate the limitations of
conclusions provided by others, including the reliability and validity of computer-assisted assessment interpretations.
c. They can evaluate whether the available evidence is adequate to support the intended use of an assessment result for decision
making, particularly when that use has not been recommended by the developer of the assessment instrument.
d. They can evaluate the rationale underlying the use of qualifying scores for placement in educational programs or courses of
study.
e. They can evaluate the consequences of assessment-related decisions and avoid actions that would have unintended negative
consequences.
Competency 6. School counselors are skilled in producing, interpreting, and presenting statistical information about
assessment results.
a. They can describe data (e.g., test scores, grades, demographic information) by forming frequency distributions, preparing tables,
drawing graphs, and calculating descriptive indices of central tendency, variability, and relationship.
b. They can compare a score from an assessment instrument with an existing distribution, describe the placement of a score within
a normal distribution, and draw appropriate inferences.
c. They can interpret statistics used to describe characteristics of assessment instruments, including difficulty and discrimination
indices, reliability and validity coefficients, and standard errors of measurement.
d. They can identify and interpret inferential statistics when comparing groups, making predictions, and drawing conclusions
needed for educational planning and decisions.
e. They can use computers for data management, statistical analysis, and production of tables and graphs for reporting and
interpreting results.
Competency 7. School counselors are skilled in conducting and interpreting evaluations of school counseling programs
and counseling-related interventions.
a. They understand and appreciate the role that evaluation plays in the program development process throughout the life of a
program.
b. They can describe the purposes of an evaluation and the types of decisions to be based on evaluation information.
c. They can evaluate the degree to which information can justify conclusions and decisions about a program.
d. They can evaluate the extent to which student outcome measures match program goals.
e. They can identify and evaluate possibilities for unintended outcomes and possible impacts of one program on other programs.
f. They can recognize potential conflicts of interest and other factors that may bias the results of evaluations.
Competency 8. School counselors are skilled in adapting and using questionnaires, surveys, and other assessments to
meet local needs.
a. They can write specifications and questions for local assessments.
b. They can assemble an assessment into a usable format and provide directions for its use.
c. They can design and implement scoring processes and procedures for information feedback.
Competency 9. School counselors know how to engage in professionally responsible assessment and evaluation
practices.
a. They understand how to act in accordance with ACA's Code of Ethics and Standards of Practice and ASCA's Ethical Standards for
School Counselors.
b. They can use professional codes and standards, including the Code of Fair Testing Practices in Education, Code of Professional
Responsibilities in Educational Measurement, Responsibilities of Users of Standardized Tests, and Standards for Educational and
Psychological Testing, to evaluate counseling practices using assessments.
c. They understand test fairness and can avoid the selection of biased assessment instruments and biased uses of assessment
instruments. They can evaluate the potential for unfairness when tests are used incorrectly and for possible bias in the interpreta-
tion of assessment results.
d. They understand the legal and ethical principles and practices regarding test security, copying copyrighted materials, and
unsupervised use of assessment instruments that are not intended for self-administration.
e. They can obtain and maintain available credentialing that demonstrates their skills in assessment and evaluation.
f. They know how to identify and participate in educational and training opportunities to maintain competence and acquire new
skills in assessment and evaluation.
Definitions of Terms
Competencies describe skills or understandings that a school counselor should possess to perform assessment and evaluation activities
effectively.
Assessment is the gathering of information for decision making about individuals, groups, programs, or processes. Assessment targets
include abilities, achievements, personality variables, aptitudes, attitudes, preferences, interests, values, demographics, and other
characteristics. Assessment procedures include but are not limited to standardized and unstandardized tests, questionnaires, inventories,
checklists, observations, portfolios, performance assessments, rating scales, surveys, interviews, and other clinical measures.
Evaluation is the collection and interpretation of information to make judgments about individuals, programs, or processes that lead to
decisions and future actions.
Exhibit l.b continued
development. Efforts such as these have the goal of standardizing and formalizing
the education and training required for professional counselors in various specialty
areas to effectively use psychological tests.
The right and responsibility to administer, score, and interpret psychological
and educational tests involve the concerted efforts of professional counselors, legis-
lators, state counseling board members, government bureaucrats, test publishers, ad-
vocates, professional associations and affiliates, and the public. Protection of this
right to test must occur continuously on several fronts, including laws, regulations,
ethics, professional training, and professional practice. Professional counselors are
encouraged to join professional associations and become actively engaged in legisla-
tive and regulatory advocacy to benefit and protect the public safety and right to ac-
cess quality, affordable counseling services.
ASSESSMENT TERMS AND CONCEPTS
The field of assessment contains many concepts that are essential to understand and
remember. These concepts vary in degree of simplicity, familiarity, and abstractness.
The list of terms and concepts presented in this section also serves as a way of clas-
sifying and describing most tests that professional counselors will encounter and use.
One of the things that makes assessment such a challenging area of study is its new
and unusual terminology, causing some professional counselors to suggest that as-
sessment is a language unto itself. In that spirit, the reader is well advised to spend
the time needed to master the concepts in the remainder of this chapter. These con-
cepts are the building blocks for understanding the field of assessment and for com-
prehending the content in the remainder of this book and in the published test man-
uals one will encounter.
Standardized (Formal) and Nonstandardized (Informal) Tests
Standardized tests have specific conditions for administration, timing, and scoring.
This systematic process ensures that no matter who the examiner or examinee, the
test will be administered under strict, replicable conditions. Standardized procedures
allow comparability of scores and interpretations across different examinees and for
the same examinee across administration times. Nonstandardized tests and other in-
formal measures do not provide systematic measurements, nor are the administra-
tion and scoring criteria fixed. Thus nonstandardized tests do not allow for compa-
rability across examinees or administration times. In addition, standardized tests
attempt to conform to rigorous test construction guidelines for establishing the re-
liability and validity of scores, whereas nonstandardized tests may not.
It is essential to understand that each method has advantages and disadvantages.
For example, when interviewing, the professional counselor can use a structured in-
terview (standardized), an unstructured interview (nonstandardized), or a semi-struc-
tured interview (standardized format with leeway for unstructured questioning). The
advantage of the structured interview is that different professional counselors inter-
viewing the same client will likely reach the same conclusion because they ask the
same questions and will probably get the same answers. This enhances the reliability
(and probably the validity) of the procedure. On the other hand, different profes-
sional counselors interviewing the same client using an unstructured interview will
ask different questions, will likely get different results, and will possibly reach differ-
ent conclusions. The use of nonstandardized procedures more frequently leads to
variable results because of a lack of systematic methodology.
Norm-Referenced and Criterion-Referenced Tests
In most cases, standardized tests are administered to a representative sample of par-
ticipants, called a standardization sample, to determine average performances for
various subgroups of interest (e.g., age, grade, male, female). Each such subgroup is
often called a norm group. A client's score on this norm-referenced test can then
be compared to the standardization sample results to determine where the client's
score falls within that distribution of scores (e.g., Average, Above Average, Below
Average). Thus norm-referenced tests allow comparison of a person's score to the
scores of a comparison group with like characteristics (e.g., sex, age) that has al-
ready taken the test. Norm-referenced tests are commonly used to assess intelli-
gence, achievement, perceptual skills, personality, and behavior. Often the raw
score obtained by a client is transformed into some type of standard score or per-
centile rank. Note that the client's score simply indicates the individual's position
relative to others in the sample, not whether the client "passed" or "failed" the test
or is diagnosed with some mental disorder. Such judgments require the use of a
criterion.
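As a rough illustration of the raw-score transformation described above, the following Python sketch converts a raw score to a z score, a T score, and an approximate percentile rank. The norm-group mean and standard deviation are invented for the example, and the percentile rank assumes normally distributed scores; published tests instead supply norm tables derived from the actual standardization sample.

```python
from statistics import NormalDist

def standard_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to z, T, and an approximate percentile rank.

    norm_mean and norm_sd describe a hypothetical norm group; the
    percentile rank assumes scores are normally distributed.
    """
    z = (raw - norm_mean) / norm_sd
    return {
        "z": z,
        "T": 50 + 10 * z,                              # T scale: M = 50, SD = 10
        "percentile_rank": 100 * NormalDist().cdf(z),
    }

# Hypothetical norm group: mean raw score 28, standard deviation 6.
result = standard_scores(raw=34, norm_mean=28, norm_sd=6)
print(result)  # z = 1.0, T = 60.0, percentile rank of about 84
```

The result locates the client relative to the norm group only; it does not indicate whether the client "passed" or "failed."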
Criterion-referenced tests compare a person's score to a predetermined standard
or level of performance — a criterion. Often a criterion-referenced test is administered
to a standardization sample to help establish the criterion scores. Criterion-refer-
enced tests are common in education because most teacher-made tests and perform-
ance-based assessments have a standard for determining successful performance. For
example, on a high-stakes state achievement test, a criterion for passing may be es-
tablished at a cutoff score of 79; thus any student scoring at 79 or higher has "passed"
the test; those below 79 did not. Likewise, on a depression screening test, a clinician
may determine that scores of 20 and higher require further diagnostic evaluation, so
a client receiving a score of 16 on the screening test would not meet the minimum
criterion. Many DSM-IV-TR diagnostic checklists are set up to facilitate criterion-
referenced decision making. For example, a diagnosis of Generalized Anxiety
Disorder requires the documentation of three or more of the six specific listed diag-
nostic criteria to a significant degree.
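The criterion-referenced decisions in these examples reduce to comparing a score with a predetermined cutoff. The short Python sketch below uses the cutoffs mentioned in the text; the function name and the checklist pattern are hypothetical and serve only to illustrate the logic.

```python
def meets_criterion(score, cutoff):
    """Return True when the score is at or above the predetermined cutoff."""
    return score >= cutoff

# Cutoffs taken from the examples in the text.
passed_state_test = meets_criterion(score=79, cutoff=79)   # True: 79 or higher passes
needs_evaluation = meets_criterion(score=16, cutoff=20)    # False: below the screening cutoff

# Checklist-style criterion: at least 3 of 6 listed criteria documented.
# The True/False pattern below is invented for illustration.
criteria_documented = [True, False, True, True, False, False]
meets_checklist = meets_criterion(sum(criteria_documented), cutoff=3)  # True

print(passed_state_test, needs_evaluation, meets_checklist)
```

Note that no comparison group enters the decision; only the standard itself matters.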
While most tests are designed to be norm referenced or criterion referenced,
some diagnostic, clinical, and research decisions are made by applying criterion-
referenced standards to norm-referenced results. For example, it is widely believed
that the prevalence of Attention-Deficit/Hyperactivity Disorder (AD/HD) in the
childhood population is about 5%. The Conners' Teacher Rating Scale — Revised
(CTRS-R) (Conners, 1997) is a norm-referenced behavior rating scale commonly
used in assessing AD/HD. The CTRS-R yields a T score (M = 50; SD = 10).
Applying the principles of the normal curve, it can be determined that a T score
of 67 or higher would represent the highest 5% (most hyperactive, most
distractible) of a school-aged population. Thus, even though the CTRS-R is a
norm-referenced test, a clinician or researcher could use a criterion cut-score of
T ≥ 67 to identify children with AD/HD.
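The cut-score of 67 can be checked with a few lines of Python. This is a sketch that assumes ratings are normally distributed on the T scale (M = 50; SD = 10); the variable names are illustrative and are not taken from the CTRS-R manual.

```python
import math
from statistics import NormalDist

t_scale = NormalDist(mu=50, sigma=10)        # T scores: M = 50, SD = 10

exact_cut = t_scale.inv_cdf(0.95)            # T score at the 95th percentile
whole_cut = math.ceil(exact_cut)             # smallest whole T score at or beyond it
share_at_or_above = 1 - t_scale.cdf(whole_cut)

# Prints about 66.45, then 67, then roughly 0.045
print(round(exact_cut, 2), whole_cut, round(share_at_or_above, 3))
```

Because T scores are reported as whole numbers, rounding up to 67 flags slightly fewer than 5% of a normally distributed population.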
Individual and Group Tests and Inventories
Some tests and inventories are designed to be administered to only a single exami-
nee at a time; others are designed for administration to groups of participants simul-
taneously. The advantages of group tests are speed and efficiency. At the same time,
there are limitations in the type of group administration formats available, usually
involving paper-and-pencil and response booklet or Scantron (bubble) formats.
Professional school counselors most frequently use or encounter group assessments
involving achievement, aptitude, and ability within large-scale testing programs
(Gibson & Mitchell, 1999). A major drawback of group-administered assessment is
the inability to observe all examinees and control the factors that sometimes influ-
ence student performance, the most important of which is student motivation.
Individual tests are often used for diagnostic decision making and generally re-
quire some interaction between the examiner and examinee. They allow the exam-
iner to establish rapport, reduce anxiety, observe verbal and nonverbal behaviors, and
pace the evaluation by providing breaks to decrease fatigue. Often the tasks admin-
istered in an individual test require special training, expertise, materials, and timing
or scoring procedures that require individual attention. The individual administra-
tion format also gives the student or client the opportunity to demonstrate a deeper
mastery of skill by allowing the examiner to query responses and provide instruction,
and the examinee to clarify questions and task demands.
Objective and Subjective Tests
The terms objective test and subjective test refer to the method of scoring used in a
given testing procedure. Objective tests leave no doubt as to the correctness of a
given answer; correct answers are predetermined and require no judgment on the
part of the examiner. As a result, regardless of who scores the test, the result will be
the same. Multiple-choice and true-false items are examples of objectively scored
questions. Subjective tests require the examiner to make a judgment on the quality of
the response in scoring an item. Essay, constructed-response, and open-ended ques-
tions ordinarily require some judgment. Objective items help to control subjective
bias in scoring procedures (i.e., help to improve interscorer reliability). Many client
characteristics assessed by professional counselors can be determined by objective
methods; other characteristics or issues in the lives of clients are more easily assessed
through subjective methods.
Speed and Power Tests
Different tests have differing classifications of item difficulty and response rates.
Speeded tests generally include a large number of simple items. The task is to meas-
ure how many of the simple items a person can complete within a certain amount
of time. The test is structured so that very few, if any, examinees complete all of the
items, and the score is simply the number of (correct) items completed within the
time limit (i.e., a person's response rate). Tests of fluency and processing speed com-
monly use speeded procedures. For example, the Math Fluency subtest of the WJ-III
(Woodcock, Mather, & McGrew, 2001) presents the examinee with 160 simple
calculation problems (e.g., 2 + 4 = ?, 1 × 4 = ?) within a three-minute time limit. The
examinee writes the numerical answer for each problem. The items are so simple that
very few errors are made, and the person's raw score is the number of items correct.
Obviously, the faster the examinee can compute and respond to simple math calcu-
lation problems, the higher the score.
A power test generally has fewer items, but they are of varying levels of diffi-
culty, and there are no time limits. The examinee can take as much time as needed
to work each problem, and the score is the number of items responded to correctly.
In some instances, more difficult items may be worth more points than less difficult
items. This kind of examination is called a power test because the score is an indica-
tor of the skills or abilities possessed by the examinee, without the pressure of time
limits. Generally, some items are so difficult that perfect scores are rare. When meas-
uring math computation skill, the Math Calculation subtest from the WJ-III may
be used. This subtest presents math calculation problems of varying difficulty levels
(e.g., 3 + 4 = ?; 420 × 24 = ?; 3/4 − 1/4 = ?; 2x + 1 = 13, therefore x = ?), and the
examinee's raw score is an indicator of the amount of math skill possessed. The items vary
in difficulty, and most examinees eventually miss many items in a row (i.e., reach the
ceiling level), at which time administration of the subtest ceases. The more proficient
an examinee is in math calculation, the higher the person's score.
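The difference between the two scoring approaches can be sketched in Python. The functions, timings, and point values below are hypothetical; they illustrate only the general logic of a speeded score (correct items finished within the time limit) versus a power score (points earned with no time limit), not the scoring rules of any published test.

```python
def speeded_score(correct, seconds_per_item, time_limit):
    """Count correct items finished within the time limit (a response rate)."""
    elapsed, score = 0.0, 0
    for is_correct, seconds in zip(correct, seconds_per_item):
        elapsed += seconds
        if elapsed > time_limit:      # time ran out before this item was finished
            break
        score += is_correct
    return score

def power_score(correct, points=None):
    """Sum the points for correct items; no time limit is applied."""
    if points is None:
        points = [1] * len(correct)   # default: one point per item
    return sum(p for is_correct, p in zip(correct, points) if is_correct)

# Speeded example: 160 simple items, 180-second limit, every answer correct.
all_correct = [True] * 160
fast = speeded_score(all_correct, [1.5] * 160, time_limit=180)   # 120 items
slow = speeded_score(all_correct, [3.0] * 160, time_limit=180)   # 60 items

# Power example: four items of rising difficulty worth 1, 1, 2, and 3 points.
power = power_score([True, True, False, True], points=[1, 1, 2, 3])  # 5 points

print(fast, slow, power)
```

Both hypothetical examinees answer every speeded item correctly, so their scores differ only because of response rate.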
Interestingly, even though some tests are classified as pure speeded tests or pure
power tests, many tests include both facets — that is, they are designed as power tests
with varying item difficulties but are administered under time limits. Usually, these
time limits are sufficient for the majority of test takers to complete the examination.
However, slower (for whatever reason) test takers often run out of time. For
example, the Scholastic Assessment Test (SAT), commonly used for college admissions
decisions by American universities, is designed as a power test with items of widely
varying difficulties, but it is administered under time-limited conditions.
Importantly, time limit constraints frequently put disabled examinees at a distinct
disadvantage, which is why many students and adults with documented learning dis-
abilities or who receive accommodations under Section 504 of the U.S.
Rehabilitation Act of 1973 petition for and receive extended time accommodations.
Verbal and Nonverbal Tests
Some verbal tests rely heavily on language usage, particularly oral or written re-
sponses. These verbal responses require an examinee first to understand or compre-
hend instructions, questions, and other task demands; then to verbally mediate and
construct an appropriate response; and finally to deliver an oral or written response
that passes the scoring criterion for the item. Even if a task does not require a verbal
response, if the instructions are given orally, some verbal skill is required. Over the
past several decades, professional counselors have become acutely aware of the
impact of culture on language development and usage, particularly with persons for
whom English is not their primary language.
Figure 1.1 Matrix Design
On the other hand, nonverbal tests require students and clients to solve and re-
spond to problems without the use of language. Sometimes these tests are called non-
language tests, or performance tests. (Note: The use of the term performance in the con-
text of nonverbal assessment here differs somewhat from its use in the section on
performance assessment later in this chapter.) For example, on a typical matrix anal-
ogy test, an examinee may be asked to look at several related designs and to select
from among several choices the design that would either complete the pattern or pre-
dict which design would appear next in the sequence (see Figure 1.1). Or, with block
pattern items, such as those found on the Slosson Intelligence Test — Primary (SIT-P)
(Erford, Vitali, & Slosson, 1999) (see Figure 1.2), a client may be given several cubes
(all black on two sides, all white on two sides, and half black-half white on the other
two sides), shown a picture of the blocks making a certain design, then asked to put
the blocks together so they look just like the picture. Such tasks minimize verbal
input and require spatial, figural, or visual processing skills — all nonverbal intellec-
tual processes.
Figure 1.2 Pattern Design
It is easy to assume that someone who is very intelligent would excel at both
verbal and nonverbal tasks, that someone with average intelligence would perform in an
average capacity on verbal and nonverbal tasks, and that someone who is not very in-
telligent at all would do poorly on both types of tasks. Indeed, this is very frequently
the case, though by no means always. An intelligent non-English-speaking client or
a learning-disabled student may struggle tremendously on verbally laden tasks (as
would be expected) while performing in an outstanding manner on the nonverbal
tasks. Because culture influences language, examiners must take extra measures to
ensure the fairness of the examination (i.e., be unbiased). On the other hand, indi-
viduals with some degree of brain damage or a visual processing disorder, or those
who have an accelerated learning environment, may demonstrate verbal capabilities
far superior to their nonverbal capabilities. Most tests have some verbal component,
even if it is only some brief verbal or written instructions. It is the examiner's legal,
ethical, and professional responsibility to ensure that examinees receive a fair, unbi-
ased assessment that reflects the examinee's abilities to the greatest extent possible. In
all instances, professional counselors must take into account the extent to which lan-
guage and cultural influences may affect student or client results.
Cognitive and Affective Tests
Cognitive ability tests generally fall into three categories: intelligence, aptitude, and
achievement. They all measure, to various degrees, perceptual, processing, memory,
and reasoning capabilities. Intelligence tests measure a person's ability to learn, solve
problems, and understand increasingly complex or abstract information. Commonly
used tests of intelligence include the Wechsler Adult Intelligence Scale — Third Edition
(WAIS-III) (Wechsler, 1997) and the Stanford-Binet Intelligence Scale — Fifth Edition
(SBIS-5) (Roid, 2003). Aptitude tests, in general, predict a person's capacity to perform
some skill or task in the future (e.g., college, a training program). Aptitude tests have
broad educational and vocational applications. For example, the SAT has been used
for decades by university admissions personnel to determine which college applicants
are likely to do well in college (actually, the freshman year of college). Also, the
Differential Aptitude Tests (DAT) (The Psychological Corporation, 1991a) are com-
monly used as part of a vocational assessment battery to help high school students un-
derstand the potential vocational strengths and weaknesses each possesses.
Achievement tests are commonly used in education to measure knowledge students
have acquired through instruction or training up to a certain point in their academic
career. Achievement tests can be norm referenced (comparing the examinee with other
students) or criterion referenced (comparing the examinee with a standard of mas-
tery). Nearly all teacher-made, classroom-administered tests are achievement tests and
are usually criterion referenced. However, many individually administered diagnos-
tic- and screening-level tests have been developed, including the WJ-III (Woodcock,
McGrew, & Mather, 2001); the Wechsler Individual Achievement Test — Second Edition
(WIAT-II) (Wechsler, 2001b); and the Peabody Individual Achievement Test — Revised
(PIAT-R) (Markwardt, 1998). Also, most states have mandated high-stakes
achievement testing programs and contract with test publishers to develop standardized
achievement tests that align with specific state educational standards.
Affective assessment is a broad category that, in general, assesses all noncogni-
tive features of an individual, including temperament, clinical disposition, personal-
ity, attitudes, values, and interests. Both structured and unstructured assessments are
commonly used in affective assessment. Professional counselors frequently use struc-
tured (formal) personality inventories for diagnostic purposes, hypothesis testing, treat-
ment planning, and progress evaluation. Commonly used structured inventories in-
clude the Minnesota Multiphasic Personality Inventory — II (MMPI-2) (Butcher et al.,
1992); the Millon Clinical Multiaxial Inventory — III (MCMI-III) (Millon, Davis, &
Millon, 1997); and the Strong Interest Inventory (Harmon, Hansen, Borgen, &
Hammer, 1994). Unstructured (informal) assessment often involves the use of projec-
tive techniques and qualitative methods. Projective techniques are based on psycho-
analytic theory and normally present the client with unstructured, ambiguous stim-
uli, allowing the client to "project" thoughts and feelings onto the stimulus.
Examples of ambiguous stimuli include inkblots, pictures, incomplete sentences, or
even a single word. Such unstructured tasks give the client great latitude in how to
respond or as to the content of the response, and it is incumbent upon the
professional counselor to analyze and interpret the responses to yield insights into a client's
motivation, personality, values, and so forth. An advantage of a projective technique
is that because it is ambiguous and there are no right or wrong answers, it is difficult
for clients to fake responses. Their responses were simply based on what came into
their mind at the time they responded to the task. A disadvantage of projective tech-
niques is that some of the tests require extensive education and training. Examples
of projective tests include inkblot techniques such as the Rorschach Inkblot Test
(Rorschach, 1969); picture-story techniques such as the Thematic Apperception Test
(TAT) (Murray & Bellak, 1973) and Roberts Apperception Test for Children
(McArthur & Roberts, 1994); drawing and query techniques such as the House-Tree-
Person (H-T-P) (Van Hutton, 1994) and Kinetic Drawing System for Family and
School (Knoff & Prout, 1985); and completion techniques such as incomplete sen-
tences or word association.
Maximum and Typical Performance Measurement
In maximum performance measurement, the professional counselor strives to assess
the best performance of which the examinee is capable. In this way, the examiner has
a good estimate of the upper level of achievement or ability at which the client could
be expected to perform. When conducting diagnostic assessment for the determination
of a learning disability, the examiner strives to obtain maximum ability and
achievement estimates because such decisions have important long-term implications.
In typical performance measurement, the professional counselor seeks to ob-
tain a sample of the client's performance under normal circumstances, or on a "typ-
ical day." Professional counselors conducting clinical, personality, or vocational as-
sessments often strive for typical performance estimates to understand the client's
performance under normal circumstances. In this way, the professional counselor
gets to know the client's habitual thoughts, feelings, interests, and behaviors.
Behavioral Observations
Unfortunately, many people view assessment only as the administration of tests. But
assessment of any kind relies heavily on behavioral observations, observations that
begin at the moment the professional counselor speaks to or meets the client or stu-
dent for the first time. Observations can be conducted through either direct or indi-
rect means. One common form of direct observation is direct behavioral assessment, in
which the professional counselor is actually physically present in the same environ-
ment with the client and uses a data collection procedure to assess the frequency, du-
ration, and/or magnitude of one or more target behaviors. For example, a professional
school counselor may observe a 2nd-grade student referred for overactivity by using a
time-on-task observation system. Briefly, such a procedure allows the counselor to ob-
serve the frequency of the target student's (the student suspected of being hyperac-
tive) and of one or two control students' (students of the same sex, but not suspected
to be substantially hyperactive) motor on-task behavior during classroom activities.
Such observations allow the counselor to determine whether the target student is sub-
stantially more overactive than other children of the same age. Anecdotal observations
are also commonly used and allow the observer to document in a narrative format
what was observed during an observation period. The purpose of an anecdotal report
is to describe client behaviors in some detail so that, over time, a rich understanding
of the factors surrounding the behavior can be obtained. Often special training is re-
quired of observers to minimize bias and enhance inter-observer reliability (i.e., agree-
ment between the observations or ratings of two or more observers).
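One simple way to summarize such observations is sketched below in Python. The interval codes are invented, and percent agreement is only one of several possible indexes of inter-observer reliability; the text does not prescribe a particular formula, so the functions here are illustrative assumptions.

```python
def percent_agreement(observer_a, observer_b):
    """Percentage of intervals on which two observers recorded the same code."""
    if len(observer_a) != len(observer_b):
        raise ValueError("Both observers must rate the same intervals.")
    matches = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100 * matches / len(observer_a)

def on_task_rate(intervals):
    """Percentage of observation intervals coded as on-task (1)."""
    return 100 * sum(intervals) / len(intervals)

# Invented interval codes: 1 = on-task, 0 = off-task, ten intervals each.
target_by_a  = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # target student, observer A
target_by_b  = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]   # same intervals, observer B
control_by_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # control student, observer A

agreement = percent_agreement(target_by_a, target_by_b)   # 90.0
target_rate = on_task_rate(target_by_a)                   # 40.0
control_rate = on_task_rate(control_by_a)                 # 80.0
print(agreement, target_rate, control_rate)
```

In this invented record the two observers disagree on one interval of ten, and the target student is on task half as often as the control student.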
Behaviors can also be assessed through indirect observation, usually using behav-
ior rating scales or checklists. These instruments ask questions of people (e.g., spouse,
parent, teacher, peer) in a good position to observe the typical behavior of a student
or client and provide responses that give the professional counselor multiple perspec-
tives and valuable clinical insights. Some behavioral disorders (e.g., AD/HD) require
that problematic behaviors be observed in more than one setting, and behavior rat-
ing scales completed by parents or teachers help to verify student or client difficul-
ties in a time-efficient manner.
Basals, Starting Points, and Ceilings
Many intelligence, aptitude, and achievement tests present items in order of increas-
ing difficulty. For example, most subtests on the Woodcock-Johnson: Tests of
Achievement— Third Edition (WJ-III ACH) (Woodcock, Mather, & McGrew, 2001)
present items in approximate order from least difficult to most difficult. Likewise,
the Slosson Intelligence Test — Revised (SIT-R) (Nicholson & Hibpshman, 1990)
presents 187 verbal ability items in an order that approximates least to most difficult.
This hierarchical ordering allows administration procedures that substantially en-
hance efficiency and speed. Because the items are in approximate order from least
difficult to most difficult, it is logical to assume that if a student gets item 11 correct,
the odds are good that the student would also get items 1-10 correct, because each
is easier than item 11. One can easily see how much faster administration would be
if any examinee getting item 11 correct would not need to answer items 1-10. Of
course, this is only an assumption, and exceptions do occur on a frequent basis.
However, test developers have determined that the probability of violating this as-
sumption diminishes tremendously when a series of consecutive items is used. A
basal series is a predetermined number of consecutive, correct items that must be
obtained by an examinee in order to eliminate the need to administer numerous eas-
ier items on the same test or subtest. For example, the SIT-R requires a basal series
of 10 in a row correct, while many subtests on the WJ-III ACH require a basal of 6
in a row correct. Establishing a basal series gives the examiner confidence that, if the
examinee were administered all the items preceding the basal, the examinee would
get them all correct. Again this is an assumption, but one backed up by substantial
statistical probability. The assumption is generally true in 95% or more of the cases,
and when it is not true, the examinee almost never misses more than one or two of
the easier items. Thus the examinees' scores ordinarily are not substantially inflated.
Of course, there is no need to establish a basal series if all examinees begin
administration with item 1. That is why many test developers establish starting points
for administration based on the age or grade of the examinee. For example, an
8-year-old being administered the 187-item SIT-R would ordinarily begin with item
55, a 13-year-old at item 105. These starting points are usually designated by deter-
mining the point at which nearly all (i.e., 95%) of 8-year-olds or 13-year-olds will
get the first item correct and go on to obtain the required basal series of 10 in a row
correct. For example, on the SIT-R, the examiner would begin administration to an
8-year-old with item 55, then continue until the basal series has been obtained. If the
student gets item 55 correct and responds correctly to items 56-64, the basal series
requirement has been met, and administration of the test items continues.
Different tests vary as to the proper procedure to follow if one of the items is
missed during the attempt to establish the basal series. Some require the examiner to
stop forward administration and administer the items in reverse order until the basal
has been established. As an example, imagine that an 8-year-old student being
administered the SIT-R answers items 55-60 correctly, then misses item 61. Because
the SIT-R requires a basal of 10 items and the student has only 6 in a row correct,
the professional counselor is required to return to item 54 and administer the items
in reverse order until the student has 10 consecutive items correct (i.e., after also passing items 54, 53, 52, and 51).
At that point, having established the required basal series, the professional counselor
returns to item 62 and administers the remaining items until a ceiling is reached.
This example is provided in Figure 1.3.
A ceiling series is the number of incorrect items an examiner must obtain before
test administration can be halted. The concept of a ceiling is based on the same
30 Chapter 1
INDIVIDUAL TEST FORM
SIT-R
SLOSSON INTELLIGENCE TEST
Richard L Slosson
Revised by: Charles L. Nicholson, Terry L. Hibpshman
Nam* (John 5oy\
0~acK
LAST
FIRST
MIDDLE
Figure 1.3 Protocol for Slosson Intelligence Test-Revised
(Scanned scoring form, summarized. The form records identifying information for the
examinee and examiner, comments, and the following test results: chronological age (CA)
in years and months, raw score, Total Standard Score (TSS), Mean Age Equivalent (MAE),
T-score, Normal Curve Equivalent (NCE), stanine category, percentile rank (PR), and a
95% or 99% confidence interval. A scoring grid for items 1-187 is followed by blanks for
the basal item, the questions passed after the basal item, the raw score (the total of
those two entries), and the ceiling item.)
Directions printed on the form: Mark the questions with a (1) for passing or a (0) for
failing. Begin testing where examinee can pass "10 in a row" without making a mistake.
Continue testing until examinee misses "10 in a row." Refer to Manual for more complete
directions.
Figure 1.3 continued. Sample items: #5, #75, #17, and #23 (Drawing Apples); #14 "Which of
these squares is smaller?"; #109 Illustrate latitude and longitude.
Source: Copyright 1991, Slosson Educational Publications, Inc. All rights reserved. Reprinted with permission from Slosson Educational
Publications, Inc. No further reproduction authorized without permission from Slosson Educational Publications, Inc.
assumptions as those underlying a basal — that is, because the items continue on in
order of increasing difficulty, if a student misses item 60, there is a statistical likeli-
hood that the individual would miss items 61 and above because these items are even
more difficult. As with the basal series, the accuracy of that assumption is bolstered
when the test developer specifies a certain number of items in a row that must be
missed before administration can stop. The WJ-III ACH generally specifies a ceiling
of 6 consecutive incorrect items to cease administration of a subtest. The SIT-R specifies
a ceiling of 10 errors in a row. Continuing with our SIT-R example (see earlier discussion
and Figure 1.3), suppose the student responds correctly to items 62 and 63, misses
64, gets 65 correct, but then misses items 66-75. Missing the last 10 items in a row
fulfills the requirement of the ceiling series, so administration of the test ceases, and
the student is given no points for any items above the ceiling series. The professional
counselor can then complete the scoring of the SIT-R protocol and transform the
raw score into standard scores and percentile ranks for interpretation. Note that the
assumption is that missing 10 items in a row means the student is very unlikely to
get any of the even more difficult items correct. Again, this assumption is almost al-
ways valid, but there is a negligible statistical probability that some examinees may
get one or more additional items correct should the administration of items con-
tinue. Just as before, denying an examinee 1 or 2 additional raw score points proba-
bly will not substantially suppress an examinee's overall score.
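The ceiling rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the publisher's scoring procedure: the function names and data layout are hypothetical, and the basal item of 61 is an assumed value chosen so the example is complete. The 10-in-a-row rule, the item responses, and the raw score arithmetic (basal item plus questions passed after the basal item) come from the text and the SIT-R protocol form.

```python
def find_ceiling(responses, run_length=10):
    """Return the item number that completes the ceiling series, or None.

    responses: list of (item_number, passed) pairs in administration order.
    """
    consecutive_misses = 0
    for item_number, passed in responses:
        consecutive_misses = 0 if passed else consecutive_misses + 1
        if consecutive_misses == run_length:
            return item_number  # administration stops here
    return None  # ceiling not yet reached; keep testing


def raw_score(basal_item, responses):
    """Basal item number plus the count of items passed after the basal."""
    return basal_item + sum(1 for _, passed in responses if passed)


# Responses from the text: 62 and 63 correct, 64 missed, 65 correct, 66-75 missed.
responses = [(62, True), (63, True), (64, False), (65, True)]
responses += [(item, False) for item in range(66, 76)]

ceiling_item = find_ceiling(responses)
score = raw_score(basal_item=61, responses=responses)  # basal of 61 is assumed
print(ceiling_item, score)
```

Resetting the miss counter after every pass is what makes the rule "10 in a row" rather than "10 in total"; the single miss on item 64 does not count toward the ceiling.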
One can easily see the timesaving efficiency and benefits of using basal and ceil-
ing series. In our SIT-R example, the professional counselor administered only items
46-69. This means that a student's score was determined by administering only 24
of the 187 total items (about 13%). Basal and ceiling series can thus be tremendously
time saving without compromising on accuracy and meaningfulness. In addition,
these procedures save clients and students from having to endure the tedious admin-
istration of numerous items that are far too simple, and the emotional frustration of
having to deal with numerous items that are far too difficult.
Reliability
Reliability is discussed in detail in Chapter 3. For now, it is important to know that
reliability means consistency. If a client receives an IQ score of 70 one day and 130
the next, what helpful decision could a professional counselor make about a client's
life? If a client's score cannot be consistently measured, it is of little use.
Reliability of scores can be determined through a variety of means, each of
which assesses a different type of score error. For example, test-retest reliability in-
volves determining the relationship (correlation) between scores on the administra-
tion of the same test to the same participants on two different occasions (e.g., one
hour, two weeks, one month, one year apart). The resulting coefficient is a measure
of the test scores' stability over time — essential information when trying to consis-
tently track a client's or student's performance or response to treatment over a given
period of time.
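As a rough illustration of test-retest reliability, the correlation between two administrations can be computed directly. The sketch below uses the ordinary Pearson correlation; the function name and the scores for five clients are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)


# Hypothetical scores for the same five clients tested two weeks apart.
first_administration = [98, 105, 87, 112, 93]
second_administration = [101, 103, 90, 115, 91]

stability = pearson_r(first_administration, second_administration)
print(round(stability, 2))
```

A coefficient near 1.0 indicates that clients kept nearly the same rank order across the two occasions; a coefficient near 0 indicates that the scores were not stable over the interval.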
It is important to understand from the start that no test is reliable or unreliable.
It is test scores that possess the characteristic of consistency. Most importantly, the
reliability of test scores varies across samples of participants. For example, it is likely
that the reliability coefficients derived from scores on a substance abuse inventory
will vary substantially depending on whether clients who abuse substances or clients
who do not abuse substances are used in the sample.
Validity
Validity means usefulness. Validity of scores can be determined through a variety of
means, each of which provides evidence of a different type of usefulness. Content-
related validity is a systematic examination of the items making up a test to deter-
mine the comprehensiveness of content coverage. This type of validity is particu-
larly relevant for academic achievement tests because academic areas generally have
a well-established domain of behavior, and sampling is critical to deriving useful
generalizations from any derived score. For example, if a mathematics test is com-
posed only of addition problems, the scores may be valid indicators of a person's
addition skills but may be substantially less useful in predicting a person's overall
mathematical abilities.
Criterion-related validity involves a test's ability to predict some criterion, either
at the present time (criterion concurrent) or at some point in the future (criterion
predictive). Many criteria are commonly used for comparison, and score validity is
generally expressed as a specialized correlation coefficient known as a validity coefficient.
If one is attempting to validate scores from a new anxiety inventory, one may choose
criteria such as previously existing anxiety scales, behavioral observations, or diag-
nostic categorizations (i.e., previously diagnosed or currently diagnosable).
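When the criterion is a diagnostic categorization (diagnosed or not diagnosed), one common way to express a validity coefficient is the point-biserial correlation, which is the Pearson correlation applied to a two-category criterion. The text does not name this statistic; it is offered here only as one plausible illustration, and the inventory scores and diagnoses below are hypothetical.

```python
from statistics import mean, pstdev


def point_biserial(scores, in_criterion_group):
    """Correlate continuous scores with a 0/1 criterion such as a diagnosis."""
    group1 = [s for s, g in zip(scores, in_criterion_group) if g]
    group0 = [s for s, g in zip(scores, in_criterion_group) if not g]
    if not group1 or not group0:
        raise ValueError("both criterion groups need at least one score")
    n = len(scores)
    spread = pstdev(scores)  # population standard deviation of all scores
    proportion_term = (len(group1) * len(group0) / n**2) ** 0.5
    return (mean(group1) - mean(group0)) / spread * proportion_term


# Hypothetical scores on a new anxiety inventory; 1 = previously diagnosed.
anxiety_scores = [32, 28, 35, 30, 12, 15, 18, 10]
diagnosed = [1, 1, 1, 1, 0, 0, 0, 0]

validity_coefficient = point_biserial(anxiety_scores, diagnosed)
print(round(validity_coefficient, 2))
```

The larger the gap between the two group means relative to the overall spread of scores, the higher the coefficient, which is the sense in which the scores "predict" the concurrent criterion.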
Construct validity helps determine what a test measures (the idea or construct)
and how well it measures it. A construct is a relatively abstract idea that cannot be
measured directly, but which can be inferred. Intelligence, depression, introversion,
self-esteem, and locus of control are all examples of constructs. Constructs can be
validated through a variety of methods, including factor analysis, correlations with
other tests, and convergent or discriminant techniques. Chapter 4 covers in detail
each of these classical methods for determining validity of scores, as well as decision-
making strategies using these scores.
As with the concept of reliability, no test is valid or invalid. It is test scores that
possess the characteristic of usefulness, and the validity of test scores varies across
samples of participants and according to the various purposes that a test is intended
to address. For example, it is likely that the validity of scores on a measure of self-esteem
will vary substantially depending on the characteristics of the clients being assessed,
such as when a more homogeneous sample of adolescent Hispanic females with
eating disorders is studied, as opposed to a more heterogeneous sample of culturally
diverse males and females without diagnosable pathology. Likewise, that same self-
esteem scale may provide excellent, accurate predictions of academic self-esteem, but
only moderately accurate predictions of academic performance (i.e., grades, test
scores) and poor predictions of a client's degree of depressive symptoms. Different
tests are designed for different purposes and for use with different populations.
Validity is the study of these uses and populations.
Formative Versus Summative Evaluation
Tests are often used to evaluate curricular and treatment programs. When a test is
administered during the course of treatment or instruction with the purpose of in-
forming the evaluator as to the intervention's effectiveness, it is called a formative
evaluation. Such a practice allows for midcourse adjustments and modifications
to more effectively meet the final goal or objective. If an assessment is adminis-
tered on completion of instruction or treatment, it is referred to as a summative
evaluation. The purpose of summative evaluation is to determine whether a goal
or objective has been met. How effective the assessment is in making this determi-
nation depends on the preciseness of the goal or objective, the alignment of the
treatment with the goal, and the alignment of the assessment with both the goal
and the treatment. For example, many counseling programs administer the
Counselor Preparation Comprehensive Examination (CPCE) (administered by the
National Board of Certified Counselors [NBCC]) near the end of the program of
study as a summative evaluation. The test is predicated on core educational areas
(assessment, the topic of this text, is one of the more challenging core areas) and
well-defined educational standards. The test is composed of items selected to ac-
curately reflect the various domains of knowledge and the importance of each do-
main. Thus the CPCE is very well aligned with the standards it was designed to
measure. Professional counselors participate in a graduate counseling program of
study that prepares them for professional practice. Success on the exam is very
much related to how closely the graduate program curriculum aligns with the test's
standards, the skill of the instructors, factors listed in Table 7.1 (factors that affect
student or client test performance), and the tenacity with which the students pur-
sue and master the course contents. Stated another way, success on the CPCE, used
as a summative evaluation, is enhanced by well-designed programs with good in-
struction and, even more importantly, motivated, competent students. Study hard!
Paper-and-Pencil Tests and Performance (Authentic) Assessments
A paper-and-pencil test requires examinees to mark an answer choice, either through
the historically literal practice of using a pencil or through more recent computer-
based innovations such as clicking the correct answer displayed on a computer
screen. These tasks frequently rely heavily on verbal capabilities because they require
reading and verbal comprehension.
Performance assessments, sometimes called authentic or alternative assessments,
minimize verbal task demands but require the student or client to manipulate ma-
terials or to select visual stimuli without using language, or at least by substantially
minimizing the use of language. There is a big difference between completing a
multiple-choice test on how to rebuild car engines (a paper-and-pencil test) and
actually rebuilding a car engine (a performance test). Performance assessment in-
volves the evaluation of an examinee's product, action, or behavior. A strength of
performance assessment is that it allows the individual to demonstrate a more com-
prehensive, real-life, hands-on understanding of a topic or dilemma. Performance
assessments have been used for years in vocational training and gifted education
programs, not to mention physical education, woodshop, metal shop, and home
economics classes. Some state departments of education have implemented high-
stakes performance assessment systems to assess students' depth of understanding
by presenting them with a dilemma to be solved and the materials and time to solve
it. Such procedures are expensive and time consuming but allow examiners to de-
termine whether students develop necessary insights and follow desired procedures
en route to solving complex problems. Performance assessment is sometimes done
with less of an emphasis on reading and writing, thus minimizing the effects of
verbal and linguistic capabilities. But this is not always the case. Some states use
the manipulation of physical objects and props to solve a problem and then require
the student to write a summary composition describing the various components
of the performance task.
Professional training programs frequently use performance assessments. For ex-
ample, counselors-in-training frequently present videotapes of counseling sessions
for analysis and evaluation, and interns and practicum students are sometimes ob-
served and evaluated in live counseling, consultation, or classroom sessions.
Instructors or supervisors then observe the demonstrations, evaluate and judge each
performance according to some scoring scheme (usually involving a scoring rubric),
and provide feedback regarding the student's or intern's performance. A scoring
rubric provides the rules to be followed when assessing the quality of a performance.
Generally, the rubric is a rating scale or checklist of essential elements that must be
included in the product. Point values are assigned according to the quality of each
component.
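A scoring rubric of the kind described above can be represented as a checklist of essential elements, each with a maximum point value. The sketch below is a minimal illustration; the element names, the 0-4 scale, and the function name are hypothetical and are not drawn from any particular training program.

```python
# Hypothetical rubric for a counseling-session videotape. Each essential
# element is rated for quality from 0 up to its maximum point value.
RUBRIC = {
    "establishes rapport": 4,
    "uses open-ended questions": 4,
    "reflects client feelings": 4,
    "summarizes the session": 4,
}


def score_performance(ratings, rubric=RUBRIC):
    """Return (points earned, points possible) for one rated performance."""
    for element, maximum in rubric.items():
        if element not in ratings:
            raise ValueError(f"no rating given for: {element}")
        if not 0 <= ratings[element] <= maximum:
            raise ValueError(f"rating out of range for: {element}")
    earned = sum(ratings[element] for element in rubric)
    return earned, sum(rubric.values())


supervisor_ratings = {
    "establishes rapport": 4,
    "uses open-ended questions": 3,
    "reflects client feelings": 2,
    "summarizes the session": 3,
}
print(score_performance(supervisor_ratings))
```

Requiring a rating for every element, and rejecting ratings outside the defined range, reflects the point that the rubric's rules are fixed before any performance is judged.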
Popham (1999) indicated that three components must underlie authentic per-
formance assessments: (1) Multiple evaluative criteria must be used; (2) each of the
evaluative criteria must be clearly articulated and defined prior to judging the per-
formance; and (3) human judgments are necessary to determine the acceptability of
performance responses. It is this final component that critics of performance assess-
ment take issue with. The acknowledged weakness of performance assessment is the
difficulty of establishing the reliability and validity of scores — which are critical re-
gardless of the type of assessment undertaken. Because performance assessment is
time consuming, it may be possible to complete only one or several problems (i.e.,
authentic science problems to be solved, perhaps even a single "experiment") over
the course of a two-hour examination, whereas a student may be able to complete
more than 100 multiple-choice problems during the same period. An important sta-
tistical concept within test development is that, all else being equal, the more items
a test possesses, the more reliable the scores on that test (Anastasi & Urbina, 1997).
Because human judgment (i.e., subjectivity) is required in performance assessment,
interscorer reliability becomes an important issue. In nearly all circumstances, the
multiple-choice test will be more reliable than the performance test, and test scores
can be no more valid (useful) than they are reliable (consistent). Thus there is a trade-off
in using paper-and-pencil and performance tests. Paper-and-pencil tests may be
more efficient and psychometrically superior (i.e., have a higher reliability of scores),
but performance assessments may get closer to the real-life circumstances for which
a student is being prepared. These dilemmas are explored in detail in the chapter on
high-stakes testing, which is available on the companion website for this text.
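The relationship between test length and reliability noted above is commonly quantified with the Spearman-Brown prophecy formula, a standard psychometric result that the text does not name. The sketch below assumes the added items are comparable in quality to the existing ones; the starting reliability of .60 and the item counts are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    reliability: reliability of scores on the current test (0 to 1).
    length_factor: new length divided by current length (e.g., 10 for 10x).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical case: a 10-item test with score reliability of .60 is
# lengthened to 100 comparable items (a length factor of 10).
predicted = spearman_brown(0.60, 10)
print(round(predicted, 2))
```

Read in the other direction (a length factor below 1), the same formula suggests why an examination limited to one or a few performance tasks tends to yield less reliable scores than a 100-item multiple-choice test.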
Practically speaking, as the owner of a car with a blown engine, who would
you rather have working on your car: the mechanic who got more multiple-choice
questions right or the one who rebuilt the engine in the quicker, more proficient
manner? Perhaps a bit closer to home, who might a client prefer as a professional
counselor: the one who received the higher score on the National Counselor
Examination (NCE) or the one who performed better on the videotapes? If you
said, "The one who did better (or well) on both," you can count yourself among
a growing segment of professionals who see the benefits of both approaches.
Breadth and depth are both critical elements of comprehensive assessment.
Portfolio Assessment
Portfolio assessment is a specific, and currently popular, type of performance assess-
ment espoused by proponents of the philosophy that instruction and assessment are
one and the same. A portfolio is a systematic and well-organized collection of work
produced by an individual with the purpose of demonstrating that individual's skills
and achievements. Portfolios have been used in the professions of art, architecture,
modeling, journalism, and photography for years. In these professions, the individual
selects exemplary works that demonstrate competence, style, talent, and versatility. In
many counseling programs, counselors-in-training are required to develop a portfolio
of exemplary works (e.g., counseling tapes or analyses, course papers or projects,
events or lessons implemented, ancillary products developed). Portfolios are a wonder-
ful way for interns to demonstrate for program faculty members the depth of their
learning and understanding, and for potential employers the likely quality one could
expect of the applicant if hired as an employee. However, portfolio assessment presents
examiners with a couple of challenging problems: How does one go about evaluating
the quality of a portfolio? Will the assessment lead to reliable and valid results?
By now, this problem should sound familiar, and the reader should have some
ideas as to how to solve the dilemma. Because portfolio assessment is a type of per-
formance assessment, rubrics and other issues discussed in the performance assess-
ment section also apply here. What is critical is that evaluators of portfolios acknowl-
edge that the assessment system devised must conform to the highest level of
technical adequacy possible. If it does not, students and evaluators will waste much
time and effort on an assessment process that is difficult (perhaps impossible) to eval-
uate. Such an assessment system could be perceived as burdensome, worthless, un-
fair, and even biased.
It is widely agreed that assessment of portfolios should involve both a self-assess-
ment and an external assessment (Farr & Tone, 1994; Popham, 1999). In a self-as-
sessment, the student provides evaluative commentary of the included works and how
each meets certain requisite standards or demonstrates required mastery. The encour-
agement of self-evaluation is an important developmental skill in its own right and
is a strength of the portfolio process. External assessment involves the process of ob-
taining judgments from professionals not related to the situation in which the works
were created, but in a good position to evaluate those works. For example, in the case
of a counselor-in-training's portfolio, it is likely that program faculty would
be somewhat biased in their evaluation of student works. Indeed, studies have shown
that teachers tend to be biased toward their own students' work (Popham, 1999).
Thus it would be best to solicit volunteers from the professional community unrelated
to the program or students.
Table 1.2 Advantages and disadvantages of portfolio
(and performance) assessment
Advantages
1. Focuses on "doing."
2. Allows for demonstration of examinee strengths, flexibility, and adaptability.
3. Highlights improvements rather than comparisons.
4. Focuses on processes and products.
5. Provides self-assessment and analysis.
6. Assesses depth of understanding and application of instruction.
7. Integrates knowledge, skills, and abilities.
8. Allows diagnosis of strengths and weaknesses.
9. Provides concrete examples of application of skills.
10. Facilitates performance-based instruction.
Disadvantages
1. Evaluation process is time-intensive for students and evaluators.
2. Useful and accurate rubrics are difficult to create.
3. Interscorer reliability is low.
4. Judges require a lot of training.
5. Stakeholders often have difficulty understanding the results.
6. Performance tasks must be well crafted and meaningful.
7. Performance on one task is often unrelated to performance on other tasks.
8. Students are frequently unsure which products to include and why.
9. Performance tasks are difficult and frustrating for low-ability students.
10. Some cultural or socioeconomic groups may underperform on certain types of performance
tasks (i.e., bias).
Rubrics established for portfolio assessment must be specifically written and dis-
tributed to students well in advance so they can prepare showcase or best-work port-
folios that will address the portfolio standards. Alternatively, students can be encour-
aged to develop portfolios that demonstrate growth and learning over time.
Unfortunately, compared to most other types of assessments, portfolio assessment
tends to be time consuming, expensive, and lacking in technical rigor (i.e., reliabil-
ity and validity). All in all, the portfolio assessment process presents numerous dif-
ficult challenges, and "to date, the results of efforts to employ portfolios for account-
ability purposes have not been encouraging" (Popham, 1999). Table 1.2 presents a
number of advantages and disadvantages of portfolio assessment.
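Because low interscorer reliability is one of the listed disadvantages, it may help to see how agreement between two portfolio evaluators could be checked. The sketch below uses Cohen's kappa, a widely used index of agreement corrected for chance; the text does not prescribe this statistic, and the ratings (1 = below standard, 2 = meets standard, 3 = exceeds standard) are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten portfolios by two external evaluators.
evaluator_1 = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
evaluator_2 = [3, 2, 2, 1, 2, 3, 3, 1, 3, 2]

print(round(cohens_kappa(evaluator_1, evaluator_2), 2))
```

Simple percent agreement (8 of 10 here) overstates consistency because some agreement occurs by chance; kappa adjusts for that, which is why it is often reported when rubrics are being evaluated.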
Think About It 1.2 Imagine that you are preparing for an employment
interview. What kinds of "products" from your courses and clinical experi-
ences would you include in your portfolio to demonstrate your effectiveness
as a professional counselor?
Environmental Assessment
Environmental assessment moves the focus of assessment and evaluation from the
individual to the environment in which the individual functions. In workplaces, re-
lationships, and other social situations, clients often complain that they "don't fit in."
Normally, the focus of counseling is on how clients can change to better adapt to
their environment and circumstances. But what if the environment could be altered,
or changed altogether? For example, clients with an alcohol dependency may bene-
fit from an analysis of the "who," "where," and "what" related to their social activi-
ties. Such an analysis may point to factors that are actually barriers to recovery and
abstinence. In schools, the contingencies in a classroom may become the primary
focus of a behavioral observation and assessment to determine what environmental
factors may account for the difficulties a student may be encountering. In families,
professional counselors are keenly interested in the family environment so that sys-
temic changes can be made to get a family moving in a more positive direction.
As a more specific example from the career realm, employees sometimes com-
plain about workplace conditions, stress, and burnout but continue to work in such
environments for years and years. Some researchers (e.g., Holland, Gottfredson) are
addressing this problem by designing measures that assess the environmental context
and the individual, using Holland's model, featured in the Self-Directed Search (SDS)
(Holland, Fritzsche & Powell, 1994), to determine whether the client's interests and
competencies actually match the demands of the work environment. In a simplistic
extension, individuals who need to be physically active but who have a job that re-
quires a lot of desk work may experience a "disconnect" and unhappiness. Likewise, a
"people person" may be unhappy slogging away in a cubicle all day, or at least trying
to survive until lunch or quitting time. In both of these circumstances, altering or
changing the environment or work tasks within an environment may be the solution
to client concerns. In some form or another, environmental assessment has been
around for decades, but it appears currently to be experiencing a resurgence.
Computer-Managed, Assisted, and Adapted Assessment
Computers are in the process of revolutionizing psychological and educational as-
sessment. As far back as the 1930s, technological innovations have helped make the
process of assessment more efficient and accurate, usually by speeding up scoring
procedures for large-scale test administrations. With the widespread availability of
personal computers since the 1970s, test publishers have actively pursued the pro-
duction of computer software that allows a clinician's computer to administer, score,
and even interpret a client's protocol in the comfort and convenience of the clini-
cian's own office. With easy access to the Internet through home and office com-
puters and public venues such as libraries, the possibilities for computer-assisted
assessment have become nearly unlimited. Of course, these wonderful access oppor-
tunities have arrived with a plethora of ethical and legal dilemmas.
Computer-managed assessment, also known as computer-assisted assessment, in-
volves the harnessing of computers to administer, score, and interpret tests. Some
software packages or Internet sites allow an integration of all three of these functions,
while others may allow only one or two. Integrated functions are becoming more
and more the standard. Today, computers can even store and accumulate test results
for a single client or an entire school system in order to manage and compile sum-
mary reports. The implications of such computer-managed systems are phenomenal
because such databases can facilitate everything from individual treatment plans to
outcome assessment of an agency's clientele or the evaluation of an entire school sys-
tem's curriculum.
Many historically paper-and-pencil tests are now available in online or per-
sonal computer versions. Using individualized computer-assisted assessment, the
student or client generally completes the assessment at a personal computer on
which specialized software has been installed, or on a computer linked to an
Internet website offering the service. Responses are easily made by clicking the
mouse on an appropriate answer space, using a touch screen, or typing a response.
Frequently, tests can be automatically scored and an interpretive report printed out
within seconds after completing the test. The comprehensiveness and quality of
these reports vary substantially. For example, the computerized packages offered
by Pearson to administer, score, and interpret the MCMI-III (Millon, Davis, &
Millon, 1997) and the MMPI-2 (Butcher et al., 1992) include comprehensive, de-
tailed narratives of likely examinee characteristics and behaviors as well as diagnos-
tic and treatment implications — for about $20 a client. On the other hand, the
WISC-IV and WJ-III ACH computer scoring programs, which come with the stan-
dard test kit package, provide only basic scoring and storage functions. When con-
sidering the purchase of assessment software or other scoring services, it is a good
idea to ask the publisher for samples of reports to determine whether they will meet
one's professional needs — at a reasonable price.
There are several advantages to computer-assisted assessment. Depending on the
program, cost savings can be substantial, particularly given the speed of scoring and
interpretive reports. Some tests and inventories may require hours to comprehen-
sively score and interpret, whereas the computer program for the same test or inven-
tory may require only seconds. Clients and students also have much greater control
over the rate of response and interaction; thus individuals who desire a quicker or
slower pace are accommodated by the computer. In addition, clients with special
needs can sometimes be better accommodated by computer administration. Clients
with visual handicaps or reading problems and who need to have items read orally
may find auditory computer administration more user-friendly. Clients with writ-
ing disabilities may find the mouse or keyboard easier to manage than a pen or pen-
cil. Students with visual processing disorders may find the auditory instruction ca-
pabilities and larger graphical displays of computers easier to adjust to, as opposed
to a bubble form that may look like a jumbled mess. Clients with attentional prob-
lems may find the computer administration more engaging than a response booklet.
The possibilities for accommodating clients with disabilities are substantial.
Computer-adapted assessment involves an interactive process between the ex-
aminee and the computerized assessment device. Computer-adapted assessment
usually entails varying administration formats depending on the responses of the
examinee to previous questions. For example, when taking the computer-adapted
version of the Graduate Record Exam (GRE) administered by the Educational
Testing Service (ETS), two examinees may be administered very different item sets
depending on their abilities. On the paper-and-pencil administration of the GRE,
all examinees respond to (virtually) the same set of questions, whether the student
is of high or low ability. This leads to high-ability students answering questions
that are mostly far too easy and lower-ability students answering questions that are
mostly far too difficult. Computer-adapted testing solves this dilemma by estab-
lishing a bank of items for which the item difficulties and other technical item
characteristics are already known. An examinee with strong ability will be admin-
istered an item of moderate difficulty, respond to it correctly, and receive an even
harder question. The computer automatically scores the item, tracks performance,
and is programmed to administer subsequent items until a very good estimate of
the student's performance is obtained. Generally, students who respond correctly
to an item continue to receive more and more difficult items until a plateau in per-
formance occurs. At this point, the administration stops, and a final score is deter-
mined. Note that a higher-performing student never receives the easier items in
the item bank, but continues to be administered items of ability-appropriate dif-
ficulty. For a student with lower ability, an incorrect response to the first moder-
ately difficult item will be followed by an easier item. If this second item is missed,
an easier item follows; if the response to this item is correct, a more difficult item
follows. The process continues again until a plateau in performance is reached.
Note that the lower-performing student is never administered the more difficult
items that a higher-performing student receives, but continues to receive the more
appropriate, less difficult items.
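The item-selection logic described above can be sketched as a simple "staircase": move to a harder item after a correct response and an easier item after an incorrect one, and stop once performance plateaus. This is a deliberately simplified illustration, not how the GRE or any operational test is programmed; real computer-adapted tests rely on statistical item models. The function name, the 20 difficulty levels, the starting level of 10, and the plateau rule (four changes of direction) are all assumptions.

```python
def adaptive_test(answers_correctly, n_levels=20, start_level=10,
                  max_reversals=4, max_items=30):
    """Administer one item at a time, adapting difficulty to each response.

    answers_correctly: callable taking a difficulty level, returning bool.
    Returns (estimated_level, list_of_levels_administered).
    """
    level = start_level
    administered = []
    reversal_levels = []  # levels at which the direction of movement changed
    last_direction = 0
    while len(administered) < max_items and len(reversal_levels) < max_reversals:
        administered.append(level)
        direction = 1 if answers_correctly(level) else -1
        if last_direction and direction != last_direction:
            reversal_levels.append(level)  # performance is starting to plateau
        last_direction = direction
        level = min(max(level + direction, 1), n_levels)
    if reversal_levels:
        estimate = sum(reversal_levels) / len(reversal_levels)
    else:
        estimate = administered[-1]
    return estimate, administered


# Two simulated examinees who pass every item at or below their ability level.
high_estimate, high_items = adaptive_test(lambda level: level <= 15)
low_estimate, low_items = adaptive_test(lambda level: level <= 5)
print(high_estimate, high_items)
print(low_estimate, low_items)
```

In this simulation the higher-ability examinee never sees an item easier than the starting item, and the lower-ability examinee never sees one harder than it, which mirrors the two "Note that" observations in the text.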
Many aptitude and achievement tests now offer a computer-adapted administration
format. Generally, examinees complete computer-adapted assessments in less
time than paper-and-pencil administrations, and the results are available instanta-
neously, rather than in the typical weeks-to-months wait time for mail-in scoring
services. It is likely that computer-adaptive testing will eventually be used in other
areas of assessment, particularly clinical, personality, and career assessment. For ex-
ample, self-report during a computerized structured clinical interview protocol could
allow clients to respond negatively to essential features of a major diagnostic cate-
gory (e.g., "depressed mood or loss of interest or pleasure in normal activities") and
subsequently skip all associated structured interview questions related to a disorder
that is not applicable. The elimination of inappropriate items could yield a large time
savings.
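The skip logic described above amounts to a gate question for each diagnostic category: follow-up questions are asked only if the essential feature is endorsed. The sketch below illustrates the idea; the module names, follow-up wording, and function names are hypothetical and are not taken from any published structured interview.

```python
# Hypothetical interview modules; the follow-up wording is illustrative only.
MODULES = [
    {
        "name": "major depression",
        "gate": "Depressed mood or loss of interest or pleasure in normal activities?",
        "follow_ups": [
            "Significant change in appetite or weight?",
            "Sleep disturbance nearly every day?",
            "Fatigue or loss of energy?",
        ],
    },
    {
        "name": "panic",
        "gate": "Sudden episodes of intense fear or discomfort?",
        "follow_ups": [
            "Pounding heart during these episodes?",
            "Worry about having another episode?",
        ],
    },
]


def run_interview(modules, answer):
    """Ask each gate question; ask follow-ups only when the gate is endorsed.

    answer: callable taking a question string and returning True or False.
    Returns a dict mapping each administered question to the response.
    """
    responses = {}
    for module in modules:
        responses[module["gate"]] = answer(module["gate"])
        if not responses[module["gate"]]:
            continue  # essential feature denied: skip the rest of this module
        for question in module["follow_ups"]:
            responses[question] = answer(question)
    return responses


def simulated_client(question):
    """A client who denies the depression gate but endorses everything else."""
    return not question.startswith("Depressed mood")


responses = run_interview(MODULES, simulated_client)
print(len(responses), "of 7 questions administered")
```

Here the simulated client answers four of the seven possible questions, since the three depression follow-ups are never presented.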
The advantages of using computers for assessment are many. Computers are not
prone to bias; they do not discriminate on the basis of sex, race, ethnicity, sexual ori-
entation, and so forth, as some clinicians may. It is also far easier to revise adminis-
tration, scoring, and interpretive procedures when an examination is online, because
the changes are instantaneous. These features offer a real advantage over paper-and-
pencil administration, in which some professionals may continue to use older ver-
sions of a revised test simply because they have a stockpile of the older protocols they
wish to use up, for economy's sake, before ordering newer materials. In addition,
there is some evidence that clients may self-disclose sensitive information more hon-
estly during computer administration, because of greater perceived anonymity, than
during the face-to-face disclosures that occur during a typical interview (Davis,
1999; Joinson & Buchanan, 2001). Perhaps the most overlooked advantage of on-
line assessment is the potential for access to quality services by professional coun-
selors, clients, and students who are geographically isolated or in some other way un-
able to participate in more mainstream mental health services.
Important disadvantages of using computers for assessments relate to observa-
tion and comfort issues. Generally, when computerized assessments are used, the
professional counselor is occupied elsewhere and not focused on observing the client
or student engaged in the assessment process. Much helpful information can be lost
when a client's assessment-related behaviors go unnoticed. To compound this issue,
computer-generated interpretive reports are often accepted at face value by clinicians
and imported wholeheartedly into reports and summaries. As is mentioned in the
ethics discussion in Chapter 2, computerized interpretive reports are considered pro-
fessional-to-professional consultations, and the burden of what to report and what
not to report lies with the professional counselor charged with the care of the client.
Computer-generated reports are meant to supplement a clinician's interpretation,
not supplant it. Sampson, Purgar, and Shy (2003, p. 27) suggested that professional
counselors should have, at a minimum, the following competencies to use computer-
based test interpretation (CBTI) information effectively:
1. An understanding of the construct or behavior
2. An understanding of the test, including the theoretical basis (if any), item selec-
tion and scale construction, standardization, reliability, validity, and utility
3. An understanding of the test interpretation, including scale interpretations and
recommended interventions based on scale scores
4. An understanding of the CBTI, including the equivalence of test forms (if inter-
pretations from an original form are used) and the evidence of CBTI validity
5. Initial supervised experience in using the test and CBTI (with supervision pro-
vided by an appropriately qualified practitioner)
Although computers are becoming more commonly used, tremendous diversity in
use currently exists. People vary in their experience with and attitudes toward
computers. Most people appear favorably disposed toward computers, but group and
individual differences have been noted. For example, Barak (2003) observed that
uneasiness with technology led to lower performance among women in online
assessments. Although this empirical result has not been consistently verified,
professional counselors are well advised to ensure that computerized assessment
technologies do not place some groups at a performance disadvantage.
Of course, the proliferation of computer-based assessment services is not with-
out a cost. Frequently, online tests are not developed with the same attention to
technical rigor as the print versions of standardized tests, and information on the
reliability and validity of online scores is sometimes impossible to obtain. In addi-
tion, expert verification of rigor is more challenging because the testing experts
may need to be familiar with sophisticated computer programming language in
order to evaluate the interpretive procedures programmed into the software. The
security and confidentiality of online assessments continue to be of major concern
in the industry, although new encoding and encryption software shows promise in
resolving these issues. With paper-and-pencil tests, the responsibility for the secu-
rity of the tests and test results falls squarely on the professional counselor, who
can frequently secure the information under lock and key. The issue of security
and confidentiality becomes more complex when personal computers and Internet
providers are involved, and professional counselors must take great care to ensure
the security of the tests and the integrity of the assessment process. Finally, the
Standards for Educational and Psychological Testing (AERA/APA/NCME,
1999) specify that examinees offered a choice between computerized and paper-
and-pencil tests should be educated about the features, characteristics, and pros
and cons of each type of administration format.
Think About It 1.3 What are the ways you anticipate using computers
in your practice as a professional counselor? Think about and seek out the
type of training you will need.
SUMMARY/CONCLUSION
This chapter has addressed purposes, standards, and terminology related to profes-
sional use of assessment. Assessment has four purposes: screening, diagnosis, treat-
ment planning and goal identification, and progress evaluation. Each contributes
substantially to the overall counseling process. In addition, the counseling field has
a number of sources intended to guide assessment education and practice. The
Council for the Accreditation of Counseling and Related Educational Programs
(CACREP) has established curricular requirements, and a number of professional
organizations have developed standards to aid counselors in understanding good as-
sessment practices. Finally, professional counselors need to be familiar with the terms
and phrases essential to the field in order to communicate effectively with other pro-
fessionals, to advocate for clients and students, and to make decisions in their best
interests.
KEY TERMS
affective assessment
assessment
basal series
behavior
behavioral observations
ceiling series
cognitive ability test
computer-adapted assessment
computer-managed assessment
criterion-referenced test
diagnosis
environmental assessment
formative evaluation
group test
individual test
maximum performance measurement
nonstandardized test
nonverbal test
norm-referenced test
objective
objective test
performance assessment
portfolio assessment
power test
projective technique
psychological test
reliability
sampling
screening
speeded test
standardization
standardized test
starting point
subjective
subjective test
summative evaluation
test
typical performance measurement
validity
verbal test
CHAPTER
2
Foundations of Assessment:
Historical, Legal, Ethical,
and Diversity Perspectives
by Bradley T. Erford, Cheryl Moore-Thomas, and Lynn Linde
This chapter highlights the historical, legal, ethical, and diversity issues
important to a professional counselor's understanding of assessment. From ancient
times through the modern day, assessment has been important to humankind's
self-understanding and has served as a tool for both fairness and oppression,
however it was intended.
While many historical events were important to the evolution of assessment in gen-
eral, this chapter explores events relevant to assessment in the more specialized areas
of intelligence, achievement, career, and clinical and personality. Professional coun-
selors are engaged in a variety of ways in ensuring that clients and students receive
appropriate assessment in these areas. Therefore, a review of legal, ethical, and pro-
fessional standards regarding assessment, diversity factors affecting assessment, and
test bias is also provided. The chapter concludes with a discussion of strategies, coun-
seling interventions, and recommendations to ensure fair testing.
THE HISTORY OF ASSESSMENT
Throughout recorded history, people have attempted to measure and assess human
characteristics and traits. What follows is a brief exploration of these attempts over
more than the past three millennia, segmented into three historical periods: ancient
times, measurement in the laboratory, and modern clinical applications. A summary
timeline of historic events in the field of assessment is included in Table 2.1.
Table 2.1 Assessment timeline
500 BCE Greeks may have used assessments for educational purposes.
220 BCE Chinese set up civil service exams to select mandarins.
AD 1219 English university administers first oral examination.
ca. 1510 Fitzherbert proposes first measure of mental ability (identification of one's age, counting 20 pence).
1540 Jesuit universities administer first written exams.
1575 Spanish physician Huarte defines intelligence in Examen de Ingenios (independent judgment, meek
compliance when learning).
1599 Jesuits agree to rules for administering written exams.
1636 Oxford University requires oral exams for degree candidates.
1692 German philosopher Thomasius advocates for obtaining knowledge of the mind through objective,
quantitative methods.
1799 In working with the "Wild Boy of Aveyron," Itard differentiates between normal and abnormal cognitive
abilities.
1803 Oxford University introduces written exams.
1809 Gauss develops theory of observational error.
1834 Weber, pioneer in the study of individual differences, studies awareness thresholds.
1835 Quetelet develops and studies normal probability curves.
1837 Seguin develops the Seguin Form Board Test and opens school for mentally retarded children.
1838 Esquirol advocates differences between mental retardation and mental illness, proposes that mental
retardation has several levels of severity.
1869 Galton, founder of individual psychology, authors Hereditary Genius, sparking study of individual differences
and cognitive heritability.
1879 Wundt establishes world's first psychological laboratory at the University of Leipzig in Germany.
1888 J. M. Cattell establishes assessment laboratory at the University of Pennsylvania, stimulating the study of
mental measurements.
1890 Cattell coins the term mental test.
1897 Ebbinghaus develops and experiments with tests of sentence completion, short-term memory, and
arithmetic.
1904 Spearman espouses two-dimensional theory of intelligence (g = general factor, s = specific factors).
Pearson develops theory of correlation.
ca. 1905 E. L. Thorndike writes about test development principles and laws of learning and develops tests of
handwriting, spelling, arithmetic, and language. He later introduces one of first textbooks on the use of
measurement in education.
First standardized group tests of achievement published.
Jung's Word Association Test published.
1905 Binet and Simon introduce first "intelligence test," to screen French public schoolchildren for mental
retardation.
1909 Goddard translates Binet-Simon Scale into English.
1912 Stern introduces term mental quotient.
1916 Terman publishes the Stanford Revision and Extension of the Binet-Simon Intelligence Scale.
1917 Yerkes and colleagues from the APA publish the Army Alpha and Army Beta tests, designed for the
intellectual assessment and screening of U.S. military recruits.
1918 Otis publishes the Absolute Point Scale, a group intelligence test.
1919 Monroe and Buckingham publish the Illinois Examination, a group achievement test.
Woodworth Personal Data Sheet published.
1921 Rorschach publishes his inkblot technique.
1923 Kelly, Ruch, and Terman publish the Stanford Achievement Test.
Kohs Block Design Test measures nonverbal reasoning.
1924 Porteus publishes the Porteus Maze Test.
Seashore Measures of Musical Talents published.
Spearman publishes Factors in Intelligence.
1926 Goodenough publishes the Draw-a-Man Test.
1927 Spearman publishes The Abilities of Man: Their Nature and Measurement.
1928 Arthur publishes the Point Scale of Performance Tests.
1931 Stutsman publishes the Merrill-Palmer Scale of Mental Tests.
1933 Thurstone advocates that human abilities be approached using multiple-factor analysis.
Tiegs and Clark publish the Progressive Achievement Tests, later called the California Achievement Test.
Johnson develops a test scoring machine.
1935 Murray and Morgan develop the Thematic Apperception Test.
1936 Piaget publishes Origins of Intelligence.
Lindquist publishes the Iowa Every-Pupil Tests of Basic Skills, later renamed the Iowa Tests of Basic Skills.
Doll publishes the Vineland Social Maturity Scale.
1937 Terman and Merrill revise their earlier work (Terman, 1916) as the Stanford-Binet Intelligence Scale.
1938 Buros publishes first volume of the Mental Measurements Yearbook.
Bender publishes the Bender Visual-Motor Gestalt Test.
Gesell publishes the Gesell Maturity Scale.
1939 Wechsler introduces the Wechsler-Bellevue Intelligence Scale.
Original Kuder Preference Scale Record published.
1940 Hathaway and McKinley publish the Minnesota Multiphasic Personality Inventory (MMPI).
Psyche Cattell publishes the Cattell Infant Intelligence Scale.
1949 Wechsler publishes the Wechsler Intelligence Scale for Children (WISC).
Graduate Record Exam (GRE) published.
1955 Wechsler revises the Wechsler-Bellevue Intelligence Scale as the Wechsler Adult Intelligence Scale (WAIS).
1956 Bloom publishes Taxonomy of Educational Objectives.
Kuder Occupational Interest Survey published.
1957 Osgood designs the semantic differential scaling technique.
1959 Guilford proposes the structure of intellect model in his The Nature of Human Intelligence.
Dunns publish the Peabody Picture Vocabulary Test.
National Defense Education Act provides funding for career assessment screening and high school counselor
positions.
1960 Stanford-Binet Intelligence Scale revised.
1961 Kirk and McCarthy publish the Illinois Test of Psycholinguistic Ability.
1963 R. B. Cattell introduces theory of crystallized and fluid intelligence.
1965 Strong Vocational Interest Blank published.
1966 AERA, APA, and NCME publish the Standards for Educational and Psychological Testing.
1967 Wechsler publishes the Wechsler Preschool and Primary Scale of Intelligence (WPPSI).
1969 Bayley publishes the Bayley Scales of Infant Development.
National Assessment of Educational Progress program implemented.
Jensen publishes controversial How Much Can We Boost IQ and Scholastic Achievement?
1972 Form L-M (3rd ed.) of Stanford-Binet Intelligence Scale released.
McCarthy publishes McCarthy Scales of Children's Abilities.
1973 Marino publishes Sociometric Techniques.
1974 Wechsler Intelligence Scale for Children-Revised (WISC-R) published.
Congress passes the Family Educational Rights and Privacy Act (FERPA).
1975 Congress passes Public Law 94-142, the Education for All Handicapped Children Act.
Kuder's General Interest Survey, Form E published.
1977 System of Multicultural Pluralistic Assessment (SOMPA) published.
1979 Federal judge Robert F. Peckham rules in Larry P. v. Wilson Riles that intelligence tests are culturally biased
when used to determine African American children's eligibility for mental retardation services.
1979 Leiter International Performance Scale, a language-free test of nonverbal ability, published.
1980 In Parents in Action on Special Education v. Joseph P. Hannon, Illinois judge Grady concludes that intelligence
tests do not discriminate against African American children due to cultural or racial bias.
New York state legislators pass Truth in Testing Act.
1980s Volumes 1-7 of Test Critiques published.
High-speed computers begin to be used in large-scale testing programs.
Computer-adaptive and computer-assisted testing developed.
1981 Wechsler publishes the Wechsler Adult Intelligence Scale-Revised (WAIS-R).
1983 Kaufman publishes the Kaufman Assessment Battery for Children (K-ABC).
1984 U.S. Employment Service publishes the General Aptitude Test Battery.
1985 Sparrow, Balla, and Cicchetti revise the Vineland Adaptive Behavior Scales, originally published by Doll
(1936).
AERA, APA, and NCME revise the Standards for Educational and Psychological Testing.
1986 Stanford-Binet Intelligence Scale-Fourth Edition (SBIS-4) published, as revised by Thorndike, Hagen, and
Sattler.
1989 Minnesota Multiphasic Personality Inventory — Second Edition (MMPI-2) published.
Wechsler Preschool and Primary Scales of Intelligence revised.
1990s Authentic (performance) assessment and high-stakes testing rise to prominence.
Volumes 11-13 of Mental Measurements Yearbook published.
Volumes 8-10 of Test Critiques published.
1991 Wechsler Intelligence Scale for Children-Third Edition (WISC-III) published.
Kuder's Occupational Interest Survey, Form DD published.
1992 Wechsler Individual Achievement Test (WIAT) published.
1997 Wechsler Adult Intelligence Scale-Third Edition (WAIS-III) published.
1999 AERA, APA, and NCME publish Standards for Educational and Psychological Testing — Third Edition.
Volume 5 of Tests in Print published.
2000 Nader and Nairn publish The Reign of ETS.
2001 Mental Measurements Yearbook becomes available through an electronic retrieval system.
2002 Educational Testing Service revises its Scholastic Assessment Test (SAT).
Wechsler Preschool and Primary Scales of Intelligence-Third Edition (WPPSI-III) published.
2003 Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV) published.
Stanford-Binet Intelligence Scale — Fifth Edition (SB-5) published.
Ancient Times
Assessment has been used and documented in many civilizations throughout history.
As far back as 220 BCE, and continuing for more than 2,000 years, the Chinese had
an elaborate civil service examination system to select mandarins for public service
(Dubois, 1966, 1970). Every third year, candidates would gather to undergo tests of
skill in areas such as horsemanship, archery, and music. Essay tests were administered
to assess a candidate's writing skills.
Knowledge was assessed in such areas as military competence, civil law, geogra-
phy, and public and social ceremonies and rites. The Chinese strove to develop a fair
and objective system by eliminating systematic bias when observed. For example,
they used multiple judges to rate performance, rather than a single judge, and even
had scribes copy written work in a standard handwriting format to focus judges on
the ideas and content of a composition rather than on the differences in penmanship
between candidates (Thorndike, 1997). Even in the early years, they went to great
lengths to prevent cheating by isolating candidates during written and performance
exams (Bowman, 1989). Many of these practices endure today. Such an elaborate
system was deemed necessary in order to select the best candidates on merit, not pa-
tronage — and the failure rate often exceeded 90%. These grueling exams went on
for 72 uninterrupted hours.
It is frequently hypothesized that the ancient Greeks, perhaps around 500 BCE,
used testing in the educational processes of that day. Indeed, both Socrates and Plato
are believed to have emphasized that efficient career choices should rely heavily on a
student's demonstrated abilities and aptitudes. Unfortunately, much of the historical
record for the next 2,000 years was lost. In 1540, the Jesuits, a holy order of the
Roman Catholic Church dedicated to education and scholarly pursuits, became
early leaders in the establishment of assessment procedures at the university level by
administering the first written examinations. As one can imagine, this was a some-
what controversial endeavor, followed by much debate over bias and fairness. Nearly
60 years later, the Jesuits issued agreed-upon rules for administration of written
exams. This innovation was cautiously followed and implemented by other univer-
sities over the next several centuries.
Measurement in the Laboratory
A second "movement" in the history of assessment involved the use of testing in the
emerging field of experimental psychology. This field sought to harness the emerg-
ing use of the scientific method to explore the psychological world of human beings.
Prior to the use of the scientific method, mathematical models, such as those
developed by Herbart, Weber, and Fechner, were used to describe the effects of such
concepts as stimulus intensity and psychological thresholds.
Charles Darwin is often credited with spurring the experimental interest in in-
dividual differences through publication of his book On the Origin of Species by
Means of Natural Selection in 1859 (Cohen & Swerdlik, 1999). Darwin proposed
that individual differences in adaptation and characteristics accounted for the sur-
vival of entire species and individuals within species. His theory of evolution was
controversial and thought provoking. It was especially inspiring for Darwin's
half-cousin, the English biologist Sir Francis Galton, who made tremendously influential
contributions to the early attempts at measurement of individual differences and
cognitive heritability (Forest, 1974).
Galton developed numerous techniques and instruments for measuring individ-
ual physical and psychological characteristics, and his methods inspired the precur-
sors to modern-day rating scales and surveys. Overall, he inspired a whole generation
of laboratory researchers to determine individuals' "deviation from average" (Galton,
1869, p. 11) and to classify individuals "according to their natural gifts" (p. 1)
through his studies of heritability on sweet peas. Galton's goal was to study human
heredity by measuring the characteristics of related and unrelated individuals and
showing that some characteristics made individuals more "fit for survival" than oth-
ers. He was one of the first scholars to propose that intelligence could be measured
through assessing sensory capabilities, for intelligence stems from information, and
all information must pass through the senses. Thus the more acute and attuned one's
senses, the greater the likelihood of information being passed through the senses and
influencing intellectual judgments. In 1884 he opened an exhibit at the Inter-
national Health Exhibition, which was later reestablished at University College,
London, as the Anthropometric Laboratory. Here Galton measured human
characteristics and abilities such as height, weight, arm span, muscular strength, reaction
time, discrimination of color, and visual acuity. These initial attempts at measure-
ment, while considered to be invalid measures of intelligence by today's standards,
nonetheless created widespread excitement in the burgeoning field of psychological
measurement. Galton also proposed the statistical concept of correlation, although
it was the mathematician Karl Pearson — Galton's student, close friend, and biogra-
pher — who later provided the statistical formula for linear correlation (i.e., the
Pearson product-moment correlation coefficient) that has endured to present day.
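To make the idea of linear correlation concrete, the following is a minimal Python sketch of the Pearson product-moment correlation coefficient. The function name pearson_r and the height and arm-span values are invented for illustration only; they are not Galton's or Pearson's data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements (in inches) for five people
heights = [60, 64, 66, 70, 72]
arm_spans = [59, 65, 66, 71, 73]
r = pearson_r(heights, arm_spans)
print(round(r, 3))
```

A coefficient near +1 or -1 indicates a strong linear relationship between the two sets of measurements, while a value near 0 indicates little or no linear relationship.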
In 1879, Wilhelm Wundt opened the world's first experimental psychology
laboratory at the University of Leipzig in Germany. He is widely regarded as the
founder of the science of psychology (Hearst, 1979), and many of the early
experimental psychologists, including Charles Spearman and James McKeen Cattell,
studied at his lab. The hallmark of this era was the drive to rigorously control exper-
imental conditions in order to standardize observations and collection of data.
Cattell, a U.S. psychologist, was inspired by Galton's writings to conduct his
doctoral dissertation on individual differences in reaction time, a study that contin-
ued the momentum toward measurement of human characteristics. In 1890, Cattell
was the first to use the term mental test to describe his efforts to measure intelligence.
Kraepelin (1895) and his student Oehrn (1889) developed more sophisticated
mental ability tests, including arithmetic, memory, and perceptual tasks. In addition,
Ebbinghaus (1897) developed sentence completion, arithmetic, and short-term
memory tasks. All of these early efforts to develop psychological tests continued the
movement to the modern era of assessment.
Modern Clinical Applications of Assessment: Decision Making
and Determination of Individual Differences
In any field of study it is important for the stage to be set with precursor events until
a critical mass of knowledge has developed; historical events or social needs arise; and
motivated, creative thinkers move the emerging field forward. In the field of assess-
ment, many pioneers took the developing field in numerous directions quite quickly,
leading to an explosion of assessment applications during the 20th century. These
applications were primarily directed at identifying differences between and among
individuals so that identification and diagnostic practices, as well as intrapersonal
strengths and weaknesses, could be translated into remedial and treatment strategies.
At the core, these efforts were directed at helping clinicians and educators make bet-
ter, more accurate decisions about human beings than could be made through other,
less standardized methods of the day. Most notably, the field moved in four primary
directions: intellectual assessment, achievement assessment, vocational and career as-
sessment, and clinical and personality assessment.
Intellectual Assessment
Many individuals have contributed to the rise of testing with educational and clini-
cal applications. In many ways, the work of Galton, Cattell, and Kraepelin laid the
foundation for the proliferation of these tests during the 20th century. Early at-
tempts at measuring intelligence stemmed from the need to develop procedures to
identify students with mental and emotional deficiencies for remedial education. In
the earliest recorded attempt, Seguin (1866/1907) in 1837 developed the Seguin
Form Board Test, which in some ways resembles modern efforts to assess mental
deficiencies.
In France, the minister of public instruction appointed physiologist and psy-
chologist Alfred Binet to a commission tasked with determining efficient ways to
identify children with mental retardation. Working with a French physician,
Theodore Simon, Binet constructed the first practical intelligence test in 1905, the
Binet-Simon Scale. This scale presented 30 brief tasks in approximate order of diffi-
culty accompanied by relatively precise administration instructions. The original
scale was administered under these standard conditions to a standardization sample
of 50 children. With this comparison group, Binet could now determine any new
child's score and evaluate or interpret it within some context. This revolutionary
process, while crude by today's standards, allowed for a rudimentary decision-mak-
ing process about a child's intellectual ability. In addition, Binet and Simon departed
from the traditional focus on assessing sensory processes and focused item develop-
ment more on reasoning and judgment. Unfortunately, the original scale derived no
index or standardized score other than a raw score. Thus interpretations were limited
primarily to descriptions of whether the child had basically normal intelligence or
how far above or below normal the child's score appeared to fall. A further limitation
of the original scale was the poor representativeness of the standardization sample to
the overall population.
These limitations were addressed in the 1908 revision of the Binet-Simon Scale,
which nearly doubled the number of items on the original scale. The standardiza-
tion sample included more than 200 children and was more representative of the
population the test was meant to assess. In addition, Binet introduced the concept
of mental age, an important innovation at the time, which allowed the evaluator to
determine performance in terms other than the raw score. Each scale task or item
was evaluated to determine the average chronological age at which a child mastered
the task. This helped to specify normal or average performance for each item accord-
ing to an age equivalency, which became the item's "mental age." Thus a normal 7-
year-old child would achieve a mental age of approximately 7 years, while a bright
7-year-old might have a mental age of 9 or 10 years. Conversely, a 7-year-old
child with mental retardation might have a mental age closer to 4 or 5 years. The
child's mental age (MA) and chronological age (CA) could be used to calculate a
ratio intelligence quotient (IQ) using the formula (MA ÷ CA) × 100, a rudimentary
form of the modern IQ score.
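The ratio IQ calculation can be illustrated with a short Python sketch. It divides mental age by chronological age and multiplies by 100. The function name ratio_iq is hypothetical, and the mental ages used are simply picked from the ranges given above for three 7-year-old children.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = (MA / CA) x 100, with both ages in the same unit (e.g., years)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return (mental_age / chronological_age) * 100

# Three 7-year-olds with different mental ages (illustrative values)
print(round(ratio_iq(7, 7)))  # average child: 100
print(round(ratio_iq(9, 7)))  # bright child: 129
print(round(ratio_iq(5, 7)))  # child with a mental age of 5: 71
```

Because the score is a ratio, a child whose mental age equals his or her chronological age always earns exactly 100, regardless of age.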
The Binet-Simon Scale received a minor revision again in 1911, but by this time
the interest in assessing intelligence had caught on in a number of countries, includ-
ing the United States. Lewis M. Terman of Stanford University translated the Binet-
Simon Scale into English, adapting, revising, and adding many items and instruc-
tions in the process. In 1916, Terman released the Stanford Revision and Extension of
the Binet-Simon Intelligence Scale, featuring a standardization sample of more than
1,000 people. In 1937, this test became the Stanford-Binet Intelligence Scale (SBIS),
revised in 1960, 1972, and 1986. The SBIS is now in its fifth edition (Roid, 2003).
Terman's contribution was noteworthy in several ways, perhaps most impor-
tantly because it made the widespread assessment of intelligence possible. This was
timely because around the same time that Terman released the Stanford-Binet, World
War I broke out and the military had a tremendous need to screen soldiers in order
to assign them to appropriate duties in an efficient manner. The army contacted
Robert Yerkes, then president of the American Psychological Association, to seek the
association's help in developing large-scale assessment instruments for selection and
classification. Instruments of that time period were nearly all individual assessments,
which were generally time-intensive and cost-prohibitive, requiring highly skilled
evaluators — not the kind of efficient tools needed to screen thousands of military re-
cruits each month. In 1917, Yerkes (1921) led a committee of many of that era's
greatest measurement experts to produce two group-administered tests of ability: the
Army Alpha, which required reading ability and comprehension, and the Army Beta,
a nonverbal test used to assess the abilities of illiterate or non-English-speaking
adults. These tests used a multiple-choice format, a recent innovation popularized
by Arthur S. Otis. Although the tests were not completed in time to be of help in
screening World War I recruits, these early efforts at developing individual and
group-administered tests of intellectual ability fueled widespread optimism about the
role assessment could play in society, especially in institutions such as education and
the military.
Interestingly, the first tests of intelligence were produced with little thought
given to theoretical underpinnings — that is, they were atheoretical. It was not until
the late 1920s that discussions about the definition, makeup, and characteristics of
intelligence were held by scholars. Spearman (1927) proposed that intelligence is dis-
played in two dimensions: one that helps an individual solve general tasks (g), and
another that helps individuals solve specific tasks (s). Spearman's concept (g), per-
haps the most famous, and infamous, in the field of intelligence testing, spurred a
great deal of empirical study and philosophical and political discussion. For example,
in contrast to Spearman, Thurstone argued that intelligence was not explained by
one general (unidimensional) factor called intelligence, but was actually composed
of seven primary mental abilities. Much more discussion on the topic of intellectual
theories and models is presented in Chapter 10. Suffice it to say here that the early
efforts by Binet, Spearman, Thurstone, and many others led to an explosion in
modern-day intelligence and aptitude testing.
To be sure, there have been several periods of criticism associated with testing in
general, and intelligence testing in particular. The first came during the 1930s, the
time of the Great Depression, and stemmed from unclear expectations over the roles
tests could and should play in measuring human experiences and abilities. Many
challenges related to how to measure human abilities were raised during this time.
Fortunately, social sciences took on these challenges with gusto, developing new as-
sessment methods, tests, and more powerful statistical techniques to aid in analyzing
test items and results.
Perhaps the most famous name in U.S. intelligence testing today is David
Wechsler (Wechsler passed away in 1981). In 1939, Wechsler, at the time a clinical
psychologist in New York City's Bellevue Hospital, published the Wechsler-Bellevue
Intelligence Scale. This individually administered test of adult intelligence was de-
signed to measure the "global capacity of the individual to act purposefully, to think
rationally, and to deal effectively with his environment" (p. 3). In 1955, Wechsler
revised the Wechsler-Bellevue and changed the name to the Wechsler Adult Intelligence
Scale (WAIS). It was revised again in 1981 and 1997, and this most recent edition is
known as the Wechsler Adult Intelligence Scale — Third Edition (WAIS-III) (Wechsler,
1997). Wechsler's adult test offered several innovations or practical facets that be-
came industry standards over the years. First, his test was actually a series of "sub-
tests," each measuring a different facet of intelligence. Each facet contributed to the
overall (full-scale) intelligence quotient. Also, Wechsler was one of the first to use a
standard deviation IQ, rather than the ratio IQ popularized by the Stanford-Binet.
Finally, Wechsler took a very pragmatic view of intelligence, rather than a theoreti-
cal view. Basically, Wechsler chose what he believed to be the most efficient and use-
ful measures of intelligence from previously developed measures and developed orig-
inal items to create a particularly engaging and user-friendly format. Sources for his
subtests included the Army Alpha (Information, Comprehension, and Picture
Arrangement) and Army Beta (Coding); the 1916 Stanford-Binet (Vocabulary,
Similarities, Comprehension, Digit Span, and Arithmetic); the Healy Picture
Completion Tests (Picture Completion); and the Kohs Block Design Test. Importantly,
Wechsler combined scores from each subtest to arrive at an estimate of general men-
tal ability (g), not numerous primary mental abilities or specific facets of intelligence
that others (e.g., Louis Leon Thurstone, Robert Sternberg, Howard Gardner) have
described. The subtest format was simply a method for measuring general intelli-
gence through multiple measures.
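To make the contrast between the ratio IQ and the standard deviation IQ concrete, the short Python sketch below computes each. It is an illustration only, not Wechsler's actual scoring procedure: the function names and example numbers are hypothetical, and the deviation IQ is assumed to use a mean of 100 and a standard deviation of 15, the convention associated with the Wechsler scales.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score, norm_mean, norm_sd, iq_mean=100.0, iq_sd=15.0):
    """Deviation IQ: locate a score within its age-group norms.

    norm_mean and norm_sd describe the (hypothetical) norm group for the
    examinee's age; iq_mean and iq_sd are the assumed reporting scale.
    """
    z = (raw_score - norm_mean) / norm_sd  # distance from the mean in SD units
    return iq_mean + iq_sd * z


# An 8-year-old performing like a typical 10-year-old.
print(ratio_iq(10, 8))
# A raw score one standard deviation above the age-group mean.
print(deviation_iq(60, 50, 10))
```

Because the deviation IQ depends only on where a score falls relative to same-age peers, it keeps the same meaning at every age, whereas a ratio IQ becomes hard to interpret in adulthood, when mental age no longer rises in step with chronological age.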
The success of the adult Wechsler scale led Wechsler to develop a version for use
with school-aged children from 6 to 16 years. In 1949, Wechsler published the
Wechsler Intelligence Scale for Children (WISC). The WISC was revised in 1974, 1991,
and 2003. It is currently known as the Wechsler Intelligence Scale for Children —
Fourth Edition (WISC-IV) (Wechsler, 2001a) and follows a subtest format similar to
that of the adult version. It is the most commonly used individually administered
intelligence test in the world. In order to address the recent increased need for
assessing intelligence in the preschool population, Wechsler (1967) published the
Wechsler Preschool and Primary Scale of Intelligence (WPPSI), again following a sub-
test format similar to that of the child and adult Wechsler versions. The WPPSI was
revised in 1989 and again in 2002 and is currently known as the Wechsler Preschool
and Primary Scale of Intelligence — Third Edition (WPPSI-III) (Wechsler, 2002). The
Wechsler series of intelligence tests has significantly influenced intelligence testing
and the profession's conceptualization of intelligence, and is reviewed in more detail
in Chapter 13.
A second period of intense social and political criticism developed during the
1960s and 1970s due to several societal factors, including the civil rights movement
and congressional hearings into rights to privacy. This period was termed the Era of
Discontent by Maloney and Ward (1976). Several influential books and court cases
occurred during this period. Whyte (1956), in Organization Man, accused users of
employment and other selection tests of choosing workers who fit the organization's
structure, or status quo, rather than those who would do the best work or were most
qualified. Houts (1977), in The Myth of Measurability, insisted that tests were
instruments of oppression used by the privileged to control the poor. Houts maintained
that tests punished creative individuals, caused irreparable damage to children
through educational labeling, and generally were being used to make decisions the
tests either were not meant to make or lacked the technical adequacy (e.g., reliabil-
ity, validity) to make.
In 1967, in Hobson v. Hansen, a federal judge determined standardized group
ability tests to be biased and discriminatory against minorities, rendering the tests
unacceptable as placement tests for special education. In 1979, another federal
judge made a similar ruling regarding individualized intelligence tests in Larry P.
v. Wilson Riles. During the 1990s, New York State's Truth in Testing Act,
ostensibly passed over concern about the possible misuse of Scholastic Assessment Test (SAT)
test scores, requires the release of all questions used on the administration of the
SAT after it has been conducted. While perhaps well intentioned, this law allows
the public to view every question comprising recent versions of SAT administra-
tion, in effect making the items unavailable for further use. Such a practice drives
up the cost to consumers (i.e., the parents of college-bound youth), because the
College Board must spend a great deal of extra money to constantly create new
items that have a one-time-only use.
During this period, many expressed concern over the widespread use of intelli-
gence and personality tests in employment and school testing programs (Thorndike,
1997). Indeed, in 1972, the National Education Association actually called for an
end to routine standardized achievement, aptitude, and intelligence testing. It was
feared that such tests could be used, intentionally or unintentionally, to discriminate
against people, particularly women and minorities. It was demonstrated that the
content of some tests did, in fact, lead to discrimination in decision making, al-
though not to the degree critics insisted was the case. However, as reported by
Anastasi (1976), tests were already routinely being used to make decisions about col-
lege admissions, schoolchildren with learning difficulties, and adult populations with
special needs. Often these test scores were used to make decisions that were beyond
the test's technical specifications, leading to widespread criticism, disillusionment,
and skepticism.
Again, test developers viewed these criticisms as challenges to be overcome
through scientific study and developed procedures and methods to identify and
correct biased test content. This process led to a movement to develop culturally fair
and unbiased tests that is firmly implanted to this day. Nevertheless, in spite of ef-
forts by the test publishing industry to address these issues and to allay public con-
cerns, periodic legislation and court decisions occur that restrict the use of tests, be-
cause no test, no matter how well developed, is perfect. Furthermore, tests are
interpreted by professionals with varying levels of training and expertise, and mis-
takes can and do occur. Of course, it is these mistakes that end up in legislative
houses and court buildings, reported by the press, and concerning the public. A
moderate amount of public wariness regarding testing can be expected to continue
well into the future, and is probably helpful in keeping test developers and test users
focused on best practices of test use. We further explore public concerns about test-
ing later in this chapter.
Achievement Assessment
On the achievement testing front, a major shift in educational assessment occurred
in 1845 when the Boston public school system opted for written essay exams over
the traditional oral exams (Anastasi & Urbina, 1997). Interestingly, the arguments
in favor of moving to this radical form of testing included broader content coverage,
standardized conditions, standardized item selection, and reduced possibility of
favoritism. If these criticisms of oral testing sound familiar, they should. Several were
the same arguments later used to replace essay exams with the multiple-choice
format.
Between 1897 and 1903 in the United States, Joseph Mayer Rice tested tens of
thousands of students to create the first large-scale standardized tests of spelling,
arithmetic, and language. Rice's work stimulated additional attempts at standardized
test development by Edward L. Thorndike of Columbia University's Teachers
College. During the early 20th century, the Teachers College became the hub of ef-
forts to standardize educational tests, and Thorndike and the assessment specialists
he trained were at the center of the revolution. It was at this time that issues of sub-
jectivity of essay and extended-response items were explored. Test developers and
users of this era were quick to notice that judges often did not agree on the "correct-
ness" of a constructed answer. As a result, multiple-choice and other forced-choice
item response formats were developed and came into prominence during the first
several decades of the century. The advent of test scoring machines around the mid-
dle of the century made multiple-choice formats even more popular, as thousands of
test protocols could be scored with ever-increasing efficiency (i.e., less time, greater
accuracy, fewer scorers required).
In the first two decades of the 20th century, achievement was measured either
by a single test combining several subject areas into a single score, or by a single test
constructed to measure a single subject area score. In 1923, Truman L. Kelley, Giles
M. Ruch, and Terman published the first edition of the Stanford Achievement Test
(SAT, not to be confused with the Scholastic Assessment Test), which is currently in
its 10th edition. The Stanford Achievement Test was the first standardized achieve-
ment battery and was designed to measure several subject areas simultaneously and
to report each area score separately. In this way, a teacher could understand a stu-
dent's separate performances in math, reading, and spelling through administration
of a battery of achievement tests. Also, the SAT provided a national standard of com-
parison so that performance of students in one school could be compared to that of
students in various other parts of the country. As normed, multiple-choice measures,
standardized achievement tests had many advantages over teacher-administered and
teacher-scored essay-based tests, which had previously dominated public and private
school education. Standardized achievement tests were relatively easy to administer
and score, objective (i.e., minimized favoritism), and less expensive; covered broader
ranges of content; and gave a measure of student performance against that of others
in the same grade. By the 1930s, standardized achievement tests were widely viewed
as more reliable, meaningful, and fair than essay tests (Anastasi & Urbina, 1997).
Numerous group-administered achievement test batteries have been developed
over the years. In 1936, Everett F. Lindquist published the Iowa Every-Pupil Tests of
Basic Skills, an achievement battery known today as the Iowa Tests of Basic Skills (6th
edition). Lindquist also later developed an electronic test scoring method that made
mass scoring of multiple-choice questions quick and inexpensive. The Metropolitan
Achievement Test, originally published in 1931, is now in its 8th edition. A recent
arrival, TerraNova 2 (CTB/McGraw-Hill, 2001), resulted from a merging of the most
recent revisions of the Comprehensive Test of Basic Skills and the California
Achievement Test. In 1969, the United States launched the National Assessment of
Educational Progress program to determine the effectiveness of the country's educa-
tional system and track changes in student characteristics and performance over
time. The program is still in operation today.
Perhaps the single most influential occurrence in educational testing was the
passage of Public Law 94-142, the Education for All Handicapped Children Act
(1975), which provided federal oversight and funding for special education programs
across the country. Reauthorized in 1990 and now known as the Individuals with
Disabilities Education Act (IDEA), this landmark legislation led to the widespread
use of individualized intelligence and achievement tests in public schools. Public Law
94-142 resulted in educational services being provided to millions of students who
have substantial learning problems, including learning disabilities, mental retarda-
tion, emotional disturbances, and visual, hearing, or orthopedic impairments.
Several important individual achievement batteries were developed in the late
1970s and 1980s to address the need for assessment of learning problems and were
immediately put to use to assess the achievement of children and adolescents. These
batteries included the Woodcock-Johnson Tests of Achievement, now in its third edition
(WJ-III ACH) (Woodcock, Mather, & McGrew, 2001); the Peabody Individual
Achievement Test, now in its revised edition (PIAT-R) (Markwardt, 1998); and the
Wechsler Individual Achievement Test (1992), now in its second edition (WIAT-II)
(Wechsler, 2001b).
At about the same time as Public Law 94-142, Congress passed the U.S.
Rehabilitation Act of 1973. While the act was well known at the time for requiring
wheelchair access ramps, curb cutting, and elevators in buildings and localities that
accepted federal funds, some provisions went unnoticed until years later. Section 504
of this act provides that any individual with a mental or medical impairment that
affects occupational, learning, or social functioning (among others) is entitled to
accommodations to facilitate success. Section 504 accommodations are commonly
provided to students in schools today whose mental or medical conditions are not so
severe as to qualify for services under IDEA.
During the 1980s and 1990s, many educators criticized the reliance on mul-
tiple-choice testing on the grounds that it does not allow assessment of students'
understanding of depth of content or reasoning, their ability to integrate knowl-
edge from various aspects of a discipline of knowledge, or their ability to explain
complex thoughts and ideas, because they are only required to color in a bubble,
rather than to construct their own meaningful written response. This backlash led
a large number of states and school systems to develop "authentic," or perform-
ance-based, assessment programs. Generally, these assessment programs present
students with real-life problems to be solved, usually resulting in some constructed
essay response. However, just as multiple-choice questions have strengths
and limitations, so do performance-based tests. As explained in Chapter 1, one of
the primary problems with performance-based assessment is its
lower test score reliability. Many performance-based assessments do not reach a
minimally acceptable standard of reliability to report an individual student's score.
The passage of the No Child Left Behind Act of 2001 (NCLB) will likely reduce
the use of performance-based tests because it requires that individual scores be re-
ported in reading and math for students in grades 3 through 8. Still, many educa-
tors view portfolios and performance-based assessments as better indicators of stu-
dent performance than multiple-choice tests (Muir & Tracy, 1999; Russo &
Warren, 1999).
Vocational and Career Assessment
Although he never developed a standardized assessment of vocational development,
Frank Parsons was a pioneer in the vocational guidance movement and has come to
be known as a founder of the school guidance movement. He advocated for the un-
derstanding of the person and the world of work so that an individual could be
matched with an appropriate occupation. Thus the specialized field of career assess-
ment was born. Numerous applications and venues for career assessment have de-
veloped over the years, and career assessment often integrates knowledge of an indi-
vidual's aptitudes, achievements, interests, competencies, values, and beliefs.
Around World War I, aptitude testing became critically important in the mili-
tary (e.g., Army Alpha and Army Beta), followed by more specific applications to vo-
cational choices. The gains made in the field of intelligence testing coupled with the
realization that multiple abilities could be assessed (not just g) led to widespread
applications in aptitude assessment. For example, scholastic aptitude tests were
developed as far back as the 1920s to help identify students with the capabilities to
meet the academic challenges of higher education.
During the 1920s and 1930s, the use of aptitude tests became common in
industry for the selection and classification of employees. Specialized tests measuring
mechanical and clerical aptitudes were especially common. Perhaps more
important in the long run, several vocational interest inventories were developed,
foreshadowing the importance vocational counseling would hold in the future.
During this time, Edward K. Strong published the Strong Vocational Interest Blank
(today known as the Strong Interest Inventory), and Frederick Kuder published the
Kuder Preference Record — Vocational.
During World War II, the armed services again had great need to identify re-
cruits who could fulfill increasingly technical job responsibilities. This need, along
with development and refinement of the statistical technique known as factor analy-
sis, led to the further development of specialized aptitude tests and the general mul-
tiaptitude batteries. These multiaptitude batteries could help identify an individual's
strengths and limitations, as well as predict performance in certain academic and vo-
cational tasks. They still enjoy widespread popularity in many high school career as-
sessment programs today because they can provide insights into intrapersonal
strengths and weaknesses, thus helping to determine which higher education or
vocational choices may make a good fit. These multiaptitude batteries, further
described in Chapter 11, include the General Aptitude Test Battery, the Differential
Aptitude Test (DAT), and the Armed Services Vocational Aptitude Battery (ASVAB).
In 1958, in response to the successful Soviet launching of the first satellite,
Sputnik, Congress passed the National Defense Education Act, funding school guid-
ance counselor positions in high schools across the country with the express purpose
of identifying students showing promise in the mathematical and science fields.
Professional school counselors quickly learned to rely on career aptitude and inter-
est inventories to help with this task. Numerous vocational interest, career values,
and belief inventories have been published over the past 50 years, aiding counseling
professionals in effectively addressing the critically important role career counseling
plays in society today. In particular, career counselors, college counselors, and pro-
fessional school counselors frequently use and encounter vocational aptitude and as-
sessment instruments in their work.
Clinical and Personality Assessment
Clinical assessment pertains to the identification of mental disorders and related
syndromes. Personality assessment is the applied area of psychology and counseling
concerned with the measurement of nonintellectual affective characteristics.
Importantly, many use the term personality in the broadest holistic sense and actu-
ally include the measure of intellect, aptitude, and achievement under a global cate-
gory (Anastasi & Urbina, 1997). However, in the parlance of psychological and ed-
ucational assessment, personality assessment is generally most concerned with
attitudes, characteristics, motivations, and interpersonal and affective traits.
During World War I, the U.S. armed forces became interested in identifying re-
cruits who were psychotic or otherwise not emotionally capable of military service.
Asked to develop a personality inventory that could be efficiently administered to
large groups of recruits, in 1919 Robert S. Woodworth developed the Woodworth
Personal Data Sheet, basically a structured paper-and-pencil psychiatric evaluation.
WWI ended without the original test ever being put into use. However, this proto-
col was later released for civilian use, and its creation spurred development of an en-
tire generation of self-report personality and clinical inventories during the 1920s
and 1930s. Unfortunately, these self-report tests assumed that respondents would
answer truthfully and be of sound mind and judgment. Of course, those the test was
meant to assess might be of neither sound mind nor judgment; the tests were trans-
parent and responses easily faked. For example, one of the more famous questions
from the Woodworth was, "I drink a quart of whiskey each day." From a social
desirability perspective, it is very easy, even for persons with an addiction to alcohol, to
see the consequences of such a question. No real procedures were in place for cross-
validation of responses, so clinicians frequently made decisions based on untruthful
responses, resulting in tremendous criticism of this burgeoning and promising area
of assessment.
Test developers again went to work devising "validity scales," subscales that
attempt to measure a client's forthrightness when answering questions. A milestone
in personality and clinical test construction occurred in 1940, when Starke R.
Hathaway and J. Charnley McKinley published the Minnesota Multiphasic
Personality Inventory (MMPI). This test led a resurgence of self-report personality
inventories because it addressed the issue of respondent forthrightness and developed
several validity scales that helped examiners to identify potentially invalid test protocols.
The MMPI has become the most commonly used and widely researched structured
clinical inventory in the history of assessment. Importantly, the MMPI was devel-
oped and used to assess the clinical population for mental and emotional disorders,
not the personality functioning of nonclinical individuals. However, the success of
the MMPI in addressing the critics of self-report inventories spurred numerous other
clinical inventories (e.g., Millon Clinical Multiaxial Inventory [MCMI], Beck
Depression Inventory [BDI], Achenbach System of Empirically Based Assessment
[ASEBA]) and behavioral inventories (e.g., Conners' Rating Scales [CRS-R], Behavior
Assessment System for Children [BASC]) for clinical purposes, as well as personality
inventories used with the general population (e.g., Myers-Briggs Type Indicator, 16 PF).
The MMPI is now in its second edition (MMPI-2) (Butcher et al., 1989) and also
has an adolescent version, the Minnesota Multiphasic Personality Inventory-Adolescent
(MMPI-A) (Butcher et al., 1992). The MMPI is also somewhat different
because it was not developed using factor analysis; instead it relies on items that are
empirically derived and criterion based. Currently, trait perspectives and the
five-factor model (Costa & McCrae, 1992) dominate the field of structured personality
assessment. Traits are enduring characteristics, and the research on personality assess-
ment appears to consistently identify a limited number of traits that underlie per-
sonality functioning (e.g., optimism, extroversion, openness to experience). This
model and numerous clinical and personality inventories are explored further in
Chapter 8.
Another method of personality assessment was conceived at about the same
time as Woodworth's self-report measure. In 1921, Swiss psychiatrist Hermann
Rorschach created a set of inkblots that aspired to provide examiners an x-ray view
of a client's personality. The Rorschach Inkblot Test sought to explore individuals'
unconscious thoughts and feelings by allowing them to "project" these thoughts,
feelings, needs, hopes, fears, and motivations onto ambiguous stimuli in an un-
structured task — in this case a blot of ink on a piece of paper that was folded in
half to form an otherwise meaningless, bilaterally symmetrical design. The inkblot
itself holds no meaning; clients attempt to structure the activity by projecting
meaning from the perspective of their own worldviews and particular personali-
ties. Response requirements are purposefully unclear, and the scoring criteria often
are very subjective. The technique did not catch on immediately in Europe but be-
came very popular in the United States during the 1930s and 1940s, when it was
adopted by many psychoanalysts, who viewed it as consistent with Freud's goal of
exploring the unconscious. The technique became even more popular in the 1950s
and 1960s as the field of clinical psychology and personality assessment in general
grew tremendously.
Numerous other projective tests have been developed, including single-word as-
sociations (e.g., "Say the first thing that comes into your mind when I say the word
mother"); incomplete-sentence blanks (e.g., "Complete this sentence: Friends think
I ____"); and drawing and storytelling tasks. In 1935, Henry A. Murray and
Christiana D. Morgan published the Thematic Apperception Test (TAT), which aimed
to give clinicians insight into client personality functioning by having the client look
at ambiguous pictures and tell a story about each. Ostensibly, clients would project
their needs and motivations into the story, yielding valuable clinical insight. Well-
known drawing techniques include the House-Tree-Person (H-T-P) and Kinetic
Family Drawing (KFD) techniques. For example, in the H-T-P clients draw pictures
of a house, a tree, and a person, and the examiner generally asks a number of follow-
up questions about each drawing. Each technique shares a common thread: There
are no right or wrong answers, just what is on one's mind and projected into the sit-
uation. Projective tests have the advantage of promoting forthrightness in clients be-
cause they usually have no idea what is expected, and therefore find it difficult or
unnecessary to be deceitful.
Think About It 2.1 What events or issues appear to consistently spark
the interest of the government and citizenry in testing?
General Historical Events Affecting Assessment
While the specialized disciplines of assessment (intellectual, achievement, career, per-
sonality) each contributed milestones of import, many more general events con-
tributed to the integration and advancement of the field. And while many of the
landmark advances in testing stemmed from wartime needs, the successful use of
tests in the military led to their widespread use in other avenues of society, includ-
ing education and industry. Important societal needs during the middle decades of
the 20th century drove this utilization, including free public education, substantial
population increases, mandatory school attendance, large increases in the number of
college-bound youth, civil rights movements for women and minorities, and the
rights of handicapped children and adults. Many of these testing initiatives stemmed
not only from general societal concerns, but also from specific test-related issues such
as sexual bias, cultural bias, and unfairness to certain segments of the population, all
leading to improvement in the development of tests.
The rapid advancement of testing in the 1920s and 1930s led to a tremendous
need to identify, catalog, and provide critical evaluations of available instruments. To
fill this need, in 1938 Oscar K. Buros published the first edition of the Mental
Measurements Yearbook (MMY). A new edition of these test reviews is produced every
couple of years and is now available in full text (online or CD-ROM) through most
university library systems.
With the proliferation of thousands of tests being published during the first half
of the 20th century, test developers and examiners realized that there was a lack of
standards governing the development and use of psychological and educational tests.
The American Psychological Association published a guidebook of technical recom-
mendations for test use in 1954 and was joined by the American Educational
Research Association (AERA) and the National Council on Measurement in
Education (NCME) in 1974 to publish the first edition of Standards for Educational
and Psychological Tests. These standards were revised in 1999 and continue to serve
as a resource for the use and evaluation of tests. Likewise, the Association for
Assessment in Counseling and Education (AACE) published the Responsibilities of
Users of Standardized Tests (RUST-3) statement, which is now in its third edition
(AACE, 2003a).
In one of the first cooperative mergers among test publishers, the American
Council on Education (ACE), the Carnegie Corporation, and the College Entrance
Examination Board (CEEB) combined forces during the 1950s to establish the
Educational Testing Service (ETS). This merger centralized the publication and scor-
ing of some important tests into a profitable and convenient joint endeavor. ETS
continues to publish the Scholastic Assessment Test (SAT) and the Graduate Record
Exam (GRE) to this day.
In education, the pendulum continues to swing. Most notably, the humanistic
orientation of the 1970s was replaced by a back-to-basics movement and the current
standards-based and high-stakes approaches to assessment. The back-to-basics move-
ment led many states to develop minimum competency examinations that were de-
signed to ensure that students graduating from high school had the minimum essen-
tial academic skills to function in a modern society (Lerner, 1981). High-stakes
testing (a chapter on this subject is available on the companion website) may result
in students not being promoted to the next grade or not graduating from high school
unless achieving a certain minimum level of proficiency measured by the test.
Similarly, some states have mandated examinations for teachers to demonstrate that
they can read, write, and communicate effectively and that they have mastered the
content of the subject they were hired to teach.
Several significant pieces of legislation were passed during the 1970s, including
the 1974 Family Educational Rights and Privacy Act (FERPA), which mandated the
rights of parents and children over the age of 18 years to view school records and re-
quired parental consent for assessment conducted around specific topics.
Computers have changed the complexion of assessment and will continue to do
so for the foreseeable future. Computers can now be used to administer, score, and
interpret numerous psychological and educational tests, greatly aiding the efficiency
of the process. Now examiners can receive scoring and interpretive services in the
comfort of their own offices for assessment instruments as diverse as career, achieve-
ment, and intelligence tests — even the MMPI-2 and Rorschach.
Computer-assisted career guidance programs were devised in the 1960s and
continue to grow in strength and purpose even today. High school students regularly
cruise the Internet to take online career inventories, find information about career
and educational opportunities, and even locate scholarship funds and complete on-
line college and job applications. Accessible, low-cost, and quick, the immediate re-
sults and feedback of such innovations are the primary reasons for their continued
success (Zunker & Norris, 1998).
Adaptive testing has made administration and scoring of large-scale testing pro-
grams even more efficient. College students taking the GREs can now spend less
time on the computer-administered version than they would sitting in a classroom
with a paper-and-pencil version, and they can even find out their scores at the con-
clusion of the tests rather than anguishing for weeks. Schools can now receive com-
puter-generated interpretive reports that can be given to parents so that they may
understand their children's performances. Clients can take tests online, via the
Internet, making assessment incredibly convenient and efficient for everyone.
However, with technological innovation come ethical and legal challenges, topics
that are addressed later in this chapter.
Issues of diversity in assessment have been addressed by several professional
organizations, and the AACE has compiled a list of these standards
(http://aace.ncat.edu). During the 1990s, education experienced a shift toward performance-based,
authentic assessment, which strives to assess students' depth of understanding by hav-
ing them perform a task rather than take a pencil-and-paper examination. Likewise,
an assessment initiative known as portfolio assessment became very popular during
this time. Used for decades in modeling, art, and architecture, portfolios are a col-
lection of performance products or samples that can be displayed and evaluated ac-
cording to quality indicators. Breadth and depth of understanding displayed through
real-life performance is key to this form of assessment.
In summary, the past century has witnessed the many ups and downs of testing
as well as professional and technological innovations. Many criticisms have been pro-
posed, leading to changes in test development procedures and administration prac-
tice. The next section explores some of these concerns in more detail.
PUBLIC AND PROFESSIONAL CONCERNS
ABOUT ASSESSMENT
Millions of tests are given annually to help make decisions about people's lives. The
scope of test use in the United States alone is immense. The No Child Left Behind
Act of 2001 requires standardized testing of all public school students in grades 3
through 8. Nearly 2 million high school students take a college admissions test such
as the SAT or ACT each year. Almost 75,000 take a special admissions test for business
school, and more than 100,000 take one for law school admission.
Tests are important and helpful sources of information that, when used appro-
priately, help decision makers make better, more accurate decisions than can be made
without the use of assessment instruments. However, sometimes the process does not
work as planned. Decision makers may sometimes misunderstand the purpose of a
test or use tests to make decisions for which the test scores were never validated.
Sometimes the actual assessment process or the criteria for success are perceived as
unfair by professionals or the public. Finally, the issue of testing has sometimes been
viewed as a political tool, and has been used as one by some critics. Testing is big
business, meaning big money. Also, allocation of resources for schools and individ-
uals with disabilities or certain economic considerations is frequently tied to test per-
formance. For example, in some states higher-performing schools meeting state goals
have been rewarded with monetary compensation (e.g., program funding). In oth-
ers, lower-performing schools have received increased levels of funding for new aca-
demic initiatives to help close the achievement gap. In mental health clinics and
practices around the country, third-party reimbursement is achieved through assess-
ment and diagnosis of mental disorders. Eligibility for special education services
under IDEA or accommodations under Section 504 of the U.S. Rehabilitation Act
of 1973 involves assessment procedures preceded or followed by funding allocations.
In many ways, funding and assessment go hand in hand, meaning that politics are
inevitably involved.
Ebel (1976) indicated that primary critics of testing include professional educa-
tors concerned about the effect standardized testing has on accountability and cur-
riculum in the schools, reformers who view standardized testing as outmoded and
counterproductive to quality instruction, and media representatives looking to reveal
scandalous proceedings in social institutions. In fairness, the majority of teachers and
the vast majority of parents support the use of standardized testing, but a vocal, polit-
ically motivated minority keeps the issue at the forefront of national attention.
This is not to say that standardized testing has not been used in ways deserving
of criticism. Table 2.2 lists numerous issues creating public concern, even com-
plaints. Throughout this book, best practices meant to mitigate each of these com-
plaints will be addressed in some manner. Here we give a brief treatment of these
complaints.
Table 2.2 Some public complaints about tests
■ Decisions about children's lives should not be made on the basis of a single high-stakes test
score.
■ Tests are biased and unfair to minorities and women.
■ Tests create anxiety and stress.
■ Tests label and categorize.
■ Test developers dictate what students must know or learn.
■ Teaching to the test inflates scores.
■ Multiple-choice questions punish intelligent, creative thinkers; trivialize the complexities of
the learning process; and reward good guessers.
Decisions About People's Lives Should Not Be Made
on the Basis of a Single High-Stakes Test Score
We couldn't agree more! Professional counselors who make decisions about the lives
of others using a single test score are behaving unprofessionally, unethically, and, de-
pending on the location of practice, perhaps illegally. All major national professional
organizations agree on this point, as a quick perusal of major national organization
position statements on high-stakes testing will support. The same is generally true in
education. For the past 30 years and continuing through today, U.S. law has forbid-
den placement of students in special education classes on the basis of a single test.
Today, legal battles have ensued over a state's ability to withhold a diploma from a
high school student who met all curricular requirements and passed all academic
coursework but failed to obtain a minimum acceptable score on the state's high-stakes
test. Numerous universities "require" a certain SAT or ACT score for admittance
but state that the admissions process "takes other factors into consideration."
An axiom in assessment by counseling professionals should be that decisions about
people's lives should be made using multiple sources of information provided by multiple
respondents. Using a single piece of data or data provided by a single source to make
an important decision about a person is just plain wrong.
Tests Are Biased and Unfair to Minorities and Women
This issue receives far greater treatment later in this chapter, but for now it is impor-
tant to understand that tests are used to predict some performance criterion, and that
the concepts of fairness and bias have to do with how effectively tests accomplish this
goal for differing groups of individuals (e.g., race, gender). Thus, if an intelligence
test differentially holds some groups at an advantage and others at a disadvantage in
predicting the performance criterion, it could be biased. In modern practice, test au-
thors regularly go to great lengths to ensure fairness in test content, but because cul-
tures vary, bias of individual items may vary also.
Of course, it is essential that the performance criterion be equally free from bias.
An example is the sometimes-reported observation that standardized achievement
tests must be biased against girls because boys sometimes outperform girls on mul-
tiple-choice tests, but girls get higher grades in school-based classes. It is easy to jump
to this conclusion, except for one thing. Consider that the standardized test scores
are objectively derived and subjected to bias analyses. Can the same claim be made
for school grades? Nearly any school teacher will confirm that, on average, girls turn
in homework more frequently, prepare for exams and study more, are better behaved
in the classroom, and generally get higher test scores than boys. If this is the case,
girls should get higher grades than boys, but higher grades do not necessarily mean
that one knows more or has better mastery of the course content. Given this context,
it is just as logical to conclude that the criterion (grades) is more biased against boys
than standardized tests against girls. The point is, always consider the bias and fair-
ness of both the predictor (i.e., the variable/test score used to predict the criterion)
and the criterion.
Tests Create Anxiety and Stress
That tests create anxiety and stress is, of course, true; but not always in the way many
fear. Large-scale group-administered testing certainly creates a degree of stress that,
hopefully, reaches a moderate level. Remember the Yerkes-Dodson law: Moderate
anxiety maximizes performance; low and high anxiety minimize performance. Of
greatest concern is a student's phobic or panicked reaction due to a high degree of
anxiety, usually with high-stakes tests. While there is certainly anecdotal evidence to
support this claim of high degrees of pressure being placed upon students (includ-
ing physical illness, vomiting, and crying), this claim is not true for the vast major-
ity of students. Professional counselors understand that a small percentage of the
population suffers from test phobia and take steps to treat it when appropriate.
Professional counselors also understand that a significant proportion of the school-aged
population may be diagnosed with an anxiety disorder (see Chapter 7), usually
Generalized Anxiety Disorder, and take steps to treat these difficulties when appro-
priate. Anxious people are likely to get upset about tests and myriad other life events.
Professionals need to predict who will be affected and to take preventive and inter-
ventive measures. All told, the vast majority of individuals are not harmed or unduly
upset by standardized testing. In fact, most educators are far more concerned about
the other end of the spectrum — unmotivated students who care too little and do not
get anxious enough about tests.
Tests Label and Categorize
While it is true that tests label and categorize, technically speaking, it is the decision
makers (e.g., professional counselors, multidisciplinary team members, or mental
health professionals) who label and categorize. Frequently, labeling is a necessary evil
in society because labels are used to identify individuals in need of, and entitled to,
services. For example, identifying a child with a learning disability is a step toward
obtaining the educational services the child may need for academic success. Clinical
tests are often used to identify individuals with mental disorders so that third-party
(i.e., insurance company) reimbursement can be obtained for counseling services. In
this way, tests can be a valuable aid in making more accurate decisions about the cat-
egories that clients and students are determined to fit.
While the public holds many concerns about labeling of clients and students,
much of the concern about the use of labels lies in two areas: (1) that tests may be
used to mislabel an individual, and (2) that labels may be used as an excuse for some
remediable (or even nonexistent) condition. Professional counselors must always be
aware of the potential for misidentification. Tests are not perfect predictors; nothing
is. Tests are instruments that inform the decisions of professional counselors and
must be used with other sources and types of information to arrive at accurate deci-
sions. Inaccurate labels tend to have detrimental consequences for clients, sometimes
lasting for many years. For example, a 7-year-old boy inaccurately identified as men-
tally retarded may spend three or more years in an instructional program specially
designed for students with mental retardation. A young man inaccurately diagnosed
with schizophrenia may not only receive improper treatment, but be followed by an
erroneous paper trail and even wrongful discrimination in the workplace.
Others may use a label as an excuse for not trying in school or not pursuing
effective treatment strategies. For example, children with Attention-Deficit/
Hyperactivity Disorder (AD/HD) may use the condition as an excuse for not try-
ing hard in math. Worse, teachers and parents may use the diagnosis as an excuse
for not encouraging such students to put more effort into their studies. Excuses
such as, "He has a poor memory," "She can't write well so shouldn't be expected
to," or "He'll always be disorganized" may be true to a certain degree but also may
become self-fulfilling prophecies with no effort put forth to ever cope and com-
pensate for difficulties.
Test Developers Dictate What Students Must Know or Learn
Developers of achievement tests select items that measure the domain of knowledge
being assessed. They use several methods in this process, including curriculum and
textbook reviews, reviews of previously available tests, and consultation and evalua-
tion of experts in the given content area. The goal is to develop a test that faithfully
and accurately samples the domain of knowledge. In today's standards-based and
large-scale (group) assessment atmosphere, it is common for state departments of ed-
ucation to develop their own learning standards and instructional objectives and to
contract with publishers to measure those standards and objectives. Good curricu-
lum evaluation starts with well-defined standards, which are then implemented
through an effective curriculum (including benchmarks, instructional objectives,
and instructional activities) and appropriately assessed.
The key is for the test or assessment program to align perfectly with the curricu-
lum, and for the curriculum to align perfectly with the standards. In the past, many
large-scale achievement tests were "off the shelf" and thus may or may not have
aligned with a given school's curriculum. Misalignment can result in lowered test
scores. For example, if a curriculum teaches only half of what an achievement test
measures (i.e., 50% overlap between test and curriculum), then low scores will result.
Unfortunately, it was difficult for educators to determine whether low student scores
were due to misalignment (i.e., students were not taught half of what they needed to
know to do well on the test) or poor skills (i.e., students did not master the half of
the items that they were taught).
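The confound described above can be made concrete with a small sketch. It assumes a deliberately simplified model that is not part of the original discussion: students answer correctly only the items that were both taught and mastered, and answer everything else at a chance rate. The function name and its parameters are hypothetical and chosen only for illustration.

```python
def expected_score(overlap, mastery, guess_rate=0.0):
    """Expected proportion correct under a simplified, assumed model.

    overlap    -- fraction of test items covered by the curriculum (0 to 1)
    mastery    -- fraction of taught items the students actually mastered (0 to 1)
    guess_rate -- chance of answering an unmastered or untaught item correctly
    """
    known = overlap * mastery          # items both taught and mastered
    unknown = 1.0 - known              # everything else is answered by guessing
    return known + unknown * guess_rate


# Misalignment: only half the test was taught, but taught content was fully mastered.
misaligned = expected_score(overlap=0.5, mastery=1.0)

# Poor skills: the whole test was taught, but only half of it was mastered.
poor_skills = expected_score(overlap=1.0, mastery=0.5)

print(misaligned, poor_skills)  # both 0.5
```

Under these assumptions the two scenarios yield the same expected score, which is why a low score by itself cannot show whether misalignment or poor skills is responsible.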
Recently, educators and test publishers have worked collaboratively to develop
large-scale tests that are tailored to state needs and aligned with state learning stan-
dards. Frequently, these tests are composed of "off the shelf" items that do apply
to the state standards and are augmented with item pools that measure additional
specific state standards. In this way, test items align more precisely with state stan-
dards, and the burden is on school systems and individual teachers to develop and
implement an effective curriculum. The mechanics of this issue are addressed in the
chapter on high-stakes testing, which is available on the companion website for
this text.
"Teaching to the Test" Inflates Scores
As a continuation of the previous criticism, teachers are supposed to implement a
curriculum that provides the bridge between standards and assessment. "Teaching to
the test," a phrase loathed by most educators, means that the focus of instruction
becomes so prescribed that only content that is sure to appear on an exam is addressed
in instruction. Obviously, if this occurs, test scores should rise.
Whether test scores are inflated in this instance is a matter of content mastery.
Consider an example from the classroom. Teachers "teach to the test" all the time in
the regular curriculum. They have a learning objective, say single-digit-by-single-digit
multiplication (e.g., 3 × 6 = 18, 7 × 8 = 56); instruct students in the process for arriving
at correct solutions; assign activities in class and for homework to enhance student
mastery; and then, finally, test student knowledge with some kind of teacher-made
or textbook examination. If the students are prepared and motivated, and the
teacher implements the instruction efficiently, students should receive high scores.
Whether the scores are inflated depends on whether the student scores reflect mas-
tery of the domain of behavior — that is, can the students effectively solve nearly all
single-by-single-digit multiplication problems. If the answer is yes, great — that was
the goal. In contrast, assume the teacher decides ahead of time that the test will be
composed of 10 items and the students are instructed and drilled only on those 10
items. It is quite likely that the students will do very well on the examination but not
be very proficient at calculating items from the broader domain. In this example, the
test scores do not accurately reflect the level of mastery of the total domain. As a re-
sult, it can be said that the scores are inflated.
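The contrast between a test score and mastery of the broader domain can be illustrated with a short sketch. It assumes an extreme, hypothetical case: the domain is the 100 multiplication facts from 0 × 0 through 9 × 9, and the students know only the 10 drilled facts and nothing else. The variable and function names are illustrative only.

```python
from itertools import product

# The full domain: every single-digit-by-single-digit multiplication fact (0-9 x 0-9).
DOMAIN = list(product(range(10), repeat=2))      # 100 facts

# Hypothetical scenario: students are drilled only on the 10 items that will be tested.
TEST_ITEMS = DOMAIN[33:43]                       # an arbitrary slice of 10 facts
drilled = set(TEST_ITEMS)


def proportion_correct(items, known_facts):
    """Share of items answered correctly by a student who knows only known_facts."""
    return sum(item in known_facts for item in items) / len(items)


test_score = proportion_correct(TEST_ITEMS, drilled)    # 1.0
domain_mastery = proportion_correct(DOMAIN, drilled)    # 0.1
print(f"test score: {test_score:.0%}, domain mastery: {domain_mastery:.0%}")
```

In this sketch the 100% test score overstates a 10% mastery of the domain, which is the sense in which the scores are inflated.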
To solve this dilemma, test publishers, state education departments, and local
educators must work collaboratively to develop test items that adequately sample the
broad content domain and standards. Equally important, these entities must protect
and secure the test content so that teachers do not know which items will appear.
This ensures that student test performance reflects content mastery, not the teaching
of how to solve specific items. In the end, if teachers understand the standards, are
provided with an effective curriculum and material resources, and effectively imple-
ment the instructional strategies, then motivated, prepared students will master the
domain of knowledge being assessed. (Note that there are a lot of "ifs"!)
Multiple-Choice Questions Punish Intelligent, Creative Thinkers;
Trivialize the Complexities of the Learning Process;
and Reward Good Guessers
While multiple-choice questions can effectively measure knowledge and skills in di-
verse areas, it would be absurd to propose that they can effectively measure every-
thing. Sometimes extended-response items (e.g., essays) or performance evaluations
are necessary because they allow for the assessment of applied skills and more thor-
ough explanations. For example, in the training of professional counselors, it is a
necessary and common occurrence for the trainee to be observed actually counseling
clients, either live or on video. No multiple-choice or essay test can substitute for this
performance assessment. That is not to say that certain knowledge components of
the counseling process cannot be tested — only that the act of counseling is a fluid,
applied process that happens with real people. In some instances, indirect measures
cannot be substituted for direct measures.
Whether multiple-choice items measure trivial or meaningful information is re-
ally in the hands of the test developer. Remember from the discussion above that test
items are created to measure some standard or objective so that an inference can be
made about the mastery of a domain of behavior. Thus if the standard or objective
is trivial, so will be the question. Well-crafted multiple-choice questions can meas-
ure advanced, high-level thinking every bit as well as other response formats. It all
comes down to the skill of the item writer.
The criticism is often made that students who are "good guessers" or "lucky
guessers" can get significantly higher scores on a multiple-choice test. However, the
facts simply do not support this assertion. On a typical four-choice, multiple-choice
question, the likelihood of getting a question correct just by guessing is 25% (0.25).
Now if the test has very few items on it, getting one additional question correct
might make a difference, but most large-scale assessments have hundreds of ques-
tions, and subtests usually have dozens. Thus to get an appreciably higher score, one
would have to guess correctly on several to perhaps dozens of questions. Anyone can
beat the odds, of course; but what are the odds of beating the odds? Let's use as an
example that students would need to guess correctly on four questions in order to ap-
preciably increase their score. When you know that a student has a 25% (0.25)
chance of guessing correctly on each item, the odds are easy to compute: 0.25 × 0.25
× 0.25 × 0.25 ≈ 0.004, or a 0.4% chance of guessing correctly on all four items. This
means that 4 out of 1,000 students taking that subtest might get a substantially
higher score. Now if one is a die-hard gambler, these odds are about four times
higher than hitting the "Pick 3" Lotto — something to get excited about, perhaps.
But in the assessment arena, few would bet their college admission prospects, or their
grade in an assessment course, on them.
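The arithmetic above can be checked with a few lines of code. This is a minimal sketch, assuming independent blind guesses on four-choice items and a straight "Pick 3" bet that wins 1 time in 1,000; the function name is hypothetical.

```python
def p_all_correct(n_items, p_guess=0.25):
    """Probability of guessing every one of n_items correctly by chance alone."""
    return p_guess ** n_items


PICK_3_CHANCE = 1 / 1000   # assumed: one winning combination among 000-999

four_lucky_guesses = p_all_correct(4)
print(four_lucky_guesses)                  # 0.00390625, which rounds to 0.004
print(four_lucky_guesses / PICK_3_CHANCE)  # roughly 3.9, i.e., about four times higher
```

Raising the number of required lucky guesses shrinks the probability quickly; for example, `p_all_correct(8)` is about 0.000015.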
Learning From Past Mistakes and Criticisms
Periodically throughout history the use of tests has come under attack, and such at-
tacks sometimes limit the widespread application of test use in society. These move-
ments are often double edged; they highlight fair criticisms of the power that tests
sometimes wield in decision making but fail to replace the current system with one
that is more objective, accurate, and fair. This is the dilemma: Tests have risen to current
prominence because they provide more objective, accurate, and fair information
on which decisions can be made . . . but . . . because no test is perfect, errors can and
do occur in the decisions made. What critics often fail to mention is that a systematic
decision-making process using standardized tests most often results in fewer poor decisions
than a nonsystematic decision-making process based on "judgment," in which
the decision maker becomes the instrument (more on this in Chapter 7). Individuals
exercising judgment are just as susceptible to threats to reliability and validity as tests.
To prevent biased judgments, professional counselors receive substantial train-
ing in assessment. Professional counselors must understand the important concepts
that guide the development of assessment instruments in order to become informed
consumers. The future of assessment in counseling depends on professional coun-
selors being able to use assessments effectively to benefit students and clients, to base
their decisions on objective facts, and to replicate and justify those decisions on the
basis of scientific evidence, not subjective "feel." Professional counselors have a pro-
fessional duty and responsibility to know as much as they can about all facets of
counseling in order to best serve and advocate for students and clients.
ETHICS AND ASSESSMENT
Counseling, like many other professions, is guided both by laws and by ethical stan-
dards. Laws regulate who can perform what type of counseling, in which settings,
and with which clients. Additionally, in the area of assessment, myriad policies and
procedures regulate who can be or is assessed, under what circumstances, for what
reasons, and who is qualified to administer and interpret the assessments. However,
despite the controls that exist within the area of assessment, there is still tremendous
room for judgment on the part of the professional regarding these issues.
Responsibility for final decisions regarding conduct rests with counselors themselves
(Wickwire, 2002). In the absence of laws, policies, and procedures, ethical standards
are the basis for appropriate and professional behavior. Codes of ethics propose
guidelines for standards of professional behavior, and it is essential for professional
counselors to be familiar with and follow these standards in order to provide high-
quality, professional counseling services.
Both laws and ethical standards are based on generally accepted societal norms,
beliefs, customs, and values (Fischer & Sorenson, 1997) and exist for the good of so-
ciety. However, laws are more prescriptive, have been codified, and generally carry
penalties for failure to comply. Ethical standards are generally developed by profes-
sional associations to guide the behavior of a particular group of professionals.
According to Herlihy and Corey (1996), ethical standards serve three purposes: to
educate members about sound ethical behavior, to provide a mechanism for account-
ability, and to serve as a means for improving professional practice. They also serve
a fourth purpose — to educate, and therefore protect, the public about the standards
of behavior they can expect from a particular group of professionals. Associations pe-
riodically update their ethical codes to ensure continuing relevance and applicability
and involve stakeholders in the process. The enforcement of ethical standards is the
responsibility of the association, which is usually limited in what it can do to mem-
bers who fail to comply. It is the responsibility of each member to voluntarily com-
ply and behave ethically because it is the right thing to do, although sanctions for
noncompliance may occur.
Forester-Miller and Davis (1996) suggested that Kitchener's five moral princi-
ples are the cornerstone of the American Counseling Association's ethical standards.
The first is autonomy, which refers to clients' independence and right to make sound
and rational decisions on their own. Nonmaleficence is often referred to as "do no
harm"; professional counselors must avoid behaviors that place clients at risk or
could potentially cause harm. Beneficence involves contributing to the positive wel-
fare of clients and their growth. Justice means treating each client according to what
is best for that client, that is, fair treatment and consideration of each client. The last principle
is fidelity, which refers to honoring commitments and establishing an accepting
relationship in which the client can trust the professional counselor. These moral
principles are critically important in the field of assessment to ensure that clients re-
ceive professional and appropriate services that are in their best interest.
There are a number of codes of ethical standards, since different associations
and divisions within the counseling profession promulgate their own codes.
However, since all of the ethical standards are based on either the moral principles
previously discussed or similar common values, the similarities among the codes
are greater than the differences. These differences usually pertain to workplace set-
ting. The American Counseling Association's Code of Ethics (2005a) will be used
as the basis for the discussion that follows here. The Code of Ethics delineates the
responsibilities of professional counselors toward their clients, their colleagues, the
workplace, and themselves. It is divided into eight sections: The Counseling Relationship;
Confidentiality, Privileged Communication, and Privacy; Professional
Responsibility; Relationships With Other Professionals; Evaluation, Assessment,
and Interpretation; Supervision, Training, and Teaching; Research and Publication;
and Resolving Ethical Issues. Section E: Evaluation, Assessment, and Interpretation
is reviewed below.
Section E: Evaluation, Assessment, and Interpretation covers standards related to
the assessment of clients, the counselor's skills, and appropriateness of assessment,
including: general appraisal issues, competence to use and interpret tests, informed
consent for appraisal, releasing information, proper diagnosis of mental disorders,
test selection, conditions of test administration, diversity in testing, test scoring
and interpretation, test security, obsolete tests and outdated test results, and test
construction.
Each subsection delineated below in italics is quoted from the ACA Code of
Ethics (2005a) and accompanied by commentary.
Section E: Evaluation, Assessment, and Interpretation
Introduction. Counselors use assessment instruments as one component of the coun-
seling process, taking into account the client personal and cultural context.
Counselors promote the well-being of individual clients or groups of clients by devel-
oping and using appropriate educational, psychological, and career assessment in-
struments.
E.1. General
E.1.a. Assessment. The primary purpose of educational, psychological, and career
assessment is to provide measurements that are valid and reliable in either comparative
or absolute terms. These include, but are not limited to, measurements of ability,
personality, interest, intelligence, achievement, and performance. Counselors recognize
the need to interpret the statements in this section as applying to both
quantitative and qualitative assessments.
E.1.b. Client Welfare. Counselors do not misuse assessment results and interpretations,
and they take reasonable steps to prevent others from misusing the information
these techniques provide. They respect the client's right to know the results, the inter-
pretations made, and the bases for counselors' conclusions and recommendations.
It is the responsibility of the professional counselor to use assessment techniques
and results appropriately and to ensure that others do as well. As mentioned in the
discussion of the moral principles underlying the ethical standards, professional
counselors must operate in the best interest of the client. Salvia and Ysseldyke (2004,
p. 58) go further and state that "those who assess . . . must accept responsibility for
the consequences of their work, and they must make every effort to make certain
their services are used appropriately." In so doing, professional counselors use instru-
ments that will yield reliable and valid scores so that decisions made using these in-
struments will benefit clients.
E.2. Competence to Use and Interpret Assessment Instruments
E.2.a. Limits of Competence. Counselors utilize only those testing and assessment
services for which they have been trained and are competent. Counselors using tech-
nology-assisted test interpretations are trained in the construct being measured and
the specific instrument being used prior to using its technology-based application.
Counselors take reasonable measures to ensure the proper use of psychological and ca-
reer assessment techniques by persons under their supervision. . . .
E.2.b. Appropriate Use. Counselors are responsible for the appropriate application,
scoring, interpretation, and use of assessment instruments relevant to the needs of the
client, whether they score and interpret such assessments themselves or use technology
or other services.
E.2.c. Decisions Based on Results. Counselors responsible for decisions involving in-
dividuals or policies that are based on assessment results have a thorough understand-
ing of educational, psychological, and career measurement, including validation cri-
teria, assessment research, and guidelines for assessment development and use.
Professional associations, employers, test publishers, and test users have put safe-
guards in place to ensure the qualifications of professionals using assessments. A
number of guidelines and resources have been developed to assist professional coun-
selors in this area, including the RUSTS statement (AACE, 2003a) and the
Standards for Educational and Psychological Testing (AERA et al., 1999). These guide-
lines and resources are discussed in other chapters of this book. However, the respon-
sibility for appropriate use and interpretation of assessments lies with the profes-
sional counselor. Professional counselors should conduct a thorough search to ensure
that the instrument or assessments selected are appropriate for the client, the in-
tended purpose, and the information needed (Wickwire, 2002). Additionally, the
professional counselor must be trained in the assessment procedure and qualified to
conduct the assessment. Often students take assessment classes in graduate school
but gain little additional training during their careers. Thorndike (1997) suggested
that the assessor withdraw from the process if insufficiently trained to provide the
quality of services and expertise required. Professional counselors have an obligation
to maintain or increase their expertise in the area of assessment if they are going to
conduct assessment activities.
Standard E.2 mandates that professional counselors receive periodic training
and retraining on assessments used. Just as important, simply knowing how to ad-
minister and score a test does not satisfy this requirement. Professional counselors
endeavor to know as much as possible about the construct or content under study,
including the test psychometrics, purposes for which the test has been validated, and
other research related to the test's use.
Professional counselors are highly trained and ensure that those under their su-
pervision are trained to use assessments for intended purposes. When supervisees or
employees under a counselor's supervision behave unethically, it is the supervising
professional counselor who bears responsibility for their misactions.
E.3. Informed Consent in Assessment
E.3.a. Explanation to Clients. Prior to assessment, counselors explain the nature
and purposes of assessment and the specific use of results by potential recipients. The
explanation will be given in the language of the client (or other legally authorized
person on behalf of the client), unless an explicit exception has been agreed upon in
advance. Counselors consider the client's personal or cultural context, the level of the
client's understanding of the results, and the impact of the results on the client. . . .
E.3.b. Recipients of Results. Counselors consider the examinee's welfare, explicit
understandings, and prior agreements in determining who receives the assessment
results. Counselors include accurate and appropriate interpretations with any release of
individual or group assessment results. . . .
E.4. Release of Data to Qualified Professionals
Counselors release assessment data in which the client is identified only with the con-
sent of the client or the client's legal representative. Such data are released only to per-
sons recognized by counselors as qualified to interpret the data. . . .
Informed consent implies that the person granting permission understands ex-
actly what assessments will be conducted, why the assessments are being conducted,
what will happen to the results, and who will be given the results. Confidentiality is
the cornerstone of counseling and is critical to the area of assessment, particularly
when the assessment concerns very personal questions or asks for sensitive informa-
tion. Frequently, permission to conduct assessments requires signed, informed con-
sent from either the client or, in the case of a minor child, the parent or legal
guardian. The legitimacy of informed consent rests upon three essential facets:
capacity, comprehension, and voluntariness. Capacity refers to the right one holds to
consent. For example, precious few circumstances exist that would allow a 9-year-old
boy the right to consent to anything. This is because in the United States, the par-
ent or legal guardian almost always holds this right. Likewise, someone who has
mental retardation or is mentally disabled may not have the ability to consent.
Comprehension means the consenter understands the implications of consent. If the
evaluator cannot communicate the purpose of the assessment in a language or terms
the client can understand, consent cannot be obtained. Voluntariness means the as-
sessment involves no coercion or duress. As with any ethically conducted research
study, a client has the right to withdraw from an assessment at any time.
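The three facets of informed consent described above can be read as a simple all-or-nothing check. The following is a minimal sketch in Python, offered only as an illustration. The names ConsentCheck, consent_is_legitimate, and missing_facets are hypothetical and do not come from the Code of Ethics or any published instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentCheck:
    """Hypothetical record of the three facets of informed consent."""
    capacity: bool        # the person holds the right to consent
    comprehension: bool   # the person understands the implications of consenting
    voluntariness: bool   # no coercion or duress is involved


def consent_is_legitimate(check: ConsentCheck) -> bool:
    """Consent is treated as legitimate only when all three facets hold."""
    return check.capacity and check.comprehension and check.voluntariness


def missing_facets(check: ConsentCheck) -> list[str]:
    """List the facets that still need to be addressed before assessing."""
    facets = {
        "capacity": check.capacity,
        "comprehension": check.comprehension,
        "voluntariness": check.voluntariness,
    }
    return [name for name, satisfied in facets.items() if not satisfied]
```

For the 9-year-old in the example above, capacity would be False, so a parent or legal guardian would have to provide consent. A checklist like this does not replace professional judgment. It only makes explicit that a failure on any one facet undermines the consent as a whole.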
The Family Educational Rights and Privacy Act of 1974 (FERPA) and subse-
quent amendments govern student records in schools and universities. FERPA man-
dates that only those persons with a legitimate educational interest have the right to
access a student's records, including assessment information, and that psychological
evaluations and some other assessments and surveys require signed, informed con-
sent. In school settings, it may be clearer who has a legitimate need to access a stu-
dent's assessment results, but there may also be more professionals involved due to
the number of support staff and teams operating within schools. Professional coun-
selors should ensure that the persons with whom assessment results are shared,
whether in the clinic or at school team meetings, have a legitimate need to know the
results and are fully capable of understanding the results. Professional counselors
must also safeguard the maintenance of assessment protocols and results. Under nor-
mal circumstances, protocols and raw interview data are released only with client
permission and only to professionals who can understand and use the information
to make decisions in the best interest of the client.
The same limits to confidentiality that exist within the counseling relationship
also exist within the assessment area unless informed consent is provided. The client
(or parent or guardian of a minor) always has the right to request in writing that in-
formation be shared. Professional counselors must be aware that assessment infor-
mation is subject to court orders and subpoenas and duty-to-warn situations. In ad-
dition, sharing information with third parties (e.g., insurance companies); allowing
clerks, secretaries, and other personnel to handle assessment information; and con-
sultation are all legitimate limitations to confidentiality.
Think About It 2.2 What makes confidentiality and informed consent
such important aspects of assessment?
E.5. Diagnosis of Mental Disorders
E.5.a. Proper Diagnosis. Counselors take special care to provide proper diagnosis of
mental disorders. Assessment techniques (including personal interview) used to deter-
mine client care (e.g., locus of treatment, type of treatment, or recommended follow-
up) are carefully selected and appropriately used.
E.5.b. Cultural Sensitivity. Counselors recognize that culture affects the manner in
which clients' problems are defined. Clients' socioeconomic and cultural experiences
are considered when diagnosing mental disorders. . . .
E.5.c. Historical and Social Prejudices in the Diagnosis of Pathology. Counselors
recognize historical and social prejudices in the misdiagnosis and pathologizing of
certain individuals and groups and the role of mental health professionals in perpet-
uating these prejudices through diagnosis and treatment.
E.5.d. Refraining from Diagnosis. Counselors may refrain from making and/or re-
porting a diagnosis if they believe it would cause harm to the client or others.
Standard 8.8 of the Standards for Educational and Psychological Testing (AERA et
al., 1999) advises that the least stigmatizing label should always be assigned when re-
porting test results. This does not mean that a less serious code is used, but rather the
diagnosis should be an appropriate one and described precisely. Contextual factors
(e.g., the client's cultural or socioeconomic experiences) must be considered when
diagnosing clients because of the significant impact diagnostic labels can have on a
client's life (Whiston, 2005). In some cases, the diagnostic code drives treatment pro-
tocols and/or payment for treatment. This factor presents a serious dilemma for
many practitioners, as the specified number of sessions for one diagnostic code may
be insufficient to adequately assist the client, while a different code would allow a
sufficient number of sessions. Still, the Code of Ethics requires that professional coun-
selors use the proper diagnosis. A great deal of research is currently under way ex-
ploring the congruence of diagnoses across diverse populations. For example, the
context of living in a low-socioeconomic inner-city neighborhood may elevate the
number of criteria for Conduct Disorder the average adolescent male may meet. But
if these behaviors have become "normative" due to context, is it equitable that the
diagnosis of Conduct Disorder is made at a substantially increased rate for these
inner-city youth? Or should a more culture-normative, context-sensitive process be
pursued? This question is becoming critically important and will likely receive
tremendous attention in the coming years.
E.6. Instrument Selection
E.6.a. Appropriateness of Instruments. Counselors carefully consider the validity,
reliability, psychometric limitations, and appropriateness of instruments when select-
ing assessments.
E.6.b. Referral Information. If a client is referred to a third party for assessment,
the counselor provides specific referral questions and sufficient objective data about
the client to ensure that appropriate assessment instruments are utilized. . . .
E.6.c. Culturally Diverse Populations. Counselors are cautious when selecting as-
sessments for culturally diverse populations to avoid the use of instruments that lack
appropriate psychometric properties for the client population. . . .
Professional counselors should choose assessments that are the most appropriate
for the targeted purpose of the assessment and for the clients they are assessing
(Anastasi & Urbina, 1997). Doing so may involve a thorough search and evaluation
of potential assessment instruments. According to Wickwire (2002), this step is es-
sential, as the "professional is seeking an appropriate and workable fit, with the high-
est quality and greatest benefit" (p. 8). The implication of "fit" for clients from di-
verse populations is particularly important. Professional counselors must explore
each instrument's psychometric properties and ensure its appropriateness and use-
fulness for clients from diverse cultures.
E.7. Conditions of Assessment Administration . . .
E.7.a. Administration Conditions. Counselors administer assessments under the
same conditions that were established in their standardization. When assessments are
not administered under standard conditions, as may be necessary to accommodate
clients with disabilities, or when unusual behavior or irregularities occur during the
administration, those conditions are noted in interpretation, and the results may be
designated as invalid or of questionable validity.
E.7.b. Technological Administration. Counselors ensure that administration
programs function properly and provide clients with accurate results when technological
or other electronic methods are used for assessment administration.
E.7.c. Unsupervised Assessments. Unless the assessment instrument is designed, in-
tended, and validated for self-administration and/or scoring, counselors do not per-
mit inadequately supervised use.
E.7.d. Disclosure of Favorable Conditions. Prior to test administration of assess-
ments, conditions that produce most favorable assessment results are made known to
the examinee.
The previous discussion has concerned the need for care in the selection of as-
sessment tools. Equal care must be taken with the use of these tools and the ad-
ministration of all assessments in order to achieve the optimal result. Changing the
way in which assessments are given or the conditions under which they are given
may negate the usefulness and validity of the results. Professional counselors must
be sensitive to conditions that may affect assessment performance (Anastasi &
Urbina, 1997). This awareness is particularly important when some clients are ad-
vantaged by having access to experiences or information about how to perform bet-
ter on a test — sometimes referred to as test sophistication. Certainly an individual
who takes a standardized test and has had multiple exposures to sample test ques-
tions and the "bubble" response format (i.e., penciling in answers on a machine-
scored form) will have advantages over someone who doesn't know what to expect
or how to respond appropriately ahead of time. Professional counselors seek to
"level the playing field" by ensuring that all students have requisite information
and skills.
E.8. Multicultural Issues/Diversity in Assessment
Counselors use with caution assessment techniques that were normed on populations
other than that of the client. Counselors recognize the effects of age, color, culture,
disability, ethnic group, gender, race, language preference, religion, spirituality, sex-
ual orientation, and socioeconomic status on test administration and interpretation,
and place test results in proper perspective with other relevant factors. . . .
According to recent projections, the population of the United States will
approach 50% non-White by the year 2050. Communities and schools are becoming
increasingly diverse. In some schools, the number of different languages spoken
exceeds 150. This increasing diversity poses serious concerns for assessment if
professional counselors are to behave ethically. Diversity concerns are discussed in depth
later in this chapter. For now, it is important to understand that it is the burden of
test authors to demonstrate that the test scores are not affected by diverse examinee
characteristics. In the absence of a declarative statement by test authors in this re-
gard, the examiner should assume that cultural differences may exist and approach
use of the test with culturally diverse clients with caution.
E.9. Scoring and Interpretation of Assessments
E.9.a. Reporting. In reporting assessment results, counselors indicate reservations
that exist regarding validity or reliability due to the circumstances of the assessment
or the inappropriateness of the norms for the person tested.
E.9.b. Research Instruments. Counselors exercise caution when interpreting the re-
sults of research instruments not having sufficient technical data to support respon-
dent results. The specific purposes for the use of such instruments are stated explicitly
to the examinee.
E.9.c. Assessment Services. Counselors who provide assessment scoring and inter-
pretation services to support the assessment process confirm the validity of such in-
terpretations. They accurately describe the purpose, norms, validity, reliability, and
applications of the procedures and any special qualifications applicable to their use.
The public offering of an automated test interpretations service is considered a
professional-to-professional consultation. The formal responsibility of the consul-
tant is to the consultee, but the ultimate and overriding responsibility is to the
client. . . .
Professional counselors are ultimately responsible for the accuracy of the as-
sessment results and must make every effort to ensure that their services are used
appropriately (Salvia & Ysseldyke, 2004) and that the best interest of the client is
served. This is equally true when using computerized interpretive programs. While
information derived from an interpretive report is often accurate and helpful, pro-
fessional counselors realize that these interpretations are based on statistical mod-
els and that the software author has never met the client. Thus, as is always the
case, professional counselors validate and supplement all scores and interpretation
with additional information from multiple sources before making decisions that
affect clients' lives.
Also, while professional counselors strive to administer tests exactly as specified,
mistakes and outside interference do occur. Professional counselors document these
circumstances and consider them when interpreting test scores. If the circumstances
are serious enough to invalidate the test scores, professional counselors state such and
then do not use the invalid scores to describe client performance or make decisions affect-
ing a client's life. If the professional counselor has any reservations about the assess-
ment results, it is the responsibility of the counselor to communicate those reserva-
tions to the client and/or other appropriate parties, such as parents. The professional
counselor must ensure that accurate and appropriate interpretations accompany the
dissemination of any assessment results so that the recipients of the information are
clear as to what the results actually are.
E.10. Assessment Security
Counselors maintain the integrity and security of tests and other assessment tech-
niques consistent with legal and contractual obligations. Counselors do not appro-
priate, reproduce, or modify published assessments or parts thereof without acknowl-
edgment and permission from the publisher.
E.11. Obsolete Assessments and Outdated Results
Counselors do not use data or results from assessments that are obsolete or outdated
for the current purpose. Counselors make every effort to prevent the misuse of obso-
lete measures and assessment data by others.
E.12. Assessment Construction
Counselors use established scientific procedures, relevant standards, and current pro-
fessional knowledge for assessment design in the development, publication, and uti-
lization of educational and psychological assessment techniques.
Professional counselors must preserve the integrity of the assessments and the
accompanying protocols. Testing materials should be stored in a locked facility to
prevent theft or misuse by unauthorized individuals. All published tests are
copyright protected and cannot be photocopied for use with clients. Tests are very
expensive to develop, norm, and print, and authors and publishers take on financial
risk to develop them. For those professional counselors who are
involved with the development of assessments, it is important to adhere to current
scientific standards and methodology. Among numerous sources, the RUST-3 state-
ment (AACE, 2003a) and the Standards for Educational and Psychological Testing
(AERA et al., 1999) are important to consult when developing tests.
If the assessment information is outdated, professional counselors must take care
with its use, as the validity and usefulness of the information may be questionable.
In brief, professional counselors should discontinue use of older versions of tests, and
cease using them to make client decisions. However, it is not always easy to make
this call. Previous versions of tests often have a rich research base and numerous stud-
ies exploring psychometric integrity. Also, it is, unfortunately, not unusual for new
norms and new test manuals to have errors. Thus it is often prudent to phase in use
of new instruments and to use the new instrument exclusively once its quality has
been established.
E.13. Forensic Evaluation: Evaluation for Legal Proceedings
E.13.a. Primary Obligations. When providing forensic evaluations, the primary
obligation of counselors is to produce objective findings that can be substantiated
based on information and techniques appropriate to the evaluation, which may
include examination of the individual and/or review of records. Counselors are entitled
to form professional opinions based on their professional knowledge and expertise that
can be supported by the data gathered in evaluations. Counselors will define the lim-
its of their reports or testimony, especially when an examination of the individual
has not been conducted.
E.13.b. Consent for Evaluation. Individuals being evaluated are informed in
writing that the relationship is for purposes of an evaluation and is not counseling in
nature, and entities or individuals who will receive the evaluation report are
identified. Written consent to be evaluated is obtained from those being evaluated unless a
court orders evaluations to be conducted without the written consent of individuals
being evaluated. When children or vulnerable adults are being evaluated, informed
written consent is obtained from a parent or guardian.
E.13.c. Client Evaluation Prohibited. Counselors do not evaluate individuals for
forensic purposes they currently counsel or individuals they have counseled in the past.
Counselors do not accept as counseling clients individuals they are evaluating or in-
dividuals they have evaluated in the past for forensic purposes.
E.13.d. Avoid Potentially Harmful Relationships. Counselors who provide foren-
sic evaluations avoid potentially harmful professional or personal relationships with
family members, romantic partners, and close friends of individuals they are evalu-
ating or have evaluated in the past.
Forensic evaluation and court testimony is a burgeoning specialty within coun-
seling, psychology, and psychiatry. The standard regarding avoidance of potentially
harmful relationships is a new addition to the 2005 Code of Ethics and seeks to make
sure that professional counselors understand the importance of making inferences
based on firsthand knowledge of the client, rather than speculation or generalities.
Professional counselors can expect much more attention to this area of study in the
future because of the increasing need of courts, lawyers, and those accused of crimes
to have mental health experts provide testimony regarding psychological status. Also,
this is another issue that psychological boards across the country are pursuing in
order to attempt to limit the scope of professional counselors' practice.
Source: Section E of the ACA Code of Ethics and Standards of Practice has been
reprinted with permission. No further reproduction is authorized without
written permission from the American Counseling Association.
Think About It 2.3 When assessments are conducted with clients and
students, it is essential that results be used correctly. What are some conse-
quences of inappropriate use? How could these problems be resolved?
While the ACA Code of Ethics is helpful in describing ethical test use, the reader
is again referred to the RUST-3 statement for a comprehensive and explanatory trea-
tise of responsible, professional test use. Assessment information, used in conjunc-
tion with other sources of information about the client, can be extremely useful in
working with clients. As can be seen from this discussion, it is critically important for
professional counselors to practice ethically in order to do no harm. But what should
a professional counselor do if unsure of the correct ethical course of action? For an-
swers, we now turn to a brief discussion of ethical decision making as applied to as-
sessment issues.
ETHICAL DECISION MAKING
One of the greatest professional challenges facing most counselors is ethical behav-
ior — that is, determining the ethically appropriate course of action in any situation.
Professional counselors must also be acutely aware of the behavior of their colleagues
and have a responsibility to act if a colleague is behaving in an unethical manner. To
assist professional counselors with these issues, the ACA's Ethics Committee devel-
oped the Practitioner's Guide to Ethical Decision Making (Forester-Miller & Davis,
1996), which delineates a seven-step model for working through ethical dilemmas:
1. Identify the problem. One should gather all relevant information and determine
whether the problem is an ethical issue or a legal, practice, or other issue. If it is
an ethical issue, continue with the process.
2. Apply the ACA Code of Ethics (2005a). Determine which section of the ACA
Code of Ethics addresses the issue most directly. The relevant section may outline
the course of action to follow. If the answer is not indicated, then one should
proceed to the next step of the model.
3. Determine the nature and dimensions of the dilemma. Forester-Miller and Davis
suggested that professional counselors should consider the moral principles that
underlie the Code of Ethics for direction, current research, and consultation to
determine an appropriate course of action.
4. Generate potential courses of action. Professional counselors should consult at least
one colleague to ensure that all potential courses of action are identified.
5. Consider the potential consequences of all options and determine a course of action.
The impact of potential consequences on the client, professional counselor, and
others should be considered in determining which option is optimal for address-
ing the dilemma.
6. Evaluate the selected course of action. Evaluate the selected course of action to en-
sure that implementing that choice will not create new or additional ethical
dilemmas.
7. Implement the course of action. The professional counselor should implement the
selected course and follow up to ensure that the selected action had the desired
outcome.
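The seven steps above form an ordered procedure with two decision points: step 1 screens out problems that are not ethical issues, and step 2 may already indicate the course of action. The following is a minimal sketch of that flow in Python, offered as an illustration only. The function name steps_to_work_through and its two parameters are hypothetical, and the shortcut from step 2 directly to implementation is an assumption made for the sketch, not something the Practitioner's Guide states in these terms.

```python
STEPS = (
    "Identify the problem",
    "Apply the ACA Code of Ethics",
    "Determine the nature and dimensions of the dilemma",
    "Generate potential courses of action",
    "Consider the potential consequences of all options and determine a course of action",
    "Evaluate the selected course of action",
    "Implement the course of action",
)


def steps_to_work_through(is_ethical_issue: bool, code_indicates_action: bool) -> list[str]:
    """Return the steps a counselor would work through, in order.

    Assumption for illustration: when the Code itself outlines the course
    of action (step 2), the counselor moves straight to implementing it.
    """
    if not is_ethical_issue:
        # Step 1 screens out legal, practice, or other non-ethical issues.
        return [STEPS[0]]
    if code_indicates_action:
        return [STEPS[0], STEPS[1], STEPS[6]]
    # Otherwise the full seven-step model applies.
    return list(STEPS)
```

Calling steps_to_work_through(True, False) returns all seven steps, which corresponds to a dilemma the Code does not resolve directly.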
The following scenario highlights the use of the ethical decision-making model
in practice for an assessment-related issue. The Student Services Team (SST) at
Happy Days Middle School meets once a month to discuss students who are expe-
riencing problems that are interfering with their ability to be successful academically
or socially in school. Ms. Jones is a licensed professional counselor who works in the
school-based mental health center and routinely attends the SST meetings as a team
member. A student new to the school who was experiencing both academic and so-
cial difficulty was referred for assessment. At the meeting the next month, the results
of the student's assessment were presented and discussed. Ms. Jones reviewed the as-
sessment results and had a number of concerns. In particular, she questioned
whether the assessments used were appropriate for the student, wanted to know why
an older version of the WISC had been used, and also questioned whether the per-
son administering the assessments (the learning disabilities teacher) was qualified to
do so. When she tried to raise these issues, the SST members ignored her concerns
and agreed to change the student's program based on the assessment results and an-
ecdotal information.
Ms. Jones believed that this situation was an ethical dilemma and therefore used
the ethical decision-making model. She first identified the problem and then applied
the ACA Code of Ethics. In this case, she identified three problems and the applica-
ble sections of the Code of Ethics: the use of obsolete and inappropriate assessment in-
struments (E.6.a), the competence of the person administering the assessments
(E.2.a), and the use of the assessment results in placement (E.2.b). To determine the
nature and dimensions of the issue, she went to her supervisor to discuss her con-
cerns. Since she is not employed by the school system, she wanted to make sure that
she was considering all facets of the situation and recognized that perhaps there were
processes in the schools she did not understand.
Ms. Jones concluded that the problems she had identified were ethical dilemmas
in this case and suspected that they might also exist in other cases as well. She then
determined possible courses of action. Ms. Jones's supervisor identified a supervisor
in the school system with whom Ms. Jones could discuss her concerns. Ms. Jones
also thought about going back to the team and discussing her concerns again, and
also talking with the person who performed the assessment to determine why these
particular assessments were used and what credentials the assessor held. After consid-
ering all options and their potential consequences, Ms. Jones chose to speak to the
assessor. She felt this was particularly important since the Code of Ethics also indi-
cates that if one is concerned about the ethical behavior of a colleague, the first step
is to discuss the concern directly with the colleague, even one who is not a counselor
bound to uphold the ACA Code of Ethics.
Through Ms. Jones's discussion with the assessor, it became clear to her that the
assessor lacked the experience and training to conduct an assessment using current
tools and that the school system had not purchased current versions of assessments
and had not provided appropriate professional development for the staff. Ms. Jones
then went to the school system supervisor to discuss her concerns. As a result of this
discussion, the school system recognized the need to change some of its practices,
and the assessments for the student in question were redone by a qualified assessor
using current tests.
As Ms. Jones discovered, professional counselors must continually review their
behavior and that of their colleagues to ensure that the best interests of the client al-
ways come first, that their practice reflects current best practices, that they use and/or
interpret only those assessments for which they are trained, and that the assessments
chosen are appropriate for the client and the intended purpose.
LEGAL ISSUES IN ASSESSMENT
While ethical issues in assessment are important, professional counselors must be
even more aware of important legal rulings. Ethical codes represent high standards
of professional practice; however, laws must be followed, even if they conflict with
ethical standards. Both federal and state legislatures enact legislation that impacts the
way professional counselors must practice. Local boards of education, state and local
agencies, and other organizations also implement regulations and policies that im-
pact counseling practice. While not the same as laws, regulations and policies gov-
ern the practices of the professionals to whom they pertain. For example, a licensed
professional counselor (LPC) who violates a state regulation can be cited or even
sanctioned. A professional school counselor who violates a school board policy can
be reprimanded or even terminated for cause. These steps can be taken because pro-
fessionals who are licensed, certified, or employed are frequently required to abide by
such regulations as a condition of licensure, certification, or employment.
While the purpose of laws is not specifically to direct or limit assessment, they
have been enacted to protect the rights of clients, students, parents, and employees,
and therefore influence how assessment may or must be conducted. Case law is the
result of litigation or court cases and often does direct how professional counselors
must practice. Professional counselors need to keep current with legislation and
court cases, as this is an ever-changing area. Some of the major legal issues affecting
assessment are reviewed in the rest of this section.
The Family Educational Rights and Privacy Act of 1974 (FERPA)
and Related Legislation
Prior to the 1970s, educators and researchers frequently conducted assessments with-
out parental consent and often stored these assessments in student files. In addition,
access to student files was virtually unlimited; a simple request to the principal was
often enough to get access to a student's files by entities, professionals, and employ-
ers outside of a school system. The Family Educational Rights and Privacy Act of
1974 (FERPA) is the federal law that protects the privacy of all student records in
schools and institutions of higher learning. Often referred to as the Buckley
Amendment, this law has several provisions and applies to all pre-K-12 and postsec-
ondary institutions that receive federal funding from the U.S. Department of
Education for any program. Nonpublic schools that do not accept federal funding
are exempt from these regulations.
FERPA defines education records as all information a school collects for atten-
dance, achievement, group and individual testing and assessment, behavior, and
school activities. FERPA gives parents specific rights regarding this information. The
first provision is that parents have the right to inspect and review their children's
records. Each school system must annually send a notice to parents detailing this re-
view process and the procedure for filing a complaint if they disagree with anything
in the record. The school system has 45 days in which to comply with the parents'
request to review the record and faces penalties, including the loss of all applicable
federal funding, for failure to comply. Second, the law limits who may access records.
Under FERPA, only those persons with a "legitimate educational interest" can ac-
cess a student's record. Some personally identifiable information may be released
without parental consent. This information is usually referred to as directory informa-
tion, or public information, and generally includes such material as the student's
name, address, telephone number, date and place of birth, honors and awards, and
attendance records. The major exemption to the confidentiality of student records
relates to law enforcement issues. The school must comply with a judicial order or
lawfully executed subpoena. In cases of emergency, information about the student
relevant to the emergency can be released without parental consent (see
www.ed.gov/print/policy/gen/guid/fpco/ferpa/index.html for details). All states and local
jurisdictions have incorporated FERPA's requirements into state statutes and local
policies with some degree of variance among specifics, such as directory information.
The rights of consent transfer to students upon their 18th birthday. The law
does not specifically limit the rights of parents whose children are over the age of 18
and continue to attend a secondary school (i.e., high school). The law also does not
specifically limit parental rights for a student who attends a postsecondary institu-
tion but is older than 18 years, although most institutions of higher learning adhere
to a policy of informed consent for a student who is 18 years or older. Noncustodial
parents have the same rights as custodial parents, unless a court order has limited or
terminated the rights of one or both parents. Stepparents and other family members
who do not have legal custody of the child have no rights under FERPA without
court-appointed authority.
The Protection of Pupil Rights Amendment of 1978 (PPRA), often referred to
as the Hatch Amendment or the Grassley Amendment, for the members of Congress
who introduced it, gives parents additional rights with regard to surveying minor
students. PPRA does not apply to postsecondary schools. If the survey is funded with
federal money, informed consent must be obtained for all participating students if
students are required to take the survey and if questions about particular personal
areas are asked. PPRA also requires informed parent consent for any psychological,
psychiatric, or medical examination, testing, or treatment of students or any school
program designed to affect the personal values or behavior of students. PPRA also
gives parents the right to review instructional materials in experimental programs.
The No Child Left Behind Act of 2001 includes several changes to FERPA and
PPRA (see www.ed.gov/about/offices/list/index.html for specific details). The
changes apply to surveys funded in whole or part by any program administered by
the U.S. Department of Education (USDE). PPRA (20 U.S.C. 1232h) requires that
schools and contractors make instructional materials available for review by parents
of participating students if those materials will be used in any USDE-funded survey,
analysis, or evaluation and that schools and contractors obtain written parent con-
sent prior to the participation of minor children in any USDE-funded survey, analy-
sis, or evaluation if information in any of the following areas would be revealed:
■ Political affiliations or beliefs of the student or parent
■ Mental and psychological problems of the student or family
■ Sex behavior or attitudes
■ Illegal, antisocial, self-incriminating, or demeaning behavior
■ Critical appraisals of other individuals with whom respondents have close fam-
ily relationships
■ Legally recognized privileged or analogous relationships, such as those of lawyers,
physicians, and ministers
■ Religious practices, affiliations, or beliefs of the student or the student's parent
■ Income other than such information required to determine eligibility/participa-
tion in a program
These new provisions of PPRA also apply to any survey that is not funded in any
way with USDE money. Under these provisions, parents have the right to inspect,
upon request, any survey or instructional materials used as part of the curriculum cre-
ated by a third party if one or more of the eight above-outlined areas are involved.
Parents also have the right to inspect any instrument used to collect personal informa-
tion from students for marketing or selling. Parents may opt their child out of this
data collection process or any survey involving one or more of the eight above-
delineated areas. PPRA does not apply to any survey that is administered as part of the
Individuals with Disabilities Education Improvement Act of 2004 (IDEIA).
As can be ascertained from the explanation of FERPA and PPRA, there are
many constraints on assessment, testing, and surveys in public schools. As each
school district may further define policies involving this legislation, it is critical for
professional counselors to become familiar with what types of assessments fall under
these regulations, how the assessment results may be used or disseminated, and to
whom. It has become increasingly difficult for professional counselors to give any
type of formal or informal assessment to students without informed parent consent.
And other school mental health professionals, such as school psychologists, may have
even more restrictions placed on their ability to conduct any form of assessment
without signed, informed parent consent. One assessment issue that is becoming
more problematic in schools concerns the desire of parents to review the actual pro-
tocol used after their child has completed the assessment. The problem revolves
around the issue of whether the actual assessment forms become part of the educa-
tional record or just the results. Most professional associations believe that the actual
protocol is not part of the record and that parents usually lack the training to com-
pletely understand the assessment tools.
FERPA, PPRA, NCLB, and related legislation all have provisions aimed at pro-
tecting school-aged children and their parents from the collection of information
that violates students' privacy. Additional provisions have also
been put in place to protect the rights of handicapped citizens; these provisions are
discussed in the section on IDEIA later in this chapter.
Minimal Competency Assessment and
the No Child Left Behind Act of 2001
"High-stakes" testing has been used in education for years, starting with the initial
premise that all students should master the basics of a curriculum before being
granted a diploma. Such a premise has tremendous support among adults in the
United States, but establishing minimal competency for graduation has a controver-
sial sociopolitical dimension.
In the 1970s, many states began to develop minimal competency tests as a re-
quirement for graduation. The Debra P. v. Turlington (1979) case challenged the
Florida State Assessment Test in court. Lawyers for 10 African American
students who had been denied diplomas on the basis of their failure to pass the state
assessment examination argued that the test was discriminatory because the students
had been educated in a segregated system and had not acquired the skills that would
have allowed them to pass the test. The judge ruled that the test was not discrimina-
tory but did suspend its use for four years and directed that the school must show the
assessment covered only information taught. While the intent behind minimal com-
petency assessment was noble, educators and legislators soon realized that such a sys-
tem revolved around low expectations rather than a striving for higher standards.
The discussions of higher-standards-based education led to implementation of
the No Child Left Behind Act of 2001 and its requirements for high-stakes testing
and accountability. A high-stakes test is any test that results in a decision about a stu-
dent or school that can change a student's or school's status (e.g., graduation from
high school; admittance into a college; and a school that comes under State
Department of Education oversight for poor performance). Almost all states now re-
quire students to pass tests as part of high school graduation requirements. In addi-
tion, students are assessed at identified grade levels from 3rd grade through high
school to meet the requirements of the No Child Left Behind Act. Both students
and schools are feeling the increased pressure to perform well on the assessments, lest
the school fail to make adequate yearly progress for five years in a row and risk being
reconstituted (i.e., being put under external control, leading to the possible replace-
ment of administration, staff, curriculum, etc.). Many laud the intent of ensuring
that all children learn and achieve to high academic levels. However, many educators
are also concerned that the focus on assessment competes with the focus on learning.
Numerous professional organizations have weighed in on the high-stakes testing
issue. The American Counseling Association (ACA) appointed a Task Force on
High-Stakes Testing in 2003, and some of the areas considered by this task force are
particularly noteworthy. In a position statement adopted by the ACA Governing
Council (ACA, 2005b), the task force recognized the importance of assessment and
accountability and its relationship to high achievement. (This position statement
may be found on the companion website for this text, in the chapter on high-stakes
testing.) High-stakes testing (HST) is one objective means of assessing student per-
formance, and HST assessments are generally well developed. However, the task
force specified some important cautions. Using a single test score resulting from a
group administration of the test to make decisions about individual students has in-
herent problems; many students are at a disadvantage on HST, and the results may
not accurately reflect their abilities. The task force points out that special education
law does not allow decisions to be made about children based on a single test, but the
accountability provisions of HST do allow this type of decision making. While ac-
countability remains a major requirement for schools and school systems, it must be
balanced with providing assessment tools for students that truly assess what they
should know in a way that maximizes student performance and reflects best prac-
tices in assessment.
Individuals With Disabilities Education Improvement Act
of 2004 (IDEIA) and Related Legislation
The Education for All Handicapped Children Act, also known as PL 94-142, was
initially enacted in 1975 after a long struggle to equalize the opportunities for dis-
abled students and to provide opportunities similar to those of their nonhandi-
capped peers through a free, appropriate education in the least restrictive environment.
This special education law has been reauthorized several times since its enact-
ment, renamed the Individuals With Disabilities Education Act (IDEA) in 1990,
and most recently signed by President Bush on December 3, 2004 as the Individuals
With Disabilities Education Improvement Act (IDEIA). The bill outlines the
process for referring, assessing, identifying, placing, and instructing students with
handicapping conditions who warrant additional services under the law. The law re-
quires that all decisions be made by a multidisciplinary team that includes the par-
ents, special educator, regular educator, school system representative, and frequently
the professional school counselor and school psychologist. Parental consent is re-
quired for assessment and placement activities. The multidisciplinary team makes all
placement and educational decisions; each eligible child is required to have an
Individual Education Plan (IEP), which outlines the goals for the child and the serv-
ices that will be provided.
Part B, Section 614 (2) (3) of IDEIA outlines the requirements for conducting
the evaluation to determine if a child has a handicap. It states that the local educa-
tion agency (i.e., school system) shall
■ use a variety of assessment tools and strategies to gather relevant functional, de-
velopmental, and academic information, including information provided by the
parent, that may assist in determining if the child is a child with a disability and
the content of the IEP;
■ not use any single measure or assessment as the sole criterion for determining
whether a child is a child with a disability or determining an appropriate educa-
tional program for the child;
■ use technically sound instruments that may assess the relative contribution
of cognitive and behavioral factors, in addition to physical or developmental fac-
tors; and
■ ensure that assessments and other evaluation materials used to assess the child
  ■ are selected and administered so as not to be discriminatory on a racial or
    cultural basis;
  ■ are provided and administered in the language and form most likely to yield
    accurate information;
  ■ are used for purposes for which the assessments or measures are valid and
    reliable;
  ■ are administered by trained and knowledgeable personnel; and
  ■ are administered in accordance with any instructions provided by the pro-
    ducer of such assessments.
The above language clearly delineates requirements that are actually best prac-
tices in assessment and which are discussed earlier in this chapter and in other chap-
ters of this book. This reauthorization of the law strengthened the development of
new approaches to determine whether students are learning disabled that are not
based solely on the IQ discrepancy model (see Chapter 12, Table 12.1). Additionally,
the law focuses on addressing the problem of the over- and misidentification of lin-
guistic and cultural minority students and directs districts with significant
overrepresentation of minorities to create and operate programs to reduce this problem (see
www.cec.sped.org/law_res/doc/law/index.php for further details).
The Health Insurance Portability and Accountability Act
of 1996 (HIPAA)
Privacy issues of the general citizenry regarding medical and mental health fields are
of critical importance. The rise of managed care, frequent switching of health insur-
ance plans by employers, and the sensitive nature of questions frequently asked by
these entities often lead to privacy concerns. The Health Insurance Portability and
Accountability Act of 1996 (HIPAA) required that the U.S. Department of Health
and Human Services (HHS) adopt national standards for the privacy of individually
identifiable health information, outlined patients' rights, and established criteria for
access to health records. Included in this law was a provision that HHS must adopt
national standards for electronic healthcare transactions. In response to this man-
date, regulations named the Privacy Rule were adopted in 2000 and became effec-
tive in 2001. This rule set national standards for the protection of health informa-
tion as it applied to health plans, healthcare clearinghouses, and healthcare providers
who conduct transactions electronically. All covered entities had until April 14,
2003, to comply with the Privacy Rule (see http://www.hhs.gov/ocr/hipaa for fur-
ther details).
The HIPAA Privacy Rule has a number of provisions, including giving patients
the right to obtain and examine a copy of their health records and request correc-
tions, allowing patients some ability to control the uses and disclosures of their
health information, allowing patients to know how their information might be used
and if disclosures have been made, setting limits on the use and release of health
records, and providing a complaint process. The Privacy Rule also requires that
providers give clients a privacy notice and seek a signed acknowledgement
of this notice.
States and health entities continue to work on the details of the implementation
of HIPAA. Clearly, it has implications for professional counselors, particularly those
who work in health settings, clinics, agencies, and private practice. Professional
counselors must be aware of this law and its requirements and ensure that their prac-
tices are in accordance with its provisions. Importantly, the laws apply whether the
client is a self-payer or the professional counselor receives payment through insur-
ance companies or health organizations. Professional counselors should also be sure
to adhere to HIPAA provisions when client information is shared.
HIPAA protects health information much the same way FERPA protects stu-
dent records and information. While the USDE has indicated that FERPA will con-
tinue to regulate student information in schools, the schools are finding that HIPAA
has complicated the process. Schools frequently depend on assessments, particularly
for handicapped students, conducted by nonschool providers who are regulated by
HIPAA. In past years the assessments and health information would routinely be-
come part of the child's educational record. Schools are now finding that
documents are often stamped with "do not redisclose" or other indications that informa-
tion should not be made a permanent part of the educational record of the child and
must be returned to the assessor if the child leaves the school. As healthcare providers
and patients become more aware of the requirements of HIPAA, these issues will
likely be resolved.
It should be noted that the mandates of HIPAA are consistent with ethical stan-
dards and therefore should not be a barrier to sound professional practice. Signed,
informed consent; limits to disclosure; and the confidentiality of patient informa-
tion are all part of the ethical standards and should drive the practice of professional
counselors.
Guidelines of the Equal Employment Opportunity Commission (EEOC)
According to Kaplan and Saccuzzo (2001), the government exercises its power to
regulate testing largely through interpretations of the 14th Amendment to the
Constitution, which guarantees all citizens due process and equal protection under
the law. This is evidenced by the government's actions concerning personnel prac-
tices, particularly employee testing. Title VII of the Civil Rights Act of 1964 and its
subsequent amendments created the Equal Employment Opportunity Commission
(EEOC), whose guidelines outlaw discrimination in employment based on race,
color, gender, national origin, religion, pregnancy, age 40 and above, or sta-
tus as a Vietnam veteran.
The EEOC developed guidelines for the use of tests and assessments in employ-
ment practices. The commission was particularly interested in any procedures that
might have an adverse impact on selection and worked to ensure that tests and as-
sessments were not used to discriminate based on race. It ruled that any assessment
used as a basis for employment decisions that adversely affected hiring, promotion,
transfer, or any other activity protected by the law constituted discrimination unless
the test was validated for the reason it was being used and the person handling the
personnel matter could not use other procedures (Drummond, 2000).
Following the Civil Rights Act, a number of U.S. Supreme Court cases chal-
lenged the concept of adverse impact and refined employment practices. The first
landmark case was Griggs v. Duke Power Company (1971). The case involved several
African American employees of the power company who sued because they felt the
criteria used for promotion (a high school diploma and two tests) were discrimina-
tory. In this case, and in the subsequent cases of Albemarle Paper Company v. Moody
(1975) and Washington v. Davis (1976), the U.S. Supreme Court's decisions placed
the burden of proof on the employer. The decisions indicated that employment tests
must be valid and reliable, and forced the employers to define how job performance
relates to test scores (Kaplan & Saccuzzo, 2001).
The 1988 U.S. Supreme Court decision in Watson v. Fort Worth Bank and
Trust involved an African American woman who was passed over for pro-
motion for a supervisory position at the bank. She argued that racial minorities were
underrepresented in selections for higher-level jobs. The court ruled that by adding
one subjective item to objective tests, employers could protect themselves from dis-
crimination suits as adverse impact does not apply to subjective criteria. This ruling
was followed by Wards Cove Packing Company v. Antonio in 1989. This case was filed
by cannery workers at an Alaskan packing company who claimed that the company
was keeping them out of higher-paying and more skilled jobs. The U.S. Supreme
Court reversed the appellate ruling and remanded the case to the lower court. In so doing,
the Court noted that the burden of proof should be shifted to the plaintiff to demonstrate
that there are problems with selection procedures. This ruling obviously favored em-
ployers as few employees have the resources and knowledge necessary to prove bias
in personnel practices.
As a result of these cases, Congress passed the Civil Rights Act of 1991, which
incorporated many of the principles of the Griggs v. Duke Power Company case. The
act placed the burden of proof back on the employer and outlawed differential cut-
off scores or score adjustments.
The Americans With Disabilities Act of 1990 (ADA)
Just prior to the enactment of PL 94-142 in 1975 to address the needs of school-
aged youth with educational handicaps, Congress passed the U.S. Rehabilitation Act
of 1973. Section 504 of this act contains important provisions for individuals with
medical or mental disorders (see Chapter 12 for a fuller discussion of this act). Some
of the implications of the U.S. Rehabilitation Act of 1973 and related legislation in-
volved the requirements for access by handicapped citizens to ramps and elevators
in public buildings, as well as handicapped parking spaces and curbs cut to allow
wheelchair access. Important new laws in the 1990s built on these landmark
laws. The Americans With Disabilities Act (ADA) of 1990 was enacted to re-
move barriers for persons with physical and mental disabilities to employment, ed-
ucation, and public services. The law requires that reasonable accommodations must
be made for persons who are determined to be impaired, including accommodations
in testing and assessments. The law does not delineate what accommodations are re-
quired, so they must be determined on a case-by-case basis. There are tremendous
concerns in the assessment community regarding how to provide accommodations
and fairly assess individuals with disabilities without compromising the reliability
and validity of the assessment instruments. Murphy and Davidshofer (2001) sug-
gested that this issue will occupy test developers for years to come. Of course, the
implications of the ADA go far beyond assessment. But for now, realize that
Americans with disabilities are given full protection under the law, and that reason-
able accommodations must be offered.
Court Decisions Related to Diversity in Assessment
Tests have long been used to "sort" and "select." When certain groups are over- or
underselected for participation in programs, the specters of bias and fairness will cer-
tainly arise. There have been a number of court cases involving the use of testing in
education, decisions that have shaped amendments to special education law and
practice. The first major case that examined the validity of psychological test scores
was Hobson v. Hansen (1967). Students in the District of Columbia public schools,
which were integrated, were placed in classes based on the results of group ability
tests, which resulted in establishment of a de facto segregation or tracking system.
Hobson, the parent of two children, sued the school system, arguing that African
American students were tracked into the basic track while White students were
placed in the honors and other tracks. The federal district court found that ability
tests that had been developed on White students could not be used to place African
American students (Kaplan & Saccuzzo, 2001). Current test development standards
specify that students about whom decisions will be made must be well represented in
a test's standardization sample. The Hobson v. Hansen case brought this point home
very clearly. In fact, present-day test developers owe much of their "commonsense"
procedures to the issues resolved by early pioneers in test development and civil
rights cases.
The case of Diana v. State Board of Education concerned the use of intelligence
tests for bilingual Mexican American students. The plaintiffs argued that bilingual
Mexican American students were inappropriately placed in classes for the educable
mentally retarded (EMR) based on tests that failed to take into account their bilin-
gual status. The students retested in Spanish all scored too high to meet the EMR
criteria. An out-of-court agreement established that bilingual students would be
tested in both English and their native language; that placement in EMR classes
would be based on both test scores and a comprehensive developmental assessment
of the child; and that tests that emphasize areas that might be unfair to minority chil-
dren could not be used for placement. Again, today we see this issue as "common
sense," but in the 1960s and 1970s, almost all tests were published in English. Today,
tests are available to clients in a greater diversity of languages.
Chapter 10 explores the interaction of race and socioeconomic status on intel-
lectual development, an issue that was not widely studied in the 1960s and
1970s. The placement of African American students in EMR classes based on IQ
tests was at the heart of the California case Larry P. v. Riles (1979). The plaintiffs con-
tended that use of these intelligence tests was invalid for African American students
and that IQ tests should therefore not be employed for placement purposes. Many
testing experts testified at the trial, some in support of the validity of IQ tests for
African American and other children, others in opposition to the use of such tests.
The judge in the case ruled that the "tests are racially and culturally biased, have a
discriminatory impact on African American children, and have not been validated
for the purpose of (consigning) African American children into educationally dead-
end, isolated, and stigmatizing classes" (Kaplan & Saccuzzo, 2001, p. 580). This de-
cision was appealed but upheld in 1984. As a result, intelligence tests could not be
used to place African American students in special education classes. This ban was
expanded in 1986 to include testing all African American children for special edu-
cation in California but did not apply to other minority children. Enter the law of
unintended consequences. Because qualification for special education services re-
quired assessment of ability (i.e., intelligence), these laws virtually eliminated minor-
ity students from qualifying for services intended to help them. Some civil rights ad-
vocates viewed special education as a way to "segregate" minorities within the
educational system, but the alternative of failing children with disabilities was seen
as even more egregious. As a result of a subsequent case, Crawford v. Honig, the ban
on testing African American children was lifted in 1992. Legislation related to test
bias and discrimination against clients of diverse backgrounds has become a source
of hot debate within the assessment field, so the final pages of this chapter are dedi-
cated to an exploration of this essential foundational issue.
DIVERSITY ISSUES IN ASSESSMENT
For years, U.S. Census data have indicated that the United States is becoming in-
creasingly diverse. The U.S. population is multiracial, multiethnic, and multilingual.
Approximately 7% of this population reported in the 2000 Census having a disabil-
ity, and 18% reported living below the poverty level (U.S. Census Bureau, 2003).
These demographics demand that professional counselors be able to work effec-
tively with clients and students from a multitude of cultures (Constantine, 2001;
Lee, 2001). Professional counselors are involved in numerous ways in administering
and ensuring that clients receive appropriate assessment. This section discusses some
of the basic aspects of diversity in assessment and provides professional counselors
with practical steps to approaching fairness in assessment in the clinical and school
settings.
Understanding Diversity
Conversations about diversity often focus on race and ethnicity. Race is an anthropolog-
ical construct based on the classification of physiological characteristics (Gladding,
2001) and includes a political and socioeconomic dimension related to differences in
physical appearance (Brace, 1995; Yee, Fairchild, Weizmann, & Wyatt, 1993).
Ethnicity is the "group classification in which members believe they share a common
origin and a unique social and cultural heritage such as language or religious belief"
(Gladding, 2001, p. 45). While important, these two factors alone do not describe
the extent of diversity that professional counselors face.
Culture is another important diversity issue. Culture is both complex and mul-
tidimensional. Professional counselors may recognize several cultures and subcul-
tures within a population. Although this adds to the complexity of the construct,
understanding and appreciating culture and its multidimensionality gives profes-
sional counselors valuable insight into their clients' sense of self, language and com-
munication patterns, dress, values, beliefs, use of time and space, relationships with
family and significant others, food, play, work, and use of knowledge (Whitefield,
McGrath, & Coleman, 1992). Succinctly, culture can be described as the set of "val-
ues, beliefs, expectations, worldviews, symbols, and appropriate behaviors of a group
that provide its members with norms, plans, and rules for social living" (Gladding,
2001, p. 34).
Diversity also encompasses gender, sexual orientation, language, socioeconomic
status, ability, and disability. Diversity simply means difference: difference in the
many aspects and dimensions used to help understand student development and be-
havior. The professional counselor must therefore appreciate and understand diver-
sity in all of its manifestations and its implications for assessment.
For well over 40 years, the counseling profession has been deeply concerned
about appropriate assessment for clients and students of diverse populations
(Anastasi & Urbina, 1997; Sattler, 2001; Whiston, 2005). Some of this discussion
has resulted from legislation and legal proceedings regarding the specific areas of
multidisciplinary assessment, assessment in a client's native language, assessment
used for selection purposes, assessment procedures, informed consent, and due rights
notification (Rogers, 1998; Sattler, 2001). Ethical guidelines also address appropri-
ate assessment. Beyond global charges to respect diversity and work in the best inter-
est of students and clients, Section E of the ACA Code of Ethics (2005a) specifically
addresses diversity in testing. Further direction regarding diversity in assessment,
however, is delineated in the Association for Assessment in Counseling and
Education's Standards for Multicultural Assessment (AACE, 2003b).
Standards for Multicultural Assessment
Recognizing the importance of multicultural assessment, the Association for
Assessment in Counseling and Education (AACE) studied and compiled standards
of many professional organizations. The result was a document outlining 68 compe-
tencies specific to the assessment and counseling of diverse populations (see
http://aace.ncat.edu). The competencies cover assessment content and purpose;
norming, reliability, and validity beyond general standards issues; administration and
scoring; and interpretation and application of assessment results. Many of the com-
petencies have significant consequences for professional counselors, psychologists,
and other diagnosticians involved in psychological assessment and placement
processes. In addition, professional counselors should be aware of the competencies
because of their relationship to culturally appropriate counseling and assessment
services (AACE, 2003b).
Diversity Factors Involved in Assessment
Thus far we have outlined the mandate for professional counselors to be aware of
legal and ethical responsibilities regarding multicultural assessment. Following is a
more specific discussion of the ways in which diversity factors affect assessment.
Difference
Inherent in the concept of diversity is an understanding of difference. Difference
does not imply better than or worse than. Difference may, however, become ad-
vantage or disadvantage in the realm of assessment. Imagine that all mental health
professionals are asked to take an assessment on providing services to clients.
Professional counselors, psychologists, psychiatrists, family therapists, and social
workers (among others) gather to take the test. Clearly, each of these groups of pro-
fessionals differs in training, credentials, experience, and perhaps views of clients.
No group is better than the other. The groups are different. If the assessment is
based largely on the Council for Accreditation of Counseling and Related
Educational Programs (CACREP) (2001) curricular standards using the language
and orientations of professional counselors, professional school counselors may
have an advantage on the test. Their scores may be somewhat higher than those of
psychologists, psychiatrists, family therapists, or social workers. This simplistic ex-
ample demonstrates how cultural difference and test content can interplay. In more
subtle ways (e.g., test words, pictures, format), test content must be examined to
ensure that information specific to certain cultures is controlled in assessment
(Rogers, 1998).
Worldview
Worldview is a second factor involved in assessment. Every aspect of counseling oc-
curs in a cultural context. This includes assessment. As a result of cultural context,
assessment can be undergirded by cultural worldviews that are unique to a specific
culture and unfamiliar or offensive to another. Worldview includes beliefs, values,
perspectives, and perceptions (Whiston, 2005). The rather common practice of
timed assessment deserves consideration in respect to worldview. In America, speed
is often valued. Think of Americans' fascination with fast food, microwaveable prod-
ucts, instant messaging, and turbo-charged cars. In many other cultures, however,
speed is not valued. Reflection is considered sacred. Given this difference in world-
view, it is not hard to see why a 4th-grade student new to this country may not score
well on a timed multiplication test, even though the student may have mastered
multiplication facts.
Acculturation and Language
Acculturation and language are additional diversity factors involved in assessment.
Acculturation is a change process that occurs when an individual of one culture
comes in contact with an individual or individuals of another culture. As a result of
this process, individuals may take on different values, beliefs, and behaviors
(Drummond, 2004; Fouad & Chan, 1999). The degree and rate of change depend
on a number of factors, including power dynamics, issues of immersion, and indi-
vidual personality characteristics (e.g., cultural identity development status, genera-
tional status). Professional counselors often have the opportunity to work with
clients, students, and families dealing with various stages of acculturation. It is not
unusual for a professional school counselor to work with a student who, due to the
school setting and peer interactions, is bicultural yet lives in a family setting that
largely maintains the traditions and practices of the student's native culture. A cul-
turally competent counselor is prepared to recognize and effectively handle the coun-
seling implications these issues of acculturation may have for students and clients.
A growing number of Americans have the ability to communicate in more than
one language, but proficiency in the languages they speak may vary considerably
(Rogers, 1998). This growing phenomenon affects assessment in many interesting
and diverse ways. Language is more than words and pronunciation. Language in-
cludes structure, nuance, denotation, and connotation. These components preclude
the simple use of translations or unstandardized forms of a test (Fouad & Chan,
1999; Whiston, 2005). For example, it is inappropriate to have a staff member sim-
ply translate a test for Spanish-speaking clients. The process of ensuring that the
"translation" is equivalent to the original test involves sophisticated statistical and
content analyses that extend beyond the scope of this chapter. It is important to note
that it is equally inappropriate to assume that a Caribbean student who has just
moved to this country must take a test with no accommodation simply because the
child has always been educated in "English." Differences in sentence structure, word
meaning, idioms, and nuance may affect the student's ability to perform on the test.
Of course, there are times when mastery of the English language is the objective of
the test. In these cases, students and clients should be given the opportunity to
demonstrate their understanding of the language by being tested in the given lan-
guage. When English competence is not the issue, however, the language considera-
tions discussed must be examined. Although professional counselors are not often
involved in developing assessments, in their role as advocates, they must ensure that
issues of language are fully explored when assessing students and clients with limited
English proficiency, bilingual abilities, or multilingual capabilities.
Socioeconomic Status
Research suggested that socioeconomic status is a significant factor in assessment
(Flanagan, 1993). Herring (1997) suggested that social class is the most important
factor affecting the counseling process. Furthermore, there is a line of research de-
scribing the confounding issues of race, ethnicity, and social class (Fouad & Chan,
1999). Although social class cuts across all races and ethnicities, poverty dispropor-
tionately affects clients and families of color. When these findings are merged with
census data regarding poverty rates, it becomes evident that professional counselors
must be aware of issues of socioeconomic status and assessment. Socioeconomic sta-
tus is about much more than money. Social class may affect students' values, world-
views, emotional resources, and support systems (Payne, 2003).
Student and Client Factors
A host of student and client factors, including test-taking attitudes, experience and
capabilities, motivation, and social desirability, can affect assessment (Cohen &
Swerdlik, 1999; Drummond, 2000; Fouad & Chan, 1999). These factors, discussed
in much greater detail in Chapter 8, are unique to the individual and may change
from test situation to test situation. For example, it is conceivable that a teenager
may perform better on a multiple-choice test on social studies vocabulary than a
true-false test on the same vocabulary. The difference in performance may be due
only to the student's familiarity with the test format. Or clients with a visual impair-
ment may have their test performance greatly diminished by their Braille and key-
boarding skills rather than their knowledge of the material. Additionally, it is not dif-
ficult to imagine a situation in which clients give the answer they feel the professional
counselor wants, or the answer that is the most socially desirable. There is a strong
literature base that suggests social desirability is a significant issue in assessment for
many groups of clients (Marin & Marin, 1991). This factor may differentially affect
cultural groups.
Traditionally, professional counselors work with clients and students individu-
ally and in small groups on decreasing test anxiety and strengthening test-taking
strategies. Professional counselors should also provide direct student and client serv-
ice and work as advocates to address these and other factors affecting assessment.
Counselor and Examiner Factors
Counselor and examiner factors comprise a final category of diversity issues that
must be considered in assessment. Counselor and examiner factors include profes-
sional competence; comfort with the assessment process; perceptions and worldview;
race, ethnicity, and culture; and social influence. These important issues are ad-
dressed in the Standards for Multicultural Assessment (AACE, 2003b):
Culturally competent counselors have training and expertise in the use of traditional
assessment and testing instruments. They not only understand the technical aspects of
the instruments but also are aware of the cultural limitations. This allows them to
use test instruments for the welfare of clients from diverse cultural, racial, and ethnic
groups.
Selection of Assessment Instruments: Content and Purpose
Culturally competent counselors have knowledge about their social impact on others.
Interpretation and Application of Assessment Results
BIAS IN ASSESSMENT
Some or all of the factors discussed in the preceding section can result in assess-
ment bias. Standardization samples may also affect bias (Reynolds & Brown,
1984). According to Whiston (2005, p. 211), bias "refers to the degree that con-
struct-irrelevant factors systematically affect a group's performance." Construct-
irrelevant factors are those facets not related to the idea being assessed. An assess-
ment or test item is said to be biased when "empirical evidence shows that it is
more difficult for one group member than another, the general ability level of the
two groups is held constant, and no reasonable rationale exists to explain the group
difference on the same items" (Drummond, 2000, p. 356). Three types of bias —
content bias, internal structure bias, and predictive bias — have particular implica-
tions for diverse populations.
Content Bias
Content bias refers to test material being more familiar to one group than another.
Our earlier example involving professional counselors, psychologists, psychiatrists,
family therapists, and social workers provides a simplistic illustration of content bias.
Content bias is often less obvious when affecting multicultural populations, how-
ever. Content bias may involve hidden messages or values of a culture that are not
readily visible due to cultural encapsulation. Consider two well-documented items
from the Wechsler Intelligence Scale for Children — Revised (WISC-R) (Kaplan &
Saccuzzo, 2001; Sattler, 2001). One question asks, "What would you do if you were
sent to buy a loaf of bread, and the grocer said he did not have any more?" Another
question, which has been subject to much controversial investigation, asks, "What
should you do if a child smaller than you begins to fight with you?" Although re-
search findings differ, these questions appear to contain embedded cultural values,
behaviors, and norms that may not hold consistent over all multicultural groups
(Hardy, Welcher, Mellitis, & Kagan, 1976; Koh, Abbatiello, & McLoughlin, 1984;
Sandoval, Zimmerman, & Woo-Sam, 1983). Student responses to these questions
may not be a measure of intelligence, but rather a measure of cultural values, behav-
iors, and norms.
Internal Structure Bias
Scores on an assessment may be reliable for one group, but not reliable for another.
Or scores on an instrument may be more reliable for one group than another. This
phenomenon is called internal structure bias. Internal structure bias can be due to
norming factors or the underlying factor structure of an instrument. In light of this,
some assessment instruments report differences between groups of test takers. For
example, an assessment instrument may report differential reliability data based on
gender, age, or ethnicity.
Predictive Bias
A test can also be biased if it systematically over- or underpredicts a group's perform-
ance. This type of bias is called predictive bias. Many professional counselors are fa-
miliar with debate about the ability of standardized assessments like the Scholastic
Assessment Test-I (SAT-I) to predict students' performance in college (McCornack
& McLeod, 1988). "Gifted and talented" testing and success in special accelerated
educational programming embody another common area of concern regarding pre-
dictive bias. Generally, predictive bias is investigated along the lines of gender, race,
and ethnicity.
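One simple way to look for predictive bias, sketched here under simplifying assumptions, is to fit a single prediction line for all examinees and then check whether the prediction errors (residuals) average out differently by group. The data, group labels, and function names below are invented purely for illustration and do not come from any real test or study.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residual_by_group(x, y, groups):
    """Average (actual - predicted) per group, using one common line."""
    slope, intercept = fit_line(x, y)
    totals = {}
    for xi, yi, g in zip(x, y, groups):
        residual = yi - (intercept + slope * xi)
        running_sum, count = totals.get(g, (0.0, 0))
        totals[g] = (running_sum + residual, count + 1)
    return {g: s / c for g, (s, c) in totals.items()}


# Hypothetical test scores and later criterion performance (invented data).
test_scores = [400, 500, 600, 700, 400, 500, 600, 700]
outcome = [2.0, 2.5, 3.0, 3.5, 2.4, 2.9, 3.4, 3.9]
groups = ["A"] * 4 + ["B"] * 4

residuals = mean_residual_by_group(test_scores, outcome, groups)
for group, value in sorted(residuals.items()):
    print(f"group {group}: mean residual = {value:+.2f}")
```

In this made-up data set, a negative mean residual means the common line overpredicts that group's performance, and a positive mean residual means it underpredicts. Real investigations of predictive bias rely on much larger samples and formal statistical tests.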
Interpreting Test Scores With Caution
Some test manuals and texts, including this one, use the phrase "interpret with cau-
tion" to warn readers about possible problems with the interpretations of scores. So
what does the warning actually mean? In the context of this discussion on diversity,
it usually means that we don't know the consequences of interpreting the score for a
given individual with diverse characteristics. For example, some tests have norms
that undersampled participants from various cultural backgrounds. If test norms un-
dersampled African Americans, for instance, interpretations of an African American
client's score may result in some inaccuracies. Unfortunately, without extensive em-
pirical study, it is often extremely difficult to determine what the possible effects of
undersampling may be. Empirical studies often explore differences between partici-
pants with diverse characteristics and provide helpful conclusions about whether
scores generated by the test yield appropriate inferences about the examinee. While
it is best practice to use tests that will yield reliable and valid scores for the individ-
ual being tested, often such tests either do not exist or are suspect for individuals
with certain characteristics. So when you encounter the phrase "interpret with cau-
tion," it may have several different potential meanings, but the phrase always should
be taken into account when making decisions about the client's life.
Ensuring Fairness in Assessment
Test bias is a critical and alarming issue. Nonetheless, tests and other forms of assess-
ment do have an important role in educational and clinical settings. Sattler (2001)
suggested that good assessment offers an objective standard, reveals disparity, ap-
praises functioning, obtains appropriate programming, and evaluates programs. All
of these functions of assessment are significant to the professional counselor's work
with clients and students. How, then, can the professional counselor work to ensure
fairness in testing? The question is complex and multifaceted. The following sugges-
tions offer some initial strategies, interventions, and recommendations:
■ Remember that the professional counselor's primary responsibility is the welfare
of all clients. Ensure that the focus of any and all assessment is to benefit the
client.
■ Engage in professional development opportunities (e.g., continuing education
and training) to continue to learn about self, multicultural counseling, and diver-
sity in educational and clinical issues and settings.
■ Continually monitor and challenge personal belief systems and attitudes regard-
ing all aspects of diversity.
■ Demonstrate competence in multicultural counseling knowledge, skills, and be-
liefs. Employ culturally sensitive approaches when working with clients and
families.
■ Abide by the ACA Code of Ethics (2005a) and other pertinent standards, includ-
ing the Standards for Multicultural Assessment (AACE, 2003b) and the
Multicultural Counseling Competencies and Standards (Sue, Arredondo, &
McDavis, 1992).
■ Become familiar with assessment instruments and procedures for the given pop-
ulation. As appropriate, become fully competent in all aspects of administration,
interpretation, and application of assessment results.
■ Do not attempt to use assessment procedures outside of your scope of ex-
perience.
■ Refer students and clients for assessment as warranted.
■ Consult with other mental health professionals, including clinical psychologists,
school psychologists, and social workers, to become familiar with the ways they
use assessment to serve clients.
■ Test clients and students in the appropriate language. Use only translations with
established validity.
■ Use only valid and appropriate test adaptations and modifications. Do not assume
that counselor- or teacher-made changes are appropriate without first consulting
the test manual.
■ Consult with special educators, school psychologists, and other specialists to en-
sure that students receive appropriate test accommodations. Accommodations
may include changes in setting, scheduling, timing, presentation, or response
format (Spinelli, 2002).
■ Use multiple assessment methods to gain a more complete picture of a client or
student.
■ Clarify test purpose, procedure, and expectations to clients and students.
■ Provide individual and group counseling support for stress and anxiety related to
assessment as needed.
■ Provide individual and group counseling support for motivation and test preparation
as needed.
■ Actively advocate for continued research on culturally appropriate assessment
and counseling intervention for all clients and students.
SUMMARY/CONCLUSION
This chapter has discussed various historical, ethical, legal, and diversity issues in as-
sessment, and provided resources for understanding how best to use assessment re-
sults in clinical practice. However, because legislation and litigation are an ongoing
process, professional counselors must stay updated on current issues in assessment
and must also continuously assess their behavior to ensure that it meets the highest
ethical standards. Best practices in assessment are really ethical and legal practices.
This chapter has also highlighted and summarized key events in the evolution
of assessment, from its historic roots to its current ethical concerns. Knowledge of
such events and issues helps present-day professional counselors to understand the
context for today's concerns, both within the profession and in society at large.
Today, professional counselors are involved in a variety of ways in ensuring that
clients and students receive quality assessment. Legal and ethical standards mandate
that all clients receive assessment that is appropriate, unbiased, and meaningful. This
mandate challenges professional counselors to understand the implications of diver-
sity and assessment, and all that is involved in administering culturally competent
assessment and in interpreting results. With this charge in mind, assessment can
offer useful and important information for diverse client populations.
KEY TERMS
acculturation
achievement
aptitude
bias
career assessment
case law
clinical assessment
code of ethics
confidentiality
content bias
culture
diversity
Family Educational Rights and Privacy Act (FERPA)
Health Insurance Portability and Accountability Act (HIPAA)
high-stakes testing (HST)
Individual Education Plan (IEP)
Individuals With Disabilities Education Improvement Act (IDEIA)
informed consent
intelligence
internal structure bias
laws
multicultural assessment
No Child Left Behind Act (NCLB)
personality assessment
policy
predictive bias
Protection of Pupil Rights Amendment (PPRA)
regulation
socioeconomic status
vocational development
worldview
Chapter 3
Reliability
by Dimiter Dimitrov
Reliability of scores is a critical issue in measurement. This chapter reviews
basic principles in reliability, such as classical test theory and standard error
of measurement in classical test theory. It also discusses the types of reliabil-
ity commonly used by test developers, including internal consistency, test-retest, al-
ternate form, criterion-referenced, and interscorer reliability. Finally, the concepts of
attenuation and reliability of composite scores are discussed. Advanced concepts of
dependability and generalizability of scores are included on the companion website
for this text.
WHAT IS RELIABILITY?
Reliability means consistency. Measurements in the physical sciences can often be
conducted with great precision (e.g., millimeters, grams). However, measurements in
counseling, education, and related fields are not completely accurate and consis-
tent — and are sometimes far from it. There is always some error involved, usually
due to a person's conditions (e.g., mood, fatigue, momentary distraction) and/or ex-
ternal conditions (e.g., noise, temperature, light), that may randomly occur during
the measurement process. The way instruments of measurement (e.g., tests, inven-
tories, or raters) are designed or the way questions or items are phrased may also af-
fect the accuracy of the scores (observations).
For example, it is unlikely that the scores of a person on two different forms of
an anxiety test would be equal, because differently worded items often yield varying
results. Also, different scores are likely to be assigned to a person when different pro-
fessional counselors evaluate a specific attribute of the person (e.g., introversion,
sociability, self-esteem). In another scenario, if a group of people takes the same test
twice within a short period of time, one can expect the rank order of their scores on
the two test administrations to be somewhat similar, but not exactly the same. In
other words, one can expect a relatively high, yet not perfect, positive correlation of
test-retest scores for this group of examinees. As still another example, when it comes
to making placement decisions about clients, inconsistency may occur in different
criterion-referenced classifications (e.g., pass-fail group labels or mastery-nonmas-
tery group labels) based on measurements obtained through testing or subjective
judgments of raters (e.g., teachers, parents).
In measurement parlance, the higher the accuracy and consistency of measure-
ment scores, the higher the reliability. The reliability of scores indicates the degree
to which they are accurate, consistent, and repeatable when (a) different people con-
duct the measurement, (b) different instruments are used that purport to measure
the same trait (e.g., proficiency, ability, attitude, anxiety), and (c) there is incidental
variation in measurement conditions (e.g., lighting, seating, temperature). In other
words, reliable scores are produced by tests that are free from errors of measurement.
Reliability is a key indicator of quality measurements with tests, surveys, inventories,
or individuals (e.g., raters, judges, observers). Most important, reliability is a neces-
sary (albeit not sufficient) condition for the validity of measurements. Validity refers
to the meaningfulness, accuracy, and appropriateness of interpretations and decisions
based on measurement data. Thus if professional counselors cannot measure a client
characteristic consistently (reliability), they cannot make accurate interpretations
(validity).
It is important to note that reliability refers to the scores obtained with a test
and not to the instrument itself. Previous studies and recent editorial policies of pro-
fessional journals (e.g., Dimitrov, 2002; Sax, 1980; Thompson & Vacha-Haase,
2000) emphasize that it is more accurate to refer to "reliability of measurement data"
than to "reliability of tests" (e.g., items, questions, tasks). Tests cannot be accurate,
stable, or unstable, but observations (scores) can be (i.e., tests are neither reliable nor
valid, but scores on tests can be). Therefore, any reference to reliability of a test
should be interpreted to mean the reliability of scores derived from the test.
As is discussed in Chapter 4, the most important characteristic of any measure-
ment is its validity — that is, the degree to which scores lead to meaningful and ap-
propriate interpretations. To allow for such interpretations, however, the scores
should be accurate and consistent (i.e., reliable). The criterion-related validity of an
entrance examination, for example, is assessed by the correlation between the exam-
inees' scores on this test and their scores on a criterion (e.g., grade point average at
the end of the first academic year). However, under the classical model of reliability,
a criterion-related validity coefficient of test scores cannot exceed the square root of their
reliability. More simply put, the reliability of scores predetermines a "ceiling" for the
validity of a test's scores. How closely this ceiling will be approached depends on
other factors as well. But at this point it is essential to understand that reliability is a
necessary, but not sufficient, condition for validity. That is, high validity can occur
if test scores are highly reliable but cannot occur if test scores have low levels of reliability.
On the other hand, just because test scores are highly reliable does not mean
they will have high validity. For example, just because you can measure your height
consistently (high reliability) does not mean that height indicates intelligence (low
validity).
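The "ceiling" that reliability places on criterion-related validity can be illustrated with a short calculation. The sketch below simply takes the square root of a reliability coefficient, as the classical model states; the function name and the reliability values are hypothetical and chosen only for illustration.

```python
import math


def validity_ceiling(reliability: float) -> float:
    """Upper bound on a criterion-related validity coefficient
    under the classical model: the square root of the reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return math.sqrt(reliability)


# Hypothetical reliability coefficients (not taken from any real test).
for r_xx in (0.49, 0.64, 0.81):
    print(f"reliability = {r_xx:.2f} -> validity ceiling = {validity_ceiling(r_xx):.2f}")
```

The ceiling is only an upper limit. As the text notes, how closely an actual validity coefficient approaches it depends on other factors.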
THE CLASSICAL MODEL OF RELIABILITY
True Score
Scores on performance tests, personality inventories, expert evaluations, and even
physical measurements are not completely accurate, consistent, and repeatable. For
example, although the height of a person (i.e., one's "true height") remains constant
throughout repeated measurements within a short period of time (say, 15 minutes)
using the same scale, the observed values would be scattered around this "true
height" due to the equipment being used or imperfection in the visual acuity of the
measurer (whether the same examiner or somebody else). Thus, if T denotes the per-
son's constant true height, then the observed height (X) in any of the repeated meas-
urements will deviate from T with an error of measurement (E). That is,
$$X = T + E \tag{3.1}$$
In classical test theory, one often refers to a client's observed score (X, the score
the client received on a test) and the client's true score (T, the score the client would
have received if the test and testing conditions were free of error [E]). Thus, if E = 0
(i.e., there is no error), the observed score is the true score (i.e., if E = 0, then X = T).
To grasp what is meant by true score in classical test theory, imagine that a per-
son takes a standardized intelligence test each day for 100 days in a row. The person
would likely obtain a number of different observed scores over these occasions. The
mean of all observed scores would represent an approximation of the person's true
score (T) on the standardized intelligence test. In general, the true score is the average
of the (theoretical) distribution of scores that would be observed in repeated independent
measurements of a person with the same test. Importantly, the true score
(T) is a hypothetical concept, for it is not practically possible to test the same person
an infinite number of times in independent repeated measurements because each testing could
influence the subsequent testing (e.g., practice effects, memory effects).
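The idea of a true score as the long-run average of observed scores can be illustrated with a small simulation. This is a minimal sketch that assumes random errors follow a normal distribution with a fixed spread; the true score of 100, the error spread of 5, and the function name are all hypothetical choices made only for the example.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_testings, seed=0):
    """Generate X = T + E for repeated, independent testings of one person.
    Errors are drawn at random from a normal distribution (an assumption)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_testings)]


# Hypothetical person with a true score of 100 and error spread of 5 points.
scores_100 = simulate_observed_scores(100.0, 5.0, n_testings=100)
scores_many = simulate_observed_scores(100.0, 5.0, n_testings=100_000)

print("mean of 100 testings:    ", round(statistics.mean(scores_100), 2))
print("mean of 100,000 testings:", round(statistics.mean(scores_many), 2))
```

With more simulated testings, the mean of the observed scores settles closer to the true score. A simulation can ignore practice and memory effects, which is exactly what real repeated testing cannot do.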
It is important to note that the error in Equation 3.1 is assumed to be random
in nature. Possible sources of random error are (1) fluctuations in the mood or
alertness of persons taking the test due to fatigue, illness, or other recent experi-
ences; (2) incidental variation in the measurement conditions due, for example, to
outside noise or inconsistency in the administration of the instrument; (3) differ-
ences in scoring due to factors such as scoring errors, subjectivity, or clerical errors;
and (4) random guessing on response alternatives in tests or questionnaire items.
Conversely, systematic errors that remain constant from one measurement to an-
other do not lead to inconsistency and therefore do not affect the reliability of the
scores. Systematic errors will occur, for example, when one professional counselor
assigns 2 points lower than another professional counselor to each person in a
group of examinees. So, again, the reliability of any measurement is the extent to
which the measurement results are free of random errors. Random error affects relia-
bility; systematic error does not.
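The contrast between systematic and random error can be sketched with simulated ratings. In the example below, one hypothetical counselor always scores 2 points lower than another (systematic error), while a third adds random noise to the same ratings. The ratings, the noise level, and the helper function are assumptions made only for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
rater_a = [rng.gauss(50, 10) for _ in range(500)]          # hypothetical ratings
rater_b_systematic = [score - 2 for score in rater_a]      # always 2 points lower
rater_b_random = [score + rng.gauss(0, 5) for score in rater_a]  # random error added

r_systematic = pearson_r(rater_a, rater_b_systematic)
r_random = pearson_r(rater_a, rater_b_random)
print(f"constant 2-point difference: r = {r_systematic:.3f}")
print(f"random error added:          r = {r_random:.3f}")
```

The constant 2-point difference leaves the rank order of examinees untouched, so the two sets of ratings remain perfectly correlated. Only the random error lowers the consistency between raters.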
Classical Definition of Reliability
Equation 3.1 represents the classical assumption that any observed score (X) consists
of two parts: true score (T) and error of measurement (E). Because errors are random,
it is assumed that they do not correlate with the true scores (i.e., $r_{TE} = 0$). Indeed,
there is no reason to expect that persons with higher true scores would have system-
atically larger (or smaller) measurement errors than persons with lower true scores.
Under this assumption, Equation 3.2 is true for the variances ($\sigma^2$) of observed scores,
true scores, and errors for a population of test takers:
$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 \tag{3.2}$$
that is, the observed score variance ($\sigma_X^2$) is the sum of true score variance ($\sigma_T^2$) and
error variance ($\sigma_E^2$). Given this, the reliability of measurements, $r_{xx}$, indicates what proportion
of the observed score variance is true score variance. The analytic translation of
this definition is
$$r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} \tag{3.3}$$
The definition of reliability implies that the reliability takes values from 0.00 to
1.00. The closer $r_{xx}$ is to 1.00, the higher the reliability, and, conversely, the closer
$r_{xx}$ is to zero, the lower the reliability. Perfect reliability ($r_{xx} = 1.00$) can theoretically
occur when the total observed score variance is true score variance ($\sigma_X^2 = \sigma_T^2$) or,
equivalently, when the error variance is zero ($\sigma_E^2 = 0$).
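The classical definition can be restated as a two-line calculation: add true score variance and error variance to get observed score variance, then take the proportion that is true score variance. The function name and the variance values below are hypothetical and serve only to illustrate the definition.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Proportion of observed score variance that is true score variance."""
    if true_variance < 0 or error_variance < 0:
        raise ValueError("variances cannot be negative")
    observed_variance = true_variance + error_variance  # observed = true + error
    if observed_variance == 0:
        raise ValueError("observed score variance must be greater than zero")
    return true_variance / observed_variance


# Hypothetical variance components (illustrative values only).
print(reliability(80.0, 20.0))   # most of the observed variance is true variance
print(reliability(50.0, 50.0))   # half of the observed variance is error
print(reliability(100.0, 0.0))   # no error variance: perfect reliability
```

In practice true score variance cannot be observed directly, so reliability has to be estimated from data using methods such as those named in the chapter introduction.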
In general, reliability coefficients in the 0.80s are desirable for screening tests, and
coefficients in the 0.90s for diagnostic decisions (Salvia & Ysseldyke, 2004). Reliabilities of less than
0.80 indicate substantial error variance and subsequent inconsistent conclusions.
This is not to say that scores based on $r_{xx} < 0.80$ cannot be helpful for hypothesis
generation (exploring problems or strengths in areas of client functioning); for hy-
pothesis validation (confirming suspected problems or strengths in areas of client
functioning); or for instruments used in research studies for the purpose of defining
a construct (e.g., self-efficacy, anxiety). However, important decisions about a client's
life should be based on more consistently derived information.
Standard Error of Measurement (SEM)
Classical test theory also proposes two additional assumptions: (a) that the distribu-
tion of observed scores that a person may obtain under repeated independent test-
ings with the same test is normal, and (b) that the standard deviation of this normal
distribution, referred to as the standard error of measurement (SEM), is the same
for all persons taking the test. Figure 3.1 represents a hypothetical normal distribu-
tion of observed scores for a person with a true score of 20 for a specific test. The
mean of the distribution is the person's true score (T = 20), and the standard deviation
is the standard error of measurement (SEM = 2).
Figure 3.1 Theoretical distribution of observed scores for repeated independent testings of one person with the same test
Based on the statistical properties for normal distributions, about 95% of the
scores fall in the interval from 2 standard deviations below the mean to 2 standard
deviations above the mean. In Figure 3.1, this is the interval from T - 2(SEM) to
T + 2(SEM), which in this case is from 16 to 24 [i.e., 20 - 2(2) to 20 + 2(2)]. This
property can be used to construct (approximately) a 95% confidence interval of a
person's true score (T) falling within the given observed score (X) range based on the
person's performance in a single testing:
$$X - 2(\mathit{SEM}) < T < X + 2(\mathit{SEM}) \tag{3.4}$$
For example, if 23 is the person's observed score in a single real testing (X = 23),
then the true score of this person is expected (with about 95% confidence) to fall in
the interval from 23 - 2(SEM) to 23 + 2(SEM). This range of scores within which
the true score probably lies is called a confidence interval because it gives the degree
of confidence an examiner can expect regarding whether the client's true score lies
within the given interval. In this example, with SEM = 2, the 95% confidence interval
for the person's true score is from 23 - 2(2) to 23 + 2(2), or from 19 to 27.
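The interval in this example can be reproduced with a small helper. This is a minimal sketch of the "observed score plus or minus 2 SEMs" rule described above; the function name and its parameter names are hypothetical.

```python
def true_score_interval(observed_score, sem, n_sems=2):
    """Approximate confidence interval for the true score:
    observed score minus/plus a number of SEMs (2 SEMs for about 95%)."""
    if sem < 0:
        raise ValueError("SEM cannot be negative")
    return (observed_score - n_sems * sem, observed_score + n_sems * sem)


# The example from the text: observed score of 23 with SEM = 2.
lower, upper = true_score_interval(23, 2)
print(f"approximate 95% interval: {lower} to {upper}")  # 19 to 27
```

Passing `n_sems=1` or `n_sems=3` gives the narrower (about 68%) and wider (about 99.7%) intervals in the same way.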
When it comes to understanding and using confidence intervals, it is useful to
know that (a) about 68% of all possible observed scores in Figure 3.1 fall in the interval
from T - 1(SEM) to T + 1(SEM), i.e., from 18 to 22 in this case; (b) about
95% of all possible observed scores in Figure 3.1 fall in the interval from T - 2(SEM)
to T + 2(SEM), i.e., from 16 to 24 in this case; and (c) almost all (99.7%) of the
observed scores in Figure 3.1 are in the interval from T - 3(SEM) to T + 3(SEM),
which in this case is from 14 to 26. You may have noticed that these percentages (i.e.,
68%, 95%, 99.7%) are the same percentages under the normal curve used in the
discussion of standard deviation. This is because the SEM is, in effect, the standard
deviation for the individual, with the individual's true test score standing at the cen-
ter and the SEM serving as the "personal standard deviation," based on the test score
reliability coefficient.
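The band arithmetic described above can be sketched in a few lines of Python. This is only an illustration under the normality assumptions already stated; the function name `confidence_interval` and its parameters are hypothetical and do not come from any test manual or scoring program.

```python
def confidence_interval(observed, sem, n_sem=2):
    """Return the (lower, upper) bounds of observed +/- n_sem * SEM."""
    half_width = n_sem * sem
    return observed - half_width, observed + half_width


# Observed score X = 23 with SEM = 2, as in the example in the text.
for n_sem, level in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lower, upper = confidence_interval(23, 2, n_sem)
    print(f"about {level}: {lower} to {upper}")
```

With `n_sem=2` the function reproduces the interval from 19 to 27 worked out in the text; the 1-SEM and 3-SEM bands follow from the same arithmetic.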
A smaller SEM will produce smaller confidence intervals for the person's true
score, thus improving the accuracy of measurement. Also, because the SEM is in-
versely related to reliability, high reliability indicates high accuracy of measurements
(lower SEM). SEMs are much more helpful than reliability coefficients when reporting
client test scores. The reliability coefficient is a unitless number between 0 and 1
conveniently used to report reliability in empirical studies. But the SEM relates directly
to the meaning of the test's scale of measurement (e.g., raw number-right
score, deviation IQ score, T score, z-score) and is therefore more useful for score interpretations
(e.g., Feldt & Brennan, 1989; Thissen, 1990). The SEM is related to
the reliability, $r_{XX}$, and the standard deviation of the observed scores, $\sigma_X$, as follows:
$$\mathrm{SEM} = \sigma_X \sqrt{1 - r_{XX}} \tag{3.5}$$
To compute the SEM, one needs to know the reliability and standard deviation
of the client's test score. For example, if the reliability is 0.90 and the standard devi-
ation of the client's observed scores is 15 (such as is the case for the deviation IQ, a
standard score scale with an M = 100 and SD = 15 — a scale commonly used in in-
telligence and achievement tests), then the standard error of measurement is
$$\mathrm{SEM} = 15\sqrt{1 - 0.9} = 15(0.3162) = 4.743.$$
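As a minimal sketch of this computation, the snippet below derives the SEM from a scale's standard deviation and a reliability coefficient. The function name `standard_error_of_measurement` is a hypothetical helper chosen for illustration, and the range check on the reliability is an added safeguard rather than something the text prescribes.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), following Equation 3.5."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Deviation IQ scale (SD = 15) with a reliability of 0.90.
sem = standard_error_of_measurement(15, 0.90)
print(round(sem, 3))  # 4.743
```

The same helper can be applied to any score scale (for example, T scores with SD = 10) once the reliability reported in the test manual is known.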
Some test manuals leave it to the test user to compute the client's confidence in-
terval, sometimes providing only reliability coefficients; others provide confidence
intervals in norm conversion tables. Professional counselors understand that even
though it is often necessary to make decisions about clients based on an observed or
obtained score, it is not appropriate to interpret a single observed score to a client.
Instead, it is appropriate to report and interpret the range of scores within which the
true score probably lies.
Furthermore, it is most appropriate to interpret these scores at the 95% level of
confidence (± 2 SEM). Some test manuals and computer scoring programs recom-
mend interpretation at the 68% level of confidence, which means that the client's
true score will fall outside the suggested range in 1 out of every 3 reports (i.e., the
68% level results in an average "mistake rate" of 32%!). Most clinicians (and clients)
find it unacceptable to be wrong in 1 of every 3 decisions, especially decisions
related to diagnosis and treatment. Using the 95% level of confidence (± 2 SEM)
means that the true score falls in the reported range 95 out of 100 administrations.
A 5% error rate is much more acceptable in clinical practice, especially when making
decisions about people's lives that may influence treatment for months or years
into the future.
Consider the following examples of how to apply SEM to score interpretation.
If a client's full-scale IQ (FSIQ) score on the WAIS-III is 110, and the SEM is equal
to 4 standard score points, the client's IQ could be interpreted at the 95% level of
confidence (± 2 SEM) as 110 ± 8 (i.e., 2 × 4). Thus, on 100 alternate-form administrations
of the WAIS-III, the client's FSIQ would probably fall within the FSIQ
range of 102-118 about 95 times. This means that the professional counselor may
have 95% confidence that the client's true IQ score falls between 102 and 118 (also
referred to as the Average to High Average range). Likewise, the client's Conners'
Adult ADHD Rating Scales (CAARS) (Conners, Erhardt, & Sparrow, 1999) DSM-IV
inattention scale T score of 71, with an SEM = 3 points (T score units), would be
interpreted at the 95% level of confidence (± 2 SEM) as 71 ± (2 × 3) = 71 ± 6. Thus,
on 100 alternate-form administrations of the CAARS DSM-IV inattention scale, the
client's T score would probably fall within the T score range of 65-77 about 95
times.
Think About It 3.1 If a client's observed score on the MMPI-2
Depression scale is a T score (M = 50, SD =10) of 67, and the scale's reliabil-
ity is 0.82, what is the client's likely range of scores at the 95% level of confi-
dence? Given this information, would you be inclined to support a diagnosis
of depression for this client? Explain.
TYPES OF RELIABILITY
The reliability of test scores for a population of examinees is defined as the ratio of
their true score (T) variance to observed score variance (see Equation 3.3).
Equivalently, the reliability can also be represented as the squared correlation between
true and observed scores (i.e., $r_{XX} = r_{XT}^2$). Unfortunately, in empirical research,
true scores cannot be directly determined. Thus the reliability is typically estimated
by coefficients of internal consistency, test-retest, alternate forms, and other types of
reliability estimates adopted in the measurement literature. It is important to em-
phasize that different types of reliability relate to different sources of measurement
error and, contrary to common misconceptions, are generally not interchangeable.
Internal Consistency
Internal consistency estimates of reliability are based on the average correlation
among items within a test or scale. A huge advantage of internal consistency is that
participants need to receive only one administration of a single test on a single occa-
sion. A widely known method for determining internal consistency of test scores is
split-half reliability. Using the split-half method, the researcher literally divides the
questions into two halves, either by an odd-even method or by some other strategy.
Each half of the items is treated as a separate test, and the total scores of these two
half-tests for each participant are correlated together. With this method, the two
halves are assumed to be parallel (i.e., the two halves have equal true scores and equal
error variances).
However, because halving the number of items on a test substantially lowers
the correlation (i.e., all other things being equal, the greater the number of items,
the higher the correlation — thus halving the number of items lowers the correla-
tion), an estimation formula is required to predict what the internal consistency
of the items would be if returned to the size of the original complement of items.
The score reliability of the whole test is estimated using the Spearman-Brown
Prophecy formula:
$$r_{XX} = \frac{2 r_{12}}{1 + r_{12}} \tag{3.6}$$
where $r_{12}$ is the Pearson correlation between the scores on the two halves of the test.
For example, if the correlation between the two test halves is 0.6, then the split-half
reliability estimate is: $r_{XX}$ = 2(0.6)/(1 + 0.6) = 0.75.
The Spearman-Brown Prophecy formula can also be used to determine the
likely result of adding more items to a given scale. Following on the example above,
if the number of test items yielding the internal consistency coefficient of 0.75 were
doubled yet again (this is what the value 2 in the numerator designates), the resulting
reliability coefficient would be $r_{XX}$ = 2(0.75)/(1 + 0.75) = 1.50/1.75 = 0.86.
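The prophecy calculation can be sketched as a small Python function. Note one assumption that goes beyond the text: the function uses the general form of the Spearman-Brown formula, k·r / (1 + (k − 1)·r), where k is the factor by which test length changes; with k = 2 it reduces to Equation 3.6. The name `spearman_brown` and the parameter `length_factor` are hypothetical.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor.

    With length_factor = 2 this is the doubling case of Equation 3.6.
    """
    return length_factor * r / (1.0 + (length_factor - 1.0) * r)


print(round(spearman_brown(0.6), 3))   # half-test correlation of 0.6 -> 0.75
print(round(spearman_brown(0.75), 3))  # doubling a 0.75 test -> 0.857
```

Values of `length_factor` below 1 (e.g., 0.5) would, under the same assumptions, predict the reliability of a shortened test.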
How one splits the items into two equivalent halves when computing internal
consistency is very important. One commonly used approach to forming test halves,
called the odd-even method, is to assign the odd-numbered test items to one half and
the even-numbered test items to the other half of the test. This method is particu-
larly appropriate when the items are presented in order of increasing difficulty, such
as on an achievement or intelligence test. Perhaps an even more appropriate method
would be to stagger the assignments to even out the item difficulty levels (i.e., sum
items 1, 4, 5, 8, 9 versus items 2, 3, 6, 7, 10).
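To make the odd-even procedure concrete, here is a hedged Python sketch that totals the odd- and even-numbered items for each person, correlates the two half-test totals, and applies the Spearman-Brown correction. The data are invented for illustration, and the helper names `pearson_r` and `odd_even_split_half` are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_split_half(item_scores):
    """Split-half reliability with the Spearman-Brown correction.

    item_scores: one list per person, one entry per item, in test order.
    """
    odd_totals = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    r_halves = pearson_r(odd_totals, even_totals)
    return 2 * r_halves / (1 + r_halves)


# Hypothetical data: 6 people x 6 dichotomous items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
print(round(odd_even_split_half(responses), 2))
```

A real analysis would use far more examinees; six people are shown only to keep the arithmetic easy to follow by hand.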
A more recommended approach, called matched random subsets, involves three
steps. First, two statistics are calculated for each item: the proportion of individuals
who answered the item correctly (i.e., the item difficulty) and the point-biserial cor-
relation between the item and the total test score. Second, each item is plotted on a
graph using these two statistics as coordinates of a dot representing the item. Third,
items that are close together on the graph are paired, and one item from each pair is
randomly assigned to each half of the test.
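The three steps above can be approximated in code. The sketch below is one possible reading, not a standard implementation: it computes each item's difficulty and point-biserial correlation, pairs items greedily by nearest distance in that two-dimensional space (the greedy pairing rule is an assumption standing in for pairing dots by eye on a graph), and then assigns one member of each pair to each half at random. All function names are hypothetical, and item indices are 0-based.

```python
import random
from statistics import mean, pstdev


def item_statistics(item_scores):
    """Return a (difficulty, point-biserial) pair for each item.

    item_scores: one list per person, one 0/1 entry per item.
    """
    totals = [sum(person) for person in item_scores]
    mean_total, sd_total = mean(totals), pstdev(totals)
    stats = []
    for i in range(len(item_scores[0])):
        item = [person[i] for person in item_scores]
        p = mean(item)  # item difficulty: proportion answering correctly
        if 0 < p < 1 and sd_total > 0:
            mean_pass = mean([t for t, x in zip(totals, item) if x == 1])
            r_pb = (mean_pass - mean_total) / sd_total * (p / (1 - p)) ** 0.5
        else:
            r_pb = 0.0  # correlation is undefined without variation
        stats.append((p, r_pb))
    return stats


def matched_random_halves(item_scores, seed=None):
    """Pair items with similar statistics, then split each pair at random."""
    rng = random.Random(seed)
    stats = item_statistics(item_scores)
    unpaired = list(range(len(stats)))
    half_a, half_b = [], []
    while len(unpaired) >= 2:
        i = unpaired.pop(0)
        # Greedy choice (an assumption): nearest remaining item on the graph.
        j = min(unpaired, key=lambda k: (stats[i][0] - stats[k][0]) ** 2
                + (stats[i][1] - stats[k][1]) ** 2)
        unpaired.remove(j)
        pair = [i, j]
        rng.shuffle(pair)
        half_a.append(pair[0])
        half_b.append(pair[1])
    return half_a, half_b, unpaired  # unpaired keeps a leftover item, if any


# Hypothetical data: 6 people x 6 dichotomous items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
print(matched_random_halves(responses, seed=0))
```

The two returned lists of item indices can then be scored as separate half-tests and correlated in the usual split-half manner.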
Computer programs, such as SPSS, are frequently used to compute internal
consistency estimates. Researchers and test users should use caution to ensure that
proper item matching procedures were used, lest the computer default to a procedure
that will overestimate a scale's internal consistency, leading to undue confidence
in score reliability. Importantly, if the instrument consists of different scales yielding
interpreted scores, internal consistency should be estimated for each scale. For ex-
ample, the Disruptive Behavior Rating Scale (DBRS) (Erford, 1993) is composed of
four subscales: Distractible, Oppositional, Impulsive-Hyperactive, and Antisocial
Conduct. There is no interpretable total score, and each subscale score is interpreted
as a separate subscale. Thus internal consistency coefficients for the observed scores
on each scale are of interest.
The Spearman-Brown Prophecy formula is not appropriate when there are in-
dications that the test halves are not parallel (e.g., when the two test halves do not
have equal variances). In such cases, the internal consistency of the scores for the
whole test can be estimated with Cronbach's coefficient α (Greek letter alpha)
using the formula (Cronbach, 1951):
$$\alpha = \frac{2[\mathrm{VAR}(X) - \mathrm{VAR}(X_1) - \mathrm{VAR}(X_2)]}{\mathrm{VAR}(X)} \tag{3.7}$$
where VAR($X$), VAR($X_1$), and VAR($X_2$) represent the sample variance of the whole
test, its first half, and its second half, respectively. For example, if the observed score
variance for the whole test is 40 and the observed variances for the two test halves are
12 and 11, respectively, then coefficient alpha (α) = 2(40 - 12 - 11)/40 = 0.85.
The coefficient alpha is usually calculated for more than two components of the
test, and when item response formats are multiscaled (e.g., Very Dissatisfied,
Dissatisfied, Satisfied, Very Satisfied; or Almost Never, Sometimes, Frequently,
Almost Always). Each test component is an item or a set of items. Sometimes it is
helpful to see the mathematical formulas to understand what comprises α. But if you
find this confusing, don't worry. Computers do all of these computations nowadays
in a split second, using programs such as SPSS.
The general formula for alpha (see Equation 3.8) is simply an extension of
Equation 3.7 for more than two test components:
$$\alpha = \frac{n}{n-1}\left[1 - \frac{\sum_{i=1}^{n} \mathrm{VAR}(X_i)}{\mathrm{VAR}(X)}\right] \tag{3.8}$$
where n is the number of test components (usually the number of items), $X_i$ is the
observed score on the ith test component, VAR($X_i$) is the variance of $X_i$, X is the observed
score for the whole test (i.e., $X = X_1 + X_2 + \dots + X_n$), VAR($X$) is the variance
of X, and Σ (Greek capital letter sigma) is the summation symbol.
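For readers who find code easier to follow than notation, the general alpha formula can be sketched as below. The helper names `alpha_from_variances` and `cronbach_alpha` and the ratings data are hypothetical. The sketch uses the sample variance (n − 1 denominator) for both the components and the total; because alpha depends only on the ratio of variances, the result would be the same with population variances as long as one convention is used throughout.

```python
from statistics import variance


def alpha_from_variances(component_variances, total_variance):
    """Coefficient alpha from component variances and total-score variance."""
    n = len(component_variances)
    return n / (n - 1) * (1 - sum(component_variances) / total_variance)


def cronbach_alpha(item_scores):
    """Coefficient alpha from raw data (one list of component scores per person)."""
    n = len(item_scores[0])
    component_variances = [variance([person[i] for person in item_scores])
                           for i in range(n)]
    total_variance = variance([sum(person) for person in item_scores])
    return alpha_from_variances(component_variances, total_variance)


# Two components reproduce the two-half example: variances 12 and 11, total 40.
print(round(alpha_from_variances([12, 11], 40), 2))  # 0.85

# Hypothetical ratings: 5 people x 4 items on a 1-4 response scale.
ratings = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))
```

Separating the variance-based helper from the raw-data function makes it easy to check the formula against published variances without access to item-level data.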
When each test component is a dichotomously scored item (1 = correct [or true],
0 = incorrect [or false]), the coefficient α can be calculated by an equivalent formula,
called Kuder-Richardson formula 20 (see Equation 3.9), with the notation KR-20 (or
α-20) for the coefficient of internal consistency:
$$\text{KR-20} = \frac{n}{n-1}\left[1 - \frac{\sum_{i=1}^{n} p_i(1 - p_i)}{\mathrm{VAR}(X)}\right] \tag{3.9}$$
where n is the number of test items, X is the observed score for the whole test,
VAR($X$) is the variance of X, $p_i$ is the proportion of persons who answered item i
correctly, and $p_i(1 - p_i)$ is the variance of the observed binary scores on item i ($X_i$ = 1
or 0); that is, VAR($X_i$) = $p_i(1 - p_i)$.
Again, high-speed computer programs, such as SPSS, make the computation of
coefficient α, or KR-20, rather simple.
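A minimal sketch of the KR-20 computation for dichotomous items follows. The function name `kr20` and the response data are hypothetical. One detail is an assumption worth flagging: because p(1 − p) is the population variance of a 0/1 item, the sketch also uses the population variance (`pvariance`) for the total score so that the two terms are on the same footing.

```python
from statistics import mean, pvariance


def kr20(item_scores):
    """Kuder-Richardson formula 20 for 0/1 item scores (Equation 3.9).

    item_scores: one list per person, one 0/1 entry per item.
    """
    n = len(item_scores[0])
    # p[i] is the proportion of persons answering item i correctly.
    p = [mean([person[i] for person in item_scores]) for i in range(n)]
    total_variance = pvariance([sum(person) for person in item_scores])
    item_variance_sum = sum(pi * (1 - pi) for pi in p)
    return n / (n - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: 6 people x 6 dichotomous items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
print(round(kr20(responses), 2))
```

If every examinee earns the same total score, the total variance is zero and the coefficient is undefined; a production version would need to handle that case explicitly.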
Recall from Chapter 1 that speeded tests are those on which few clients miss any
items, but the score is determined by how many items a client finishes in a given pe-
riod of time. With a speed test, the split-half correlation coefficient ordinarily would
be close to zero if the test were split into the first half of items versus the second half
of items, since most examinees would correctly answer almost all items in the first
half and (running out of time) would miss most items in the second half of the test.
Likewise, if the odd-even splitting method is used for a speeded test, the resulting
correlation would be artificially high because clients usually would get all items cor-
rect up until the point at which time ran out, and all subsequent items would be
marked incorrect. Thus the score for odd items would almost always be within 1
point of the even-item total. When determining the internal consistency of speeded
tests, it is generally appropriate to split the test by time intervals, rather than items,
and to combine the raw scores for these intervals into the two test halves. For exam-
ple, on the WISC-IV's Coding subtest, one could observe how many items were re-
sponded to correctly during each of the eight 15-second intervals that comprise the
2-minute subtest. Then the number of items correctly responded to during the odd
(1st, 3rd, 5th, and 7th) 15-second intervals could be summed and correlated with
the sums of the even (2nd, 4th, 6th, and 8th) 15-second intervals for each partici-
pant in the study.
Test-Retest Reliability
The extent to which the same persons consistently respond to the same test, inven-
tory, or questionnaire administered on different occasions is known as the test-retest
reliability of test scores. Sometimes test-retest reliability is also called temporal stabil-
ity, meaning stability over time. Test-retest reliability is estimated by the correlation
between the observed scores of the same people taking the same test twice; that is,
the same participants take the same test on two separate occasions. The resulting correlation
coefficient is also referred to as the coefficient of stability, because the primary
source of measurement error is instability over time. Because tests are frequently used
to track therapeutic progress or the effects of medication, test-retest reliability can
provide helpful insights into how client scores are likely to vary simply due to a read-
ministration of the same test on a second occasion.
The major problem with test-retest reliability estimates is the potential for car-
ryover effects between the two test administrations. Readministration of the test
within a short period of time (e.g., a few days or weeks) may produce carryover ef-
fects due to memory and/or practice. For example, students who take a math or vo-
cabulary test may look up some answers they were unsure of after the first adminis-
tration of the test, thereby changing their true knowledge on the content measured
by the test. Likewise, the process of completing an anxiety inventory could trigger an
increase in the anxiety level of some people, thus causing their true anxiety scores to
change from one administration of the inventory to the next. This happens if the
client is more or less anxious on a second administration of the anxiety inventory.
If the construct (attribute) being measured varies over time (e.g., cognitive
skills, depression), a long period of time between the two administrations of the
instrument may produce carryover effects due to biological maturation, cognitive
development, or changes in information, experience, and/or moods. For example,
if a student learns a lot about math between the first and second administration of
a math achievement test, the student's score may increase substantially. Likewise,
a client with depression who is administered the Beck Depression Inventory — Second
Edition (BDI-II) (Beck, Steer, & Brown, 1996) may receive a lower score on the
second administration of the BDI-II six months later, regardless of whether treat-
ment was successful.
Thus, test-retest reliability estimates are most appropriate for measurements of
traits that are stable across the time period between the two test administrations (e.g.,
visual or auditory acuity, personality, work values). In addition to problems with car-
ryover effect, there is also a practical limitation to retesting, because it is usually time
consuming and/or expensive. For many tests, retesting solely for the purpose of es-
timating score stability may be impractical, although it is frequently of interest to
clinicians using tests as an outcome measure to know what degree of consistency to
expect on test readministration.
On a final note, researchers should always report the time interval between the
first and second administrations of the test. This is because, normally, the longer the
period of time between the two administrations, the lower the reliability (e.g., the
greater the chances that some external factor or developmental change will occur).
Alternate Forms Reliability (Equivalent Forms Reliability)
One way of counteracting the practice effects that occur in test-retest reliability is to
design two equivalent versions of a test. If two versions of an instrument (test, inven-
tory, or questionnaire) have very similar observed score means, variances, and corre-
lations with other measures, they are called alternate forms or equivalent forms of
the instrument. In fact, any decent attempt to construct parallel tests is expected to
result in alternate test forms, as it is practically impossible to obtain perfectly paral-
lel tests (i.e., equal true scores and equal error variances). Alternate forms usually are
easier to develop for instruments that measure, for example, abilities and aptitudes
or specific academic abilities because of the larger potential item pools (i.e., domains
of knowledge) than those that measure constructs that are more difficult to repre-
sent with measurable variables (e.g., personality, motivation, temperament, anxiety).
Thus professional counselors will frequently see alternate forms of achievement tests
(e.g., Forms A and B of the WJ-III ACH [Woodcock, Mather, & McGrew, 2001] and
the Blue and Tan forms of the WRAT-III [Wilkinson, 1993]), but they only rarely
see alternate forms purposefully designed by a test author in the intellectual, behav-
ioral, or personality domains.
Alternate form reliability is a measure of the consistency of scores on alternate
test forms administered to the same group of individuals — that is, two equivalent
tests administered to the same participants on two separate occasions. The correla-
tion between observed scores on two alternate test forms, referred to as the coefficient
of equivalence, provides an estimate of the reliability of each of the alternate forms
based on item content, scorer, and temporal stability. Just as with the test-retest reli-
ability coefficients, the estimates of alternate form reliability are subject to carryover
(practice) effects, but to a lesser degree, as the persons are not tested twice with the
same items. To minimize carryover effects, a recommended rule of thumb is to have
a 2-week time period between administrations of alternate test forms.
Whenever possible, it is important to obtain both internal consistency coeffi-
cients and alternate forms correlations for a test. If the correlation between alternate
forms is much lower than the internal consistency coefficient (e.g., a difference of
0.20 or more), this might be due to (a) differences in content, (b) subjectivity of
scoring, and (c) changes in the trait being measured over time between the adminis-
trations of alternate forms. To determine the relative contribution of these sources of
error, it is usually recommended to administer the two alternate forms on the same
day for half a sample of respondents, and then after a 2-week time interval for the
other half of the sample (so long as the number of participants in each group is at
least 10 for empirical purposes). If the correlation between the scores on the
alternate forms for the same-day administration is much higher than the correlation
for the 2-week time interval, then variation in the trait being measured is a major
source of error (i.e., temporal instability). For example, it is likely that measures of
mood will change over a 2-week time interval, and thus the 2-week correlation will
be lower than the same-day correlation between the alternate forms of the instru-
ment. However, if the two correlations are both low, the persons' scores may be sta-
ble over the 2-week time interval, but the alternate forms probably differ in content.
Likewise, when scores on alternate forms of an instrument are assigned by raters
(e.g., counselors, parents, teachers), one may check for scoring subjectivity by using
a three-step procedure: (1) randomly split a large sample of persons; (2) administer
the alternate forms on the same day for one group of people; and (3) administer the
alternate forms after a 2-week time interval for the other group of people. If the cor-
relations between raters are high for both groups, there is probably little scoring error
due to subjectivity. If the correlation over the 2-week time interval and the same-day
correlation are both consistently low across different raters, it is difficult to deter-
mine the major sources of scoring errors. Such errors can be reduced by training the
raters in using the instrument and by providing clear guidelines for scoring behav-
iors or traits being measured.
Reliability of Criterion-Referenced Tests
Criterion-referenced measurements show how the examinees stand with respect to
an external criterion. The criterion is usually some specific educational or perform-
ance objective, such as "can apply basic algebra rules," "is able to recognize patterns,"
or even "is at risk for depression."
Most teacher-made tests are criterion referenced because the teacher is more in-
terested in how well students master coursework (criterion referenced) rather than
how students did when compared with other students (norm referenced). Likewise,
professional counselors frequently want to know whether a client has "enough" of a
mental disorder (depression, anxiety, oppositional behavior) to warrant a diagnosis.
This is also a situation calling for criterion-referenced measurement. Because a
criterion-referenced test may cover numerous specific objectives (criteria), each
objective should be measured as accurately as possible. When the results of criterion-referenced
measurements are used for classifications related to mastery or nonmastery
of the criterion, the reliability of such classifications is often referred to as classification
consistency. This type of reliability shows the consistency with which
classifications are made, either by the same test administered on two occasions or by
alternate test forms.
Table 3.1 Contingency table for mastery-nonmastery classifications
                              Form B
                        Master       Nonmaster
Form A    Master        $p_{11}$     $p_{12}$      $P_{A1}$
          Nonmaster     $p_{21}$     $p_{22}$      $P_{A2}$
                        $P_{B1}$     $P_{B2}$
Two classical indices of classification consistency are (a) $P_o$ = the observed proportion
of persons consistently classified as mastery versus nonmastery and (b)
Cohen's κ (Greek letter kappa) = the proportion of nonrandom consistent classifications.
Their calculation is illustrated for the two-way data layout in Table 3.1, where
the entries are proportions of persons classified as masters or nonmasters by two alternate
test forms of a criterion-referenced test (Form A and Form B). Specifically,
$p_{11}$ is the proportion of persons classified as "mastery" (those who mastered the content
to the specified level) by both test forms; $p_{12}$ is the proportion of persons classified
as "mastery" by Form A and "nonmastery" by Form B; $p_{21}$ for "nonmastery" on
Form A and "mastery" on Form B; and $p_{22}$ as "nonmastery" on both forms of the
test. Also, $P_{A1}$, $P_{A2}$, $P_{B1}$, and $P_{B2}$ are notations for marginal proportions; that is,
$P_{A1} = p_{11} + p_{12}$, $P_{B1} = p_{11} + p_{21}$, $P_{A2} = p_{21} + p_{22}$, and $P_{B2} = p_{12} + p_{22}$. The observed
proportion of consistent classifications (mastery/nonmastery) is
$$P_o = p_{11} + p_{22} \tag{3.10}$$
However, $P_o$ can be a misleading indicator of classification consistency, because
part of it may occur by chance. Cohen's kappa (see Equation 3.11) takes into account
the proportion of consistent classification that is theoretically expected to occur by
chance, $P_e$, and provides a ratio of nonrandom consistent classifications:
$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{3.11}$$
where $P_e$ is obtained by summing the cross-products of marginal proportions in
Table 3.1: $P_e = P_{A1}P_{B1} + P_{A2}P_{B2}$. In Equation 3.11, the numerator ($P_o - P_e$) is the
proportion of nonrandom consistent classification being detected, whereas the denominator
($1 - P_e$) is the maximum proportion of nonrandom consistent classification
that may occur. Cohen's kappa indicates, then, what proportion of the maximum
possible nonrandom consistent classifications is found with the data.
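The two indices can be computed directly from a table of joint proportions. The sketch below is illustrative only: the function name `classification_consistency` and the example proportions are hypothetical, and the table is assumed to be square, with rows for Form A categories and columns for Form B categories.

```python
def classification_consistency(table):
    """Return (P_o, P_e, kappa) for a square table of joint proportions.

    table[i][j]: proportion classified in category i by Form A and j by Form B.
    """
    k = len(table)
    row_margins = [sum(row) for row in table]                             # Form A margins
    col_margins = [sum(table[i][j] for i in range(k)) for j in range(k)]  # Form B margins
    p_o = sum(table[i][i] for i in range(k))                      # Equation 3.10
    p_e = sum(row_margins[i] * col_margins[i] for i in range(k))  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)                               # Equation 3.11
    return p_o, p_e, kappa


# Hypothetical proportions: rows = Form A (master, nonmaster), columns = Form B.
p_o, p_e, kappa = classification_consistency([[0.5, 0.1],
                                              [0.1, 0.3]])
print(round(p_o, 3), round(p_e, 3), round(kappa, 3))  # 0.8 0.52 0.583
```

Because the function loops over however many categories the table contains, the same sketch also covers tables with more than two classification categories.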
Think About It 3.2 Administering a substance abuse screening test
along with a DSM-IV-TR diagnostic process, let us assign specific values to
the proportions in Table 3.1 (see Table 3.2): $p_{11}$ = 0.3, $p_{12}$ = 0.2, $p_{21}$ = 0.1,
and $p_{22}$ = 0.4. These are nice even numbers, meaning that 30%, 20%, 10%,
and 40% of the cases (decisions) fell into each category, respectively. The
marginal proportions are: $P_{A1}$ = 0.3 + 0.2 = 0.5, $P_{A2}$ = 0.1 + 0.4 = 0.5,
$P_{B1}$ = 0.3 + 0.1 = 0.4, and $P_{B2}$ = 0.2 + 0.4 = 0.6.
Table 3.2 Contingency table for mastery-nonmastery classifications for
identifying individuals with substance abuse
                                          Form B: DSM diagnosis
                                    Diagnosed          Not diagnosed
Form A:          Diagnosed          0.3                0.2                0.3 + 0.2 = 0.5
Substance        Not diagnosed      0.1                0.4                0.1 + 0.4 = 0.5
Abuse Test                          0.3 + 0.1 = 0.4    0.2 + 0.4 = 0.6
With these data, calculate the observed proportion of consistent classification
$P_o$. You should have gotten $P_o$ = 0.3 + 0.4 = 0.7 by using Equation
3.10.
Next, calculate κ using Equation 3.11. The proportion of consistent
classifications that may occur by chance in this hypothetical example is: $P_e$ =
(0.5)(0.4) + (0.5)(0.6) = 0.5. Using Equation 3.11, the Cohen's kappa ratio
is: κ = (0.7 - 0.5)/(1 - 0.5) = 0.2/0.5 = 0.4.
Finally, interpret these results. For this example of using a substance
abuse test, the initially obtained 70% of observed consistent classifications
($P_o$ = 0.7) is reduced to 40% consistent classifications after taking into account
consistent classifications that may occur by chance. Because kappa
provides "conservative" estimations of consistency, it is reasonable to report
in this case that the classification consistency is between 0.40 and 0.70 (i.e.,
between κ and $P_o$). (Note: For practical purposes, it is recommended to report
both $P_o$ and Cohen's kappa, as the latter is very conservative, thus underestimating
the actual rate of consistent classifications. Previous research [e.g.,
Chase, 1996; Subkoviak, 1988] provides some additional procedures for estimating
classification consistency, including scenarios with a single test administration
or prior to the initial application of the test.)
Interscorer and Interrater Reliability
The chances of measurement error usually increase when the scores are based on sub-
jective judgments of the person(s) doing the scoring. In general, the less objective
the scoring procedures, the lower the interscorer reliability. Such situations occur,
for example, with classroom assessment of essays or portfolios where the teacher is,
in fact, the "judge" of performance. In another scenario, involving some projective
tests of personality, the scorer (e.g., professional counselor, psychotherapist) should
decide if the person's responses suggest normal functioning or some form of psy-
chopathology. Subjective judgments of raters (experts, judges) are also used for clas-
sification purposes (e.g., to determine a "minimum level of competency" in pass/fail
decisions). In all scenarios of rater-based scoring, it is important to estimate the de-
gree to which the scores are unduly affected by the subjective judgments of the raters.
Such estimation is provided by coefficients of interrater reliability (also called coef-
ficients of interrater agreement).
Depending on the context of measurement, there are different methods of estimating
interrater reliability. Frequently used classical measures of interrater reliability
are the Pearson correlation coefficient, the observed proportion of consistent classification
($P_o$), and Cohen's kappa coefficient. The Pearson r is by far the most
commonly used measure of interscorer reliability when scores are interval, as most
test scores (e.g., standard scores) are, or ratio. Otherwise, the two indices of classification
consistency ($P_o$ and Cohen's kappa) can be used as estimates of interrater reliability when
two raters (instead of two test forms) classify persons as mastery or nonmastery.
When more than two categories are used by two raters to classify persons (or their
products), one can still use Equation 3.11 for Cohen's kappa, but $P_o$ and $P_e$ should
be calculated with a contingency table for the respective number of categories. For
example, with three classification categories (e.g., low, medium, and high performance),
$P_o$ and $P_e$ are calculated as follows: $P_o = p_{11} + p_{22} + p_{33}$ and
$P_e = P_{A1}P_{B1} + P_{A2}P_{B2} + P_{A3}P_{B3}$, where A and B now denote the two raters.
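When the raw material is a pair of rating lists rather than a finished contingency table, kappa can be computed by tallying the categories first. The following is a hedged sketch with invented ratings; the function name `cohen_kappa` is hypothetical, and the code assumes the two raters judged the same cases in the same order.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' category labels on the same cases."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    joint = Counter(zip(ratings_a, ratings_b))       # cell counts of the table
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)  # marginal counts
    p_o = sum(joint[(c, c)] for c in categories) / n
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 10 portfolios by two raters.
rater_1 = ["low", "low", "medium", "medium", "high",
           "high", "medium", "low", "high", "medium"]
rater_2 = ["low", "medium", "medium", "medium", "high",
           "high", "low", "low", "high", "high"]
print(round(cohen_kappa(rater_1, rater_2), 2))  # 0.55
```

In this invented example the raters agree on 7 of 10 cases, but about a third of that agreement would be expected by chance, which is why kappa is noticeably lower than the raw agreement rate. If both raters place every case in a single category, the chance term equals 1 and kappa is undefined.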
Interrater reliability is also sometimes used to refer to two independent observers
who rate another individual, such as when sets of mothers and teachers rate children
on a behavior rating scale and the results are correlated. This type of relationship is
better described as a type of criterion-related validity (see Chapter 4). In this in-
stance, one set of scores (e.g., teachers) serves as the criterion for the other set of
scores (e.g., mothers). If two raters independently assign scores (say, to portfolios) of
students, then the Pearson correlation coefficient for the two sets of scores can be
used as an estimate of interrater agreement. The higher the correlation coefficient,
the lower the error variance due to scorer differences, and the higher the interrater
agreement.
When scoring of alternate forms of a measurement instrument is done by two
or more raters, one can check for measurement error due to subjectivity of scoring
by administering the alternate forms (a) on the same day for one group of subjects
and (b) with a 2-week delay for another group of subjects. If the correlations between
raters are high for both groups, there is probably little error due to subjectivity of
scoring. If, however, the correlation over the 2-week time interval and the same-day
correlation are both consistently low across different raters, it is difficult to deter-
mine the major source of unreliability (subjectivity of scoring or, say, differences in
content for the two alternate forms of the instrument). The interrater reliability can
be improved by training the raters in the use of the instrument and providing clear
guidelines for scoring (e.g., a more specific rubric or more specific criteria).
Overall, researchers and test users can reduce measurement error and improve
reliability by (1) writing items clearly, (2) providing complete and understandable
test instructions, (3) administering the instrument under prescribed conditions, (4)
reducing subjectivity in scoring, (5) training raters and providing them with clear
scoring instructions, (6) using heterogeneous respondent samples to increase the
variance of observed scores, and (7) increasing the length of the test by adding items
that are (ideally) parallel to those that are already in the test. The general principle
behind improving reliability is to maximize the variance of relevant individual differ-
ences and minimize the error variance.
THE IMPORTANCE OF RELIABILITY
Reliability in Validation
ATTENUATION
The most important characteristic of any measurement is its validity — a concept re-
ferring to the meaningfulness, appropriateness, and usefulness of the inferences
made from the measurement scores. Validation is an ongoing process of gathering
evidence to support such inferences. It is essential to understand that it is the infer-
ences made from measurement scores that are being validated, not the instrument
(e.g., test, survey, or questionnaire) being used to obtain such scores.
The score reliability is an important (necessary, but not sufficient) condition in
the validation process. For example, as noted earlier in this chapter, the reliability of
scores predetermines a "ceiling" for their criterion-related validity, but how closely
this ceiling will be approached depends on other factors as well. The validation of
measurements in counseling usually deals with constructs (e.g., proficiency, motiva-
tion, anxiety, empathy, and beliefs) and involves different types of evidence. The
quality of such evidence depends, among other things, on the reliability of the data
collected from different sources. The reliability also affects the results from correla-
tional analyses and other statistical procedures used in the validation process. The
term attenuation is used to indicate the reduction of the magnitude of such results
due to unreliability of scores.
If the reliability of the scores on two variables $X$ and $Y$ is not perfect (i.e., $r_{XX} \neq 1$
and/or $r_{YY} \neq 1$), the observed correlation between $X$ and $Y$, $r_{XY}$, is attenuated (i.e.,
lower than the "actual" correlation between the person's true scores on the two variables,
$T_X$ and $T_Y$). One can estimate the correlation between the true scores $T_X$ and
$T_Y$ by using Equation 3.12, referred to as the correction for attenuation formula
(Spearman, 1904):
$$r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}} \qquad (3.12)$$
Think About It 3.3 The correlation between two variables, Self-esteem
($X$) and Persistence decisions ($Y$), in a study on academic persistence for college
undergraduates was found to be $r_{XY} = 0.35$. Professional counselors involved
in this study found also that the reliability of the two measures, $X$ and
$Y$, for the study data was relatively low: $r_{XX} = 0.68$ and $r_{YY} = 0.71$, respectively.
To estimate what would be the correlation between the two variables if
their measurements were perfectly reliable, the professional counselors used
Equation 3.12, thus obtaining a much higher correlation (0.50) between the
students' true scores (i.e., no error involved) on Self-esteem and Persistence
decisions:
$$r_{T_X T_Y} = \frac{0.35}{\sqrt{(0.68)(0.71)}} = 0.50$$
Importantly, because perfect reliability is generally not obtainable, one
cannot observe the corrected-for-attenuation correlation values. Such values
indicate the highest correlation coefficients for perfectly reliable scores.
Important conditions for using Equation 3.12 are that (1) the reliability
estimates, $r_{XX}$ and $r_{YY}$, should be accurate and (2) the components in the
right-hand side of Equation 3.12 ($r_{XY}$, $r_{XX}$, and $r_{YY}$) should be affected by
the same measurement error. The second condition holds, for example, if $r_{XY}$ is
estimated when $X$ and $Y$ are measured during one testing session and their
internal consistency estimates are used for $r_{XX}$ and $r_{YY}$ in Equation 3.12.
However, if $r_{XX}$ and $r_{YY}$ are alternate form reliabilities, error of measurement
involved in their estimation (due to time lapse and change of test form) would not be
involved in the estimation of the correlation between $X$ and $Y$ ($r_{XY}$). Then
Equation 3.12 will produce an overestimated true score correlation between $X$ and $Y$
($r_{T_X T_Y}$).
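The correction for attenuation in Equation 3.12 is simple enough to check by hand, but a short Python sketch makes the arithmetic of Think About It 3.3 easy to repeat with other values. The function name `correct_for_attenuation` is an illustrative assumption, not part of any published library.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true-score correlation from Equation 3.12."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Values reported in Think About It 3.3.
estimate = correct_for_attenuation(r_xy=0.35, r_xx=0.68, r_yy=0.71)
print(round(estimate, 2))  # 0.5
```

With perfectly reliable scores (both reliabilities equal to 1), the function returns the observed correlation unchanged, which matches the idea that attenuation arises only from measurement error.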
Attenuation effects due to unreliability of data occur also in hypothesis testing
with statistical methods. It should be noted, for example, that although the Pearson
correlation coefficient between an independent variable X and a dependent variable
(criterion) Y is attenuated by error of measurement, the regression coefficient (slope)
in the regression of Y on X is attenuated by measurement errors in X but not in Y
(Bohrnstedt, 1983). Therefore, particular attention should be paid to the reliability
of the pretest scores when they are used as a covariate (X), say, in the comparison of
treatment groups, using the statistical method analysis of covariance (ANCOVA).
The power of statistical tests is also attenuated by unreliability of the measurement
data (recall that the power of a statistical test of a null hypothesis is the probability
that the test will lead to the rejection of the null hypothesis when it is in fact false).
Specifically, the unreliability shrinks the observed effect size (e.g., produced by a spe-
cific treatment), thus reducing the power of the statistical test (for more details, see,
e.g., Cohen, 1988; Maxwell, 1980; Zimmerman & Williams, 1982).
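The point that the regression slope is attenuated by measurement error in X but not in Y can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are simulated (true slope of 2.0, error variance equal to the true-score variance of X, so the reliability of X is about 0.5), and all names are hypothetical.

```python
import random

def slope(x, y):
    """Least-squares slope for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    return cross / ss_x

random.seed(0)  # fixed seed so the simulated data are reproducible
n = 20000
true_x = [random.gauss(0, 1) for _ in range(n)]
true_y = [2.0 * t for t in true_x]                   # true slope is 2.0
noisy_x = [t + random.gauss(0, 1) for t in true_x]   # reliability of X about 0.5
noisy_y = [v + random.gauss(0, 1) for v in true_y]   # random error added to Y only

slope_error_in_y = slope(true_x, noisy_y)  # stays close to 2.0
slope_error_in_x = slope(noisy_x, true_y)  # shrinks toward 2.0 * 0.5 = 1.0
print(round(slope_error_in_y, 1), round(slope_error_in_x, 1))
```

In this simulated setting the slope shrinks roughly in proportion to the reliability of X, which is one way to see why the reliability of a pretest used as a covariate deserves particular attention.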
RELIABILITY OF COMPOSITE SCORES
In many situations, scores from two or more scales are combined into composite scores
to measure and interpret a more general dimension (trait, ability, or proficiency) re-
lated to these scales. Composite scores are often used with test batteries for achieve-
ment, aptitude, intelligence, depression, or eating disorders, as well as with local
school measurements such as performance and portfolio assessments. One frequently
reported composite score, for example, is the sum of verbal and quantitative scores
of the Graduate Record Examination (GRE). Another example is the WISC-IV's 10
core subtests, which yield four index scores (i.e., Verbal Comprehension Index
[VCI], Perceptual Reasoning Index [PRI], Working Memory Index [WMI] and
Processing Speed Index [PSI]), which are subsequently combined to yield the full-
scale IQ (FSIQ). The scores on nine scales of the Symptom Checklist-90-Revised
(SCL-90-R) (Derogatis, 1990) are combined into three "global" (composite) scores
in measuring current psychological symptom status. A Total Aggressive Expression
score with the Driving Anger Expression Inventory (DAX) (Deffenbacher, Lynch,
Oetting, & Swaim, 2002) is also obtained as a sum of three scales: Verbal Aggressive
Expression, Personal Physical Aggressive Expression, and Using the Vehicle to
Express Anger. Thus, composite scores are frequently encountered in psychological
and educational testing.
Although the composite score may be simply the sum of several scale scores, its
reliability is usually not just the mean of the reliabilities for the scales being com-
bined. The issue of reliability estimation for composite scores is addressed in this sec-
tion when the composite score is (a) the sum of two scale scores (e.g., GREs, SATs);
(b) the difference score (e.g., gain score for pretest to posttest measurements or the
difference between two independent scorers of a single set of portfolios); and (c) the
sum of three or more scale scores (e.g., WISC-IV, SCL-90-R).
Reliability of Sum of Scores
Let the composite score $Y$ be the sum of two scale scores, $X_1$ and $X_2$: $Y = X_1 + X_2$.
With the GRE scoring, for example, the composite score is the sum of the verbal and
quantitative scores. The reliability of the sum of two scores, $r_{YY}$, can be estimated as
$$r_{YY} = 1 - \frac{\sigma_1^2 (1 - r_{11}) + \sigma_2^2 (1 - r_{22})}{\sigma_Y^2}, \qquad (3.13)$$
where $\sigma_1^2$ is the variance of $X_1$, that is, $\sigma_1^2 = \mathrm{VAR}(X_1)$; $\sigma_2^2$ is the variance of $X_2$, that
is, $\sigma_2^2 = \mathrm{VAR}(X_2)$; $\sigma_Y^2$ is the variance of the composite score $Y$, that is, $\sigma_Y^2 = \mathrm{VAR}(Y)$;
$r_{11}$ is the reliability of $X_1$; and $r_{22}$ is the reliability of $X_2$.
Think About It 3.4 The estimation of the reliability for a composite
score, $Y = X_1 + X_2$, is illustrated in this example with data from a study on
attitudes and behaviors of students related to their sexual activities.
Specifically, $X_1$ is the score on a scale labeled "Love as Justification for Sexual
Involvement," and $X_2$ is the score on a scale labeled "Sex for Approbation."
With the notations adopted in Equation 3.13, the following results were
obtained from the study data for (a) the variances of $X_1$, $X_2$, and $Y$: $\sigma_1^2 =
13.750$, $\sigma_2^2 = 10.433$, $\sigma_Y^2 = 38.592$; and (b) the reliabilities of $X_1$ and $X_2$:
$r_{11} = 0.8334$, $r_{22} = 0.8217$.
Substituting these values into Equation 3.13, we obtain:
$$r_{YY} = 1 - \frac{13.750(1 - 0.8334) + 10.433(1 - 0.8217)}{38.592} = 0.892.$$
Thus, the reliability estimate of the composite score $Y$ (0.892) in this example
is higher than the reliability estimates of its components, $X_1$ (0.8334)
and $X_2$ (0.8217). While this frequently occurs, it is not always the case. In
reality, the larger the difference between $r_{11}$ and $r_{22}$, and the lower the
correlation between the two components ($r_{12}$), the less likely that $r_{YY}$ will exceed
each individual component's reliability.
Although not explicitly present, the correlation between $X_1$ and $X_2$, denoted
$r_{12}$, affects the reliability of the composite score through the variance of $Y$.
When $X_1$ and $X_2$ do not correlate ($r_{12} = 0$) and have equal variances (as z-scores do),
the reliability of their sum ($Y = X_1 + X_2$) is simply the
average of their reliabilities: $r_{YY} = (r_{11} + r_{22})/2$.
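The calculation in Think About It 3.4 can be reproduced with a brief Python sketch of Equation 3.13. The function name `sum_reliability` and its argument names are illustrative assumptions; the numerical values are those reported in the example.

```python
def sum_reliability(var_1, var_2, var_sum, r_11, r_22):
    """Reliability of Y = X1 + X2 following Equation 3.13."""
    # Error variance contributed by each component: variance * (1 - reliability).
    error_variance = var_1 * (1 - r_11) + var_2 * (1 - r_22)
    return 1 - error_variance / var_sum

r_yy = sum_reliability(var_1=13.750, var_2=10.433, var_sum=38.592,
                       r_11=0.8334, r_22=0.8217)
print(round(r_yy, 3))  # 0.892
```

As a check on the uncorrelated case, two components with equal variances of 1, a sum variance of 2, and reliabilities of 0.8 and 0.6 give a composite reliability of 0.7, the simple average. With unequal variances (say 4 and 1, sum variance 5), the same reliabilities give 0.76, so the simple average applies only when the component variances are equal.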
In many cases, the scores that are combined into a composite score come from
scales with different units of measurement (e.g., 3-point and 5-point survey scales).
Therefore, to present the measurements on a common scale (and for some technical
reasons), the raw scores are often converted into standard scores (z-scores) before
being summed (this is done, for example, with the raw scores of the primary psycho-
logical symptoms measured with the self-report symptom inventory SCL-90-R). For
the special case of standard (z-) scores, Equation 3.13 is converted into a simpler
form (Equation 3.14):
$$r_{YY} = 1 - \frac{2 - (r_{11} + r_{22})}{\sigma_{Y_Z}^2}, \qquad (3.14)$$
where $\sigma_{Y_Z}^2$ is the variance of the sum of the z-scores for $X_1$ and $X_2$ (i.e., $Y_Z =
z_1 + z_2$), $r_{11}$ is the reliability of $X_1$, and $r_{22}$ is the reliability of $X_2$. Assume that
$\sigma_{Y_Z}^2 = 3.203$ and that $r_{11} = 0.8334$ and $r_{22} = 0.8217$. With this, using Equation
3.14, we obtain the value for the reliability of the composite score $Y = X_1 + X_2$
(or, equivalently, for $Y_Z = z_1 + z_2$):
$$r_{YY} = 1 - \frac{2 - (0.8334 + 0.8217)}{3.203} = 0.892.$$
Note that Equation 3.14 follows directly from Equation 3.13, taking into ac-
count that the variance of the standard (z-) scores for any variable is 1 and, thus,
$\sigma^2(z_1) + \sigma^2(z_2) = 2$.
Equations 3.13 and 3.14 can be readily extended for cases where the compos-
ite score is a sum of more than two scale scores (e.g., Nunnally & Bernstein, 1994).
For the sum of three scores, for example, the reliability of the composite score $Y =
X_1 + X_2 + X_3$ can be estimated by extending Equation 3.14 to form Equation 3.15
as follows:
$$r_{YY} = 1 - \frac{3 - (r_{11} + r_{22} + r_{33})}{\sigma_{Y_Z}^2}, \qquad (3.15)$$
where $\sigma_{Y_Z}^2$ is the variance of the sum of the standard (z-) scores for $X_1$, $X_2$, and $X_3$,
that is, $Y_Z = z_1 + z_2 + z_3$ ($r_{11}$, $r_{22}$, and $r_{33}$ are the reliabilities for $X_1$, $X_2$, and $X_3$,
respectively).
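Equations 3.14 and 3.15 appear to share one pattern: the number of components minus the sum of their reliabilities, divided by the variance of the summed z-scores. The Python sketch below assumes that pattern holds for any number of components; the function name and the three-scale values are hypothetical, while the two-scale values come from the example for Equation 3.14.

```python
def z_sum_reliability(reliabilities, var_z_sum):
    """Reliability of a sum of z-scores (pattern of Equations 3.14 and 3.15)."""
    k = len(reliabilities)  # each z-score has variance 1, so the variances total k
    return 1 - (k - sum(reliabilities)) / var_z_sum

# Two components, values from the worked example for Equation 3.14.
two_scale = z_sum_reliability([0.8334, 0.8217], var_z_sum=3.203)
print(round(two_scale, 3))  # 0.892

# Three components: the third reliability and the variance are hypothetical.
three_scale = z_sum_reliability([0.8334, 0.8217, 0.80], var_z_sum=6.0)
print(round(three_scale, 3))
```

Passing the reliabilities as a list keeps the same function usable for two, three, or more scales without rewriting the formula.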
Reliability of Difference Scores
The difference between two observed scores for the same person, called a difference
score, is widely used in behavioral research primarily (a) to measure the person's
growth across time points and (b) to compare the person's scores on academic, psy-
chological, or personality variables. For example, measurement of change using the
person's difference (or gain) score from pretest to posttest is used to assess the effect
of specific educational programs, counseling treatments, and rehabilitation services
or allied health interventions, all important facets of outcomes research in the men-
tal health field. Clearly, the quality of the results and the validity of interpretations
in studies on change and profile analysis depend, among other things, on the relia-
bility of difference scores.
Think About It 3.5 The data in this example also come from the study on
attitudes and behaviors of students related to their sexual activities. However,
instead of summing the scores on two scales, the composite score is now the
difference (gain) from pretreatment to posttreatment measurements on a
scale labeled "Self-affirmation"; that is, $Y = X_2 - X_1$, where $X_1$ is the pretreatment
score and $X_2$ the posttreatment score on this scale. With the study data,
the variance of the difference $Y_Z = z_2 - z_1$ (where $z_1$ and $z_2$ are the standard
score values for $X_1$ and $X_2$) was found to be $\sigma_{Y_Z}^2 = 0.786$.
The reliability coefficients (alpha coefficients) for $X_1$ and $X_2$ were $r_{11} =
0.8282$ and $r_{22} = 0.8374$, respectively. Using Equation 3.14, the reliability of
the difference scores is
$$r_{YY} = 1 - \frac{2 - (0.8282 + 0.8374)}{0.786} = 0.575.$$
Evidently, the reliability of the difference score (0.575) is smaller than
the reliability of the scores entering the difference (0.8282 and 0.8374). As
noted earlier, the reliability of the difference score, $r_{YY}$, is (implicitly) influenced
by the correlation between $X_1$ and $X_2$ (in this case, $r_{12} = 0.606$), because
this correlation affects the value of $\sigma_{Y_Z}^2$ in Equation 3.14.
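The difference-score calculation in Think About It 3.5 can be sketched in Python as follows. The function names are illustrative assumptions. The helper `z_difference_variance` relies on the standard identity that the variance of a difference of two z-scores equals 2 minus twice their correlation, which is how the correlation between pretest and posttest enters the result.

```python
def difference_reliability(r_11, r_22, var_z_diff):
    """Reliability of a difference of z-scores, using the form of Equation 3.14."""
    return 1 - (2 - (r_11 + r_22)) / var_z_diff

def z_difference_variance(r_12):
    """Variance of z2 - z1 when z1 and z2 correlate r_12."""
    return 2 - 2 * r_12

# Values reported in Think About It 3.5.
gain_reliability = difference_reliability(0.8282, 0.8374, var_z_diff=0.786)
print(round(gain_reliability, 3))  # 0.575

# A higher pretest-posttest correlation (hypothetical 0.80) lowers the estimate.
lower = difference_reliability(0.8282, 0.8374, z_difference_variance(0.80))
print(round(lower, 3))
```

The second call suggests why gain scores are often unreliable in practice: the more strongly pretest and posttest scores correlate, the smaller the variance of the difference and the lower the resulting reliability.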
The use of difference (gain) scores in measurement of change has been
criticized because of the (generally false) assertion that the difference between
scores is less reliable than the scores themselves (e.g., Cronbach & Furby,
1970; Linn & Slinde, 1977; Lord, 1956). This assertion is true, however, if
the pretest scores and the posttest scores have equal variances and equal relia-
bility. When this is not the case, which may happen in many measurement
situations, the reliability of the gain score is reasonably high (e.g., Overall &
Woodward, 1975; Zimmerman & Williams, 1982). The relatively low relia-
bility of gain scores does not preclude valid testing of the null hypothesis of
zero mean gain score in a population of examinees, but it is not appropriate
to correlate the gain score with other variables for these examinees. An im-
portant practical implication is that, without ignoring the caution urged by
some authors, researchers should not automatically discard gain scores and should be
aware of when gain scores are useful.
Reliability of Weighted Sums
When different components are of varying importance, but need to be combined
into a composite score, the components must first be "weighted" before being combined.
Let the scores from two tests, $X_1$ and $X_2$, have different "weights" ($w_1$ and $w_2$,
respectively) in a composite score, $Y = w_1 X_1 + w_2 X_2$. To estimate the reliability of the
composite score, $Y$, given the reliabilities of $X_1$ and $X_2$, one can (for simplicity) use
the weighted composite score, $Y_Z$, of the standardized variables $Z_1$ and $Z_2$, which are
obtained by transforming the raw scores of $X_1$ and $X_2$ into z-scores. That is,
$Y_Z = w_1 Z_1 + w_2 Z_2$.
With this, the reliability of the composite score, $Y$ (or $Y_Z$), is given by Equation
3.16:
$$r_{YY} = 1 - \frac{(1 - r_{11})\, w_1^2 + (1 - r_{22})\, w_2^2}{\sigma_{Y_Z}^2}, \qquad (3.16)$$
where $r_{YY}$ is the reliability of the composite score $Y$ (or $Y_Z$), $r_{11}$ is the reliability of $X_1$,
$r_{22}$ is the reliability of $X_2$, and $\sigma_{Y_Z}^2$ is the variance of the composite score $Y_Z$ (the
weighted sum of $Z_1$ and $Z_2$).
Think About It 3.6 The examination score of counseling students in a
lifespan development course is obtained as a composite score of midterm and
final examinations, with 40% importance assigned to the midterm and 60%
importance to the final examination. The task is to estimate the reliability of
the composite score.
The reliability estimates (Cronbach's alpha coefficients) for the scores on
the first test, $X_1$ (midterm), and the second test, $X_2$ (final), are $r_{11} = 0.72$ and
$r_{22} = 0.80$, respectively. Given that the weight for $X_1$ is $w_1 = 0.4$ (40% importance)
and the weight for $X_2$ is $w_2 = 0.6$ (60% importance), the composite
score is $Y = (0.4)X_1 + (0.6)X_2$. After transforming the scores on $X_1$ and $X_2$
into z-scores to obtain the standardized variables $Z_1$ and $Z_2$, respectively, the
variance of $Y_Z = (0.4)Z_1 + (0.6)Z_2$ is found to be $\sigma_{Y_Z}^2 = 1.27$. Using Equation
3.16, the reliability of the composite score $Y$ is then
$$r_{YY} = 1 - \frac{(1 - 0.72)(0.4)^2 + (1 - 0.80)(0.6)^2}{1.27} = 0.908.$$
Equation 3.16 can be easily extended to estimate the reliability of a
weighted sum of the scores on more than two tests. In the case of three tests,
for example, the reliability of the composite score $Y = w_1 X_1 + w_2 X_2 + w_3 X_3$
can be obtained by extending Equation 3.16 to Equation 3.17:
$$r_{YY} = 1 - \frac{(1 - r_{11})\, w_1^2 + (1 - r_{22})\, w_2^2 + (1 - r_{33})\, w_3^2}{\sigma_{Y_Z}^2}, \qquad (3.17)$$
where $\sigma_{Y_Z}^2$ is the variance of $Y_Z = w_1 Z_1 + w_2 Z_2 + w_3 Z_3$. Equations 3.16 and
3.17 (as well as their extensions for more than three tests) apply equally well
when some of the weights are negative numbers.
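Because Equations 3.16 and 3.17 differ only in the number of terms, one small Python function can cover both. This is a sketch with a hypothetical function name; the two-test values are those reported in Think About It 3.6 and are used exactly as given.

```python
def weighted_composite_reliability(reliabilities, weights, var_composite):
    """Reliability of a weighted sum of z-scores (Equations 3.16 and 3.17)."""
    if len(reliabilities) != len(weights):
        raise ValueError("need one weight per component")
    # Each component contributes (1 - reliability) times its squared weight.
    weighted_error = sum((1 - r) * w ** 2 for r, w in zip(reliabilities, weights))
    return 1 - weighted_error / var_composite

# Midterm (40%) and final (60%) examination from Think About It 3.6.
exam_reliability = weighted_composite_reliability(
    reliabilities=[0.72, 0.80], weights=[0.4, 0.6], var_composite=1.27)
print(round(exam_reliability, 3))  # 0.908
```

Squaring the weights means a negative weight contributes error variance just as a positive one does, which is consistent with the remark that the equations also apply when some weights are negative. One caution when reusing the sketch: for strictly standardized components the variance of $0.4 Z_1 + 0.6 Z_2$ equals $0.52 + 0.48\, r_{12}$ and cannot exceed 1, so the reported value of 1.27 presumably rests on a scaling that the example does not describe.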
SUMMARY/CONCLUSION
This chapter has introduced the concept of reliability, types of reliability, different
methods of estimating reliability, and principles in interpreting and comparing reli-
ability coefficients. Generally, reliability of measurements (e.g., test scores and sur-
vey ratings) indicates their accuracy and consistency under random variations in
measurement conditions, such as a person's conditions (e.g., fatigue or mood) and/or
external sources (e.g., noise, temperature, different raters, and different test forms).
In classical test theory, the true score of a person is defined as the theoretical
mean of the observed scores that this person may have under numerous independ-
ent testings with the same test. A basic assumption is that the examinee's observed
score is a sum of the person's true score and an error (X = T + E). Tests with equal true
scores and equal error variances, for any population of examinees, are referred to as
parallel tests. The reliability of test scores is equivalently defined as (a) the correlation
between observed scores on parallel tests, (b) the ratio of true score variance to observed
score variance for the same test, or (c) the squared correlation between observed
and true scores. Standard error of measurement (SEM) is the standard deviation
of the (assumed normal) distribution of the difference between examinees' observed
scores and their true scores.
Five types of classical reliability were discussed in this chapter: internal consistency,
test-retest reliability, alternate form reliability, classification consistency, and
interrater reliability.
Internal consistency estimates of reliability are based on the average correlation
among items within an instrument. If the instrument consists of different scales,
internal consistency should be estimated for each scale. Widely used estimates of
internal consistency are the split-half reliability coefficient and Cronbach's coefficient
alpha (or its equivalent version, KR-20, for dichotomously scored items). It is always
useful to report the internal consistency of test scores even when other types of
reliability are of primary interest. With speed tests, however, it would be misleading to
report estimates of internal consistency.
Test-retest reliability indicates the extent to which persons consistently respond to
the same test, inventory, or questionnaire administered on more than one occasion.
It is estimated by the correlation between the observed scores of the same people
taking the same test twice (coefficient of stability). The major problem with test-retest
reliability estimates is the potential for carryover effects between the two test
administrations (e.g., due to biological maturation, cognitive development, changes in
information, experience, and/or moods). Thus, test-retest reliability estimates are
most appropriate for measurements of traits that are stable across the time period
between the two test administrations (e.g., personality or work values).
Alternate form reliability relates to the consistency of scores on two alternate test
forms administered to the same group of individuals. It is estimated by the correlation
between observed scores on two alternate test forms, referred to also as coefficient
of equivalence. Estimates of alternate form reliability are also subject to carryover
effects. A recommended rule of thumb is to have a 2-week time period between
administrations of alternate test forms.
Criterion-referenced reliability shows the consistency with which decisions about
mastery-nonmastery of a specific objective (criterion) are made, using either the
same test administered on two occasions or alternate test forms. Widely used classical
indices of classification consistency are the observed proportion of consistent
classifications, $P_0$, and Cohen's kappa coefficient, which takes into account consistent
classifications that may occur by chance.
Interrater (or interscorer) reliability refers to the consistency (agreement) in
subjective judgments of raters (experts, judges) used for classification purposes (e.g., to
determine a "minimum level of competency" in pass-fail decisions) or scoring rubrics
in alternative assessments (e.g., portfolios, projects, and products). Depending on
the measurement case, frequently used estimates of interrater reliability are
correlation coefficients, $P_0$, and Cohen's kappa coefficient (or kappa-like coefficients).
Often the person's scores from two or more scales of some instruments are combined
into composite scores to measure and interpret a more general dimension (trait
or proficiency) related to these scales (e.g., achievement, intelligence, aptitude,
depression). Although the composite score may be simply the sum of several scale
scores, its reliability is usually not just the mean of the reliabilities for the scales being
combined. In this chapter, the reliability for composite scores is addressed for cases
when the composite score is a sum (or difference) of scale scores or a weighted sum
of scores.
KEY TERMS
alternate form reliability
confidence interval
internal consistency
interrater reliability
interscorer reliability
normal distribution
observed score
random error
reliability
speed test
split-half reliability
standard error of measurement
systematic error
test-retest reliability
true score
CHAPTER 4
Validity
by Alan Basham and Bradley T. Erford
This chapter focuses on the concept of validity of scores in testing and assessment.
It examines how reliability and validity are related and distinct, the different
methods by which evidence for validity can be established, and key principles
professional counselors should apply in determining whether a test is
appropriate for use with a client or group. Methods for making accurate decisions
using a single test or multiple tests are also discussed.
VALIDITY DEFINED
While reliability indicates the degree to which scores on an instrument are measured
consistently, validity considers the degree to which test scores measure what the test
claims to measure. In both cases, test developers attempt to amass evidence that in-
dicates, either logically or through probability, that test scores are trustworthy. In re-
liability, test scores are trustworthy to the degree that they reflect an accurate assess-
ment of some trait or ability, minimizing randomly occurring testing error. Evidence
for validity, however, is concerned with verifying exactly what the test is measuring.
Test results can be trusted, not just because they can be measured consistently, but
because they measured what they were supposed to measure.
Suppose you and a group of friends had an opportunity to demonstrate your
skills at an archery range. Supplied with a bow and several arrows, you each fired at
targets the same distance away. Some people's arrows hit the target, others careened
off nearby trees and rocks, and one person nearly skewered the instructor with a sin-
gularly wild shot. Only you hit the bulls-eye five shots in a row. The instructor,
thinking you might have been just lucky, gives you five more arrows, all of which
you calmly sink into the center of the target. Clearly, you are the most reliable archer
in the group, because you keep getting the same result over and over. That's reliability,
of course. However, if the amazed instructor asks you how you came to be so
academically gifted, you might be well advised to question the instructor's judgment.
Why? Because your demonstrated consistency at archery has little or nothing to do
with the concept of academic giftedness. Imagine a scholarship program that
awarded grants for tuition in counselor education based on consistency and proficiency
of archery scores. Your archery score may be consistent (and therefore reliable)
but is probably not a reasonable measure of academic potential. The meaning
of the consistent, repetitive bulls-eyes, then, has become a question of validity.
So, the validity of test scores is about two things: (1) what the test actually measures
and (2) how well the test scores measure it (Anastasi & Urbina, 1997). Some
common methods for establishing evidence for validity are described in the
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999). The
Standards identified three major types of evidence for validity: content-related,
criterion-related, and construct-related. While each of these types is distinct in its
approach to demonstrating score validity, it is important not to assume that they are
unrelated to each other. In fact, many test authors use more than one of these techniques
to support the validity of test scores. Much of this chapter is devoted to outlining
these techniques and providing examples. Although face validity is no longer
generally accepted as a legitimate form of validity assessment, a brief discussion of
this type of validity follows.
FACE VALIDITY
Face validity is derived from the obvious appearance of the measure itself and its test
items. Items in instruments marked by face validity ask directly for information that
is expected and wanted by the test user. Face validity is quite appropriate for survey
instruments in which the person being queried is responding to questions such as
"What is your age?" or "What is the highest level of education you completed?"
A major problem with self-report tests with high face validity is that when the
trait or behavior in question is one that many people will not want to reveal about
themselves, the likelihood of a truthful (and therefore valid) answer is minimal. A
well-known example of the problem with face validity is that of the Woodworth
Personal Data Sheet (Woodworth, 1920). This first structured personality test was
developed during World War I for use in screening applicants for the military.
Designed to standardize the psychiatric interview, it was based on the incorrect as-
sumption that the content of an item and people's truthful response to it could be
taken at face value. The assessment device included questions such as "I wet the bed"
and "I drink a quart of whiskey every day," to which the person was asked to respond
yes or no. The false assumption that people would answer such questions truthfully
and that they interpreted the questions the same as everyone else essentially made
the test results untrustworthy (Kaplan & Saccuzzo, 2001). Because of these limitations,
the Standards (AERA et al., 1999) does not include face validity as a legitimate
type of validity in psychological assessment. However, the above information is
noteworthy because professional counselors may see the term in other documents. Even
so, for tests in certain domains (e.g., achievement, intelligence), face validity can add
credibility or acceptance to the assessment process.
CONTENT-RELATED VALIDITY
Content-related validity is widely used in educational testing (Kaplan & Saccuzzo,
2001) and in tests of aptitude or achievement. It is used in achievement tests to de-
termine how well an individual has mastered a skill or the content of a course of
study (Anastasi & Urbina, 1997). The main focus in content-related validity is on
how the instrument was constructed and how the content of the test was determined
(Whiston, 2005). The focus on content reflects the examiner's concern with how
well the test items reflect the domain of the material being tested. The term domain
refers to the total informational field from which the items are drawn.
For example, a teacher of U.S. history could write an exam to assess students'
knowledge of the Civil War. The domain of information from which test items
would be drawn is composed of the dates, battles, important persons, sociopolitical
and economic factors, and causes of the war itself. The test would have validity to the
extent that its content reflected all the important aspects of the domain of Civil War
knowledge. A test that asked only about specific battles but ignored persons, causes,
and political outcomes would hardly yield valid test scores of one's comprehensive
knowledge of the U.S. Civil War.
Determining the content validity of a test requires a systematic evaluation of the
test items to determine whether adequate coverage of a representative sample of the
content domain was measured (Anastasi & Urbina, 1997). Obviously, the test can-
not ask questions about all the information in the domain, but it should contain
some items that assess knowledge of each of the domain's areas or categories. The domain
itself should be examined to make sure that all major aspects are covered by
the test, and the test should be constructed so that the number of items from each
category within the domain is consistent with the size and importance of that cate-
gory. Demonstrating how the test is constructed to represent the content of the do-
main provides evidence of content-related validity. The following is another exam-
ple illustrating the concept.
Most professional counselors have taken a graduate course in counseling theo-
ries. Imagine an exam (much like one you have probably come across yourself in your
academic journey) that covered the counseling theories of Freud, Adler, Jung, Ellis,
and Rogers. (Please note that the number of theorists is limited here for the sake of a
manageable example). To create a test that assessed knowledge of these progenitors
and their contributions to the field, the professor would first analyze the important
content areas of the domain. Let's assume the professor divided the overall domain of
each of these five therapeutic pioneers into five subcategories identifying the salient
content of each, including theoretical underpinnings of the model, therapeutic tech-
niques, history of the founder, differences between each model and the others, and
important terms and concepts unique to the model. The professor's organized analy-
sis of the domain would look something like that contained in Table 4.1.
Table 4.1 Content analysis of important information regarding five counseling theorists

Freud                 Adler                 Jung                  Ellis                 Rogers
Theory                Theory                Theory                Theory                Theory
Techniques            Techniques            Techniques            Techniques            Techniques
History               History               History               History               History
Differences           Differences           Differences           Differences           Differences
Terms, i.e.,          Terms, i.e.,          Terms, i.e.,          Terms, i.e.,          Terms, i.e.,
id, ego, superego     inferiority complex   archetypes, shadow    catastrophizing,      unconditional
                                                                  A-B-C-D-E             positive regard
The professor would then write items reflecting each of the 25 categories listed
above and select items from each category. If the items of the test adequately assessed
some knowledge of each area of the domain, the test would have content-related validity.
However, if the professor asked questions only about the terms of Jungian psychology,
the history of Sigmund Freud's life, and the techniques of Rational
Emotive Behavior Therapy (REBT) (Ellis), the test items probably would not be
valid measures of the content under study because the questions did not adequately
reflect knowledge of the domain being considered.
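The systematic check of content coverage described above can be sketched as a small Python routine that compares the items on an exam with the 25 cells of the content analysis. The item tags for the narrow exam are hypothetical and mirror the example in the preceding paragraph; the function name is likewise an illustrative assumption.

```python
from itertools import product

theorists = ["Freud", "Adler", "Jung", "Ellis", "Rogers"]
categories = ["Theory", "Techniques", "History", "Differences", "Terms"]

def uncovered_cells(items):
    """Return the (theorist, category) cells that no test item addresses."""
    covered = set(items)
    return [cell for cell in product(theorists, categories) if cell not in covered]

# Hypothetical exam that samples only three cells of the 25-cell domain.
narrow_exam = [("Jung", "Terms"), ("Freud", "History"), ("Ellis", "Techniques")]

missing = uncovered_cells(narrow_exam)
print(len(missing))  # 22
```

A list of uncovered cells this long would signal weak content-related evidence, whereas an exam with at least one item in every cell returns an empty list. The sketch checks coverage only; it does not judge whether the number of items per cell matches the importance of that cell.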
CRITERION-RELATED VALIDITY
Criterion-related validity is derived from comparing scores on the test to scores on
a selected criterion. What is a criterion? It is a person's performance score on
activities the test is designed to predict. Specifically, a sample of participants in the
validation study has two scores that may be correlated with each other. One is the person's
score on the test being studied, and the other is a score indicating the person's actual
level of ability in the skill or behavior under question as measured by some criterion.
The Scholastic Assessment Test (SAT) and Graduate Record Examination (GRE), for
example, are used to predict performance in college and graduate school, respectively. The
criterion measure for each of these tests is actual academic performance as measured
by grade point average at some point later in the students' academic career. Similarly,
the Armed Services Vocational Aptitude Battery (ASVAB) (USMEPCOM, 2005) is de-
signed to identify the occupational specialties in which military personnel will be
most skilled, given the proper level of training. Job performance in the military is
the criterion measure for the ASVAB.
Anastasi and Urbina (1997) delineate several sources of criterion scores:
■ Academic achievement, such as school grades and achievement test scores.
■ The amount of education a person has.
■ Performance in specialized training, such as music, accounting, or flying airplanes.
■ Job performance, including in business, industry, and the military.
■ Psychiatric diagnosis, which is used especially in development of tests measuring
personality and psychopathology.
■ Ratings by job supervisors, teachers, and others in a position to evaluate the per-
formance effectiveness of subordinates.
■ Correlations with a previously available test, especially when the new test is a sim-
pler form of the original test.
There are two forms of criterion-related validity, predictive criterion-related
validity and concurrent criterion-related validity. The main difference between
the two is when the criterion measure is taken. In predictive criterion-related va-
lidity, the test is administered first, and scores on the criterion measure are col-
lected on the same sample of persons at a later date (i.e., some time in the future).
In concurrent criterion-related validity, the scores on the test and criterion meas-
ure are collected at the same point in time. Let's consider examples of each form
of criterion-related validity.
Suppose that a professional counselor has been asked by a local business owner,
Ms. Schmidlapp, to help her make more accurate hiring decisions at her factory, the
Schmidlapp Widget Company. Ms. Schmidlapp wants the professional counselor to
construct a test that will enable her to select those job applicants who will be most ef-
fective at widget assembly. The professional counselor develops a test believed to help
her make the right choices and conducts the necessary studies to determine that, in
fact, the test scores are quite reliable. However, the professional counselor does not
yet know whether the test scores are valid measures of one's potential as a widget as-
sembler. The professional counselor gives the next 100 job applicants the test, Ms.
Schmidlapp hires them all on a three-month probationary status, and three months
later each new employee is observed to identify the number of flawless widgets assem-
bled in one week. Each employee's score on the test (predictor) is correlated with the
employee's (score on) widget assembly proficiency (criterion). The direction and mag-
nitude of the correlation between predictor and criterion variables tells the profes-
sional counselor the degree to which the test is associated with assembly skill. Because
the criterion measure was collected some time later than the predictor, this study
measured the test's predictive criterion-related validity.
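The validity coefficient in this study is simply the Pearson correlation between test scores and later criterion scores. The sketch below illustrates the calculation; the ten score pairs are invented for illustration (the study described above used 100 applicants), and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: test score at hiring (predictor) and flawless widgets
# assembled in one week, observed three months later (criterion).
test_scores = [32, 41, 45, 50, 52, 55, 58, 61, 66, 70]
widgets = [27, 35, 34, 41, 40, 45, 44, 49, 50, 55]

validity_coefficient = pearson_r(test_scores, widgets)
print(f"predictive validity coefficient r = {validity_coefficient:.2f}")
```

The sign of the coefficient gives the direction of the relationship and its absolute size gives the magnitude.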
Of course, there are some problems with using this form of validity assessment.
First, the employer has to hire all the applicants in the pool of 100, regardless of abil-
ity or test scores, so that the predictor test accuracy will not be compromised by re-
stricted range. If Ms. Schmidlapp hires only "qualified applicants," she will have cri-
terion scores only on qualified applicants. How will she know if the test will identify
unqualified applicants if the sample will contain no unqualified applicants?
However, hiring everyone can create some major costs in terms of lost productivity
and dissatisfied customers who receive faulty widgets. Thus, the delay between col-
lection of predictor and criterion measures means that the problem the test was de-
signed to resolve continues for that length of time. Second, conducting a time-de-
layed study creates the risk of attrition, in which one may lose some of the original
sample (and their criterion scores) because they quit the job, go on sick leave, or are
rapidly promoted to management.
Concurrent criterion-related validity solves some of these problems but creates oth-
ers. In this scenario, the professional counselor creates the test and conducts reliabil-
ity studies, just as explained above. Then the professional counselor administers the
test to all current employees who assemble widgets and assesses their level of pro-
ductivity at the same time. Finally, the professional counselor correlates the scores to
determine the relationship between scores on the test and concurrent efficiency of
widget assembly. As before, if high scores on the test are associated with high efficiency
at widget assembly and low scores with low proficiency, the professional counselor has
established criterion-related evidence for validity, and Ms. Schmidlapp will probably
give the professional counselor a bonus. However, the major problem with the test
scores is that they are likely afflicted with a restricted range in the sample of current
employees. If the test is supposed to identify which job applicants have an aptitude for
widget assembly and which do not, how do we know it will do so when the criterion
measure is derived only from those who can assemble widgets, as evidenced by their
employment? That is, Ms. Schmidlapp has probably already rid her employee pro-
duction line of inefficient widget assemblers, some, no doubt, reassigned to manage-
ment. The advantage of the concurrent method, though, is that there is no long delay
in the construction of the test, with all its real-world adverse effects, and no risk of
attrition.
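The effect of restricted range can be seen in a small simulation. The sketch below is illustrative only: the applicant pool, the score distributions, and the noise level are all assumed values, and the random seed is fixed so the run is repeatable. It compares the test-criterion correlation in a full pool with the correlation among only the more productive half, which stands in for retained employees.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumed pool: productivity depends on the test score plus random noise.
N = 5000
test_scores = [random.gauss(50, 10) for _ in range(N)]
productivity = [5.0 + 0.7 * t + random.gauss(0, 7) for t in test_scores]

r_full = pearson_r(test_scores, productivity)

# Keep only the more productive half, as if the rest had already left the line.
median_productivity = sorted(productivity)[N // 2]
retained = [(t, p) for t, p in zip(test_scores, productivity)
            if p >= median_productivity]
r_restricted = pearson_r([t for t, _ in retained], [p for _, p in retained])

print(f"full pool r = {r_full:.2f}, retained employees r = {r_restricted:.2f}")
```

Under these assumptions the correlation in the retained group is noticeably lower than in the full pool, even though the underlying relationship between test and criterion is unchanged.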
Perhaps you can readily see how this same scenario would apply to the con-
struction of aptitude and achievement tests as predictors of future performance.
With the SAT, for example, one could give the test to a group of high school stu-
dents and later correlate each student's score to the student's college grade point
average. To be most accurate, though, all the high school students should be ad-
mitted to college, preferably the same college. This presents obvious problems. One
could also give the test to a group of current college students and compare their
test scores to their college grade point averages. The problem, of course, is that of
restricted range, again; only college students are in the sample, but the test is in-
tended to be used with those who are still in high school. The above examples are
of predictive and concurrent criterion-related validity, respectively.
To determine with even greater certainty how valid test scores are and how ac-
curately each predicts future behavior, one can develop a prediction equation repre-
senting the relationship between the predictor and criterion measures and then cal-
culate the standard error of estimate in one's predictions.
Standard Error of Estimate
A correlation coefficient represents the relationship between two variables (X and
Y). A correlation of 0.87 means the same thing, no matter what variables X and Y
are. Their relationship in this case is a positive one; as X increases, Y increases. High
scores on X are associated with high scores on Y; low scores on X are associated with
low scores on Y. Squaring the correlation coefficient produces the coefficient of
determination ($r^2$), the amount of variability in X that is accounted for by the
variability in Y.
Recall also that the relationship between X and Y can be represented by a regression
equation:
Y' = a + bX (4.1)
This equation is the algebraic formula that indicates both the slope and inter-
cept of the line that is closest to all the data points in a scatter diagram or bivariate
data plot. The intercept (a) is the point at which the line crosses the vertical y axis.
The intercept may also be defined as the value of Y when X = 0. The slope (b) is the
amount Y increases when X increases by 1.0. Like the correlation coefficient, the
slope and intercept are calculated using the scores in the X and Y distributions. Once
these statistical values are determined, the equation for the regression line can be
used to predict the value of Y when we have a value of X for a given person.
For example, consider a prediction equation derived from two variables, the
predictor (X) and criterion (Y): Y' = 5.0 + 0.7X. We can use this quantified relationship
between X and Y to predict a person's eventual performance on Y (Y') using their
score on our test, X. A person whose score on X is 30 would have a predicted value
of 26 on Y [Y' = 5.0 + (0.7 × 30) = 26].
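The prediction step can be written as a one-line function. This is a minimal sketch using the intercept and slope from the example above; the function name `predict_criterion` is hypothetical.

```python
def predict_criterion(x, intercept=5.0, slope=0.7):
    """Return the predicted criterion score Y' = a + bX."""
    return intercept + slope * x

# A person with a test score of 30, as in the worked example.
predicted = predict_criterion(30)
print(f"Y' = {predicted:.1f}")
```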
This process of prediction is widely used in education, business, and industry.
Standards called cutoff scores are often set by those making decisions about hiring,
promotion, and admission to educational and occupational training opportunities.
Those whose predicted performance on the criterion variable is below the cutoff
score are not likely to be selected, while those attaining the highest scores are.
Because decisions affecting people's lives are made using their scores on predictor
variables, it is imperative to have the most accurate tests we can and to know just
how accurate a given test is. The method used to determine the accuracy of predic-
tion is the standard error of estimate.
The standard error of estimate ($SE_{est}$) is derived from examining the difference
between our predicted value of the criterion (Y') and the person's actual score on the
criterion (Y). This difference is known as prediction error or residual. Recall that all
test scores contain some degree of random error, and that a reliable test is one that
produces scores that are mostly truth with little error. However, there is no such
thing as a perfectly reliable test. Further, both our predictor and criterion measures
are imperfect, despite our best efforts. Knowing this, it is certain that, even with the
best of criterion and predictor measures, we are destined to be inaccurate to some
degree when estimating future scores using a prediction equation. Fortunately, $SE_{est}$
enables us to determine how accurate test scores are likely to be.
The easiest way to understand $SE_{est}$ is to reflect on the concept of the standard
deviation of a sample of scores. The standard deviation is the average amount of dis-
tance between a given score and the sample mean in a distribution of scores. Using
the standard deviation, we can determine how far away from the mean the scores
tend to be, and thus how accurately the mean represents the sample scores as a meas-
ure of central tendency. A large standard deviation indicates that the scores are spread
widely around the mean; a small standard deviation indicates less variability because
the scores tend to be clustered around the mean.
The standard error of estimate operates in a similar fashion, quantifying the av-
erage distance between predicted scores and persons' actual scores on the criterion.
A large $SE_{est}$ indicates that we are typically not very accurate in our predictions of a
person's eventual performance on the criterion measure. This means that, however
noble our intentions, our test is not a very good one, at least for this purpose.
However, if the $SE_{est}$ is small, our predictions, though not perfect, on average are
coming close to the person's eventual performance on the criterion measure. The
formulas for the standard error of estimate are

$$SE_{est} = \sqrt{\frac{\sum (Y - Y')^2}{N - 2}} \tag{4.2}$$

or

$$SE_{est} = s_Y \sqrt{1 - r_{XY}^2} \tag{4.3}$$

In Equation 4.2, each person's predicted score (Y') is subtracted from the
person's criterion score (Y). This residual is then squared for each person, and all the
squared residuals are added up. This numerator is divided by a denominator, the
value of which is the number of persons in the sample minus two (N - 2). Take the
square root to obtain $SE_{est}$. Equation 4.3 multiplies the square root of 1 minus the
square of the validity coefficient ($r_{XY}^2$) times the standard deviation of the criterion
scores ($s_Y$) in the validity study.
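Both formulas are short enough to sketch in code. The function names and the six pairs of scores below are hypothetical, chosen only so the arithmetic is easy to follow; the predicted scores are not the output of a fitted regression.

```python
from math import sqrt

def se_est_from_residuals(actual, predicted):
    """Equation 4.2: square root of summed squared residuals over (N - 2)."""
    n = len(actual)
    ss_residual = sum((y - y_pred) ** 2 for y, y_pred in zip(actual, predicted))
    return sqrt(ss_residual / (n - 2))

def se_est_from_validity(sd_criterion, r_xy):
    """Equation 4.3: criterion SD times the square root of (1 - r squared)."""
    return sd_criterion * sqrt(1 - r_xy ** 2)

# Hypothetical criterion scores and predictions for six people.
actual_y = [24, 30, 27, 35, 33, 40]
predicted_y = [26, 28, 29, 33, 35, 38]

print(round(se_est_from_residuals(actual_y, predicted_y), 2))
print(round(se_est_from_validity(sd_criterion=10, r_xy=0.6), 2))
```

When the predictions come from a least-squares regression line, the two formulas agree closely; any small difference comes from whether the criterion standard deviation is computed with N - 1 or N - 2 in its denominator.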
The $SE_{est}$ can be used to identify the overall level of accuracy of predictions
by referring to a table of areas under the normal curve. For example, the area of
the normal curve that lies between a z-score of 1.96 and -1.96 is 0.95, or 95% of
the area. In a normal distribution of scores, 95% of the scores fall between these
points. Similarly, because errors of prediction are random, 95% of criterion scores
lie within 1.96 standard errors of estimate (1.96 × $SE_{est}$) of their predicted value.
Consider a distribution of predicted scores with a $SE_{est}$ of 2.0. If a person's predicted
score (Y') on the criterion measure was 30, simple arithmetic would indicate a 95%
probability that the person's actual eventual criterion score would be somewhere between
33.92 [30 + (1.96 × 2.0) = 30 + 3.92 = 33.92] and 26.08 [30 - (1.96 × 2.0) =
30 - 3.92 = 26.08]. Whether this is accurate enough is a judgment call, embedded
with ethical ramifications, made by those using the test against cutoff scores.
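The band around a predicted score is a single multiplication and two additions. A minimal sketch, with the hypothetical function name `prediction_band` and the values from the example above:

```python
def prediction_band(predicted, se_est, z=1.96):
    """Return the (lower, upper) band of predicted +/- z * SE_est."""
    margin = z * se_est
    return predicted - margin, predicted + margin

# Predicted score of 30 with a standard error of estimate of 2.0.
low, high = prediction_band(30, 2.0)
print(f"95% band: {low:.2f} to {high:.2f}")
```

Passing a different `z` (for example 1.0 for roughly 68% of cases) gives a narrower or wider band from the same standard error of estimate.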
To conclude this section, let's return to the Schmidlapp Widget Company with
the prediction equation (Y' = 5.0 + 0.7X) and standard error of estimate ($SE_{est}$ =
2.0). Ms. Schmidlapp has informed you that, on average, her employees must be
able to assemble 40 widgets per week to keep the company solvent. To be on the safe
side, Ms. Schmidlapp determines that no applicants should be hired whose predicted
score on the criterion (Y') is less than 40. What cutoff score on the test should the
professional counselor recommend? Substituting the available values into the
prediction equation, 40 = 5.0 + 0.7X, and using simple algebra, it is determined
that X = 50. Thus, the professional counselor recommends that Ms. Schmidlapp hire
only those applicants who score 50 or higher on the test, understanding that some
will produce fewer than 40 widgets and some will produce more than 40. In fact,
Ms. Schmidlapp can be 95% certain that an applicant with a test score of 50 will
produce somewhere between 36.08 and 43.92 widgets each week. Of course,
because the test is imperfect (as all tests are), Ms. Schmidlapp will hire a few applicants
who will not perform adequately and will not hire others who could have made nu-
merous magnificent widgets. Also, if Ms. Schmidlapp needs to boost profits at some
point, she always can raise the present minimum acceptable score of 50 to a higher
score.
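The cutoff calculation is the prediction equation solved for X. The sketch below uses the Schmidlapp values; the function names are hypothetical.

```python
def cutoff_on_test(required_criterion, intercept=5.0, slope=0.7):
    """Solve Y' = a + bX for X, giving the test score that predicts Y'."""
    return (required_criterion - intercept) / slope

def prediction_band(predicted, se_est, z=1.96):
    """Return the (lower, upper) band of predicted +/- z * SE_est."""
    margin = z * se_est
    return predicted - margin, predicted + margin

# Ms. Schmidlapp requires a predicted output of 40 widgets per week.
cutoff = cutoff_on_test(40)
low, high = prediction_band(40, 2.0)
print(f"recommended test cutoff: {cutoff:.0f}")
print(f"expected weekly output at the cutoff: {low:.2f} to {high:.2f}")
```

Raising the required criterion value in the first call shows how a stricter production target translates into a higher test cutoff.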
Think About It 4.1 As an example of how to calculate the Standard
Error of Estimate (SEE), assume that for the T score scale (M = 50 and
SD = 10) of the Conners' Adult ADHD Rating Scales (CAARS) Diagnostic
and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV)
Inattention subscale, the score reliability for a sample of clients is 0.91.
Using Equation 3.5 for the SEM and Equation 4.3 for the SEE, we obtain:
$$SEM = 10\sqrt{1 - 0.91} = 3.00 \quad \text{and} \quad SEE = 10\sqrt{(0.91)(1 - 0.91)} = 2.86.$$
CONSTRUCT VALIDITY
Evidence for construct validity is established by defining the construct being meas-
ured and by gradually collecting information over time to demonstrate or confirm
the meaning of what the test measures (Kaplan & Saccuzzo, 2001). Construct valid-
ity is widely used in assessment of theoretically defined domains, such as personality
traits, psychological disorders, and intelligence. In each case, the test author carefully
defines the construct under consideration, then designs a test to measure it, and col-
lects evidence supporting the validity of the test as a measure of the construct. The
principal means by which construct validity is established include convergent evi-
dence, discriminant evidence, factor analysis, meta-analysis, developmental changes,
and distinct groups (Whiston, 2005).
Convergent validity evidence is gathered by correlating the scores on a test with
scores on other tests believed to measure the same or very similar constructs. High
positive correlations are evidence of convergent validity, in that scores on the two
tests converge on each other, pointing toward the same psychological characteristic.
For example, both the Minnesota Multiphasic Personality Inventory— Second Edition
(MMPI-2) (Butcher et al., 2001) and the California Psychological Inventory (CPI)
(Gough & Bradley, 1996) have scales that measure the construct "Dominance."
A high positive correlation between scores on the two scales would be convergent validity evidence. A strong negative
correlation with a scale that measures the same trait using a reversed scaling method
or measures an opposite trait would also indicate convergence. For example, valid
scores on a scale on dominance should be expected to correlate negatively with a
scale that measures passivity.
Discriminant validity evidence is derived by demonstrating that test scores are
not highly correlated with measures of other, unrelated constructs. A personality
scale that accurately measures self-esteem should not correlate highly with a measure
of extraversion, though high levels of self-esteem may be associated with social par-
ticipation in some people. Introverts with high self-esteem, however, will not be as
likely to engage in social activity with people they do not know well. To be a distinct
measure of self-esteem, scores on the instrument in question should not be impacted
by the introversion or extraversion of the test taker, theoretically speaking. Low cor-
relations between measures of these unrelated constructs provide evidence of dis-
criminant validity. Discriminant and convergent techniques are especially important
in the validation of personality tests (Anastasi & Urbina, 1997).
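Convergent and discriminant evidence both come down to comparing correlations. In the sketch below, all three sets of scores are invented for eight hypothetical test takers: a new self-esteem scale, an established self-esteem scale (the same construct), and an extraversion scale (an unrelated construct).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same eight people on three scales.
new_self_esteem = [10, 12, 14, 16, 18, 20, 22, 24]
established_self_esteem = [11, 12, 15, 15, 19, 21, 21, 26]
extraversion = [15, 25, 18, 22, 23, 18, 25, 14]

r_convergent = pearson_r(new_self_esteem, established_self_esteem)
r_discriminant = pearson_r(new_self_esteem, extraversion)

print(f"convergent r = {r_convergent:.2f}")      # expected to be high
print(f"discriminant r = {r_discriminant:.2f}")  # expected to be near zero
```

A high first coefficient together with a near-zero second coefficient is the pattern that supports the new scale as a distinct measure of its construct.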
Think About It 4.2 How is a combination of evidence of convergent
and discriminant validity useful in determining the overall validity of test
scores?
Factor analysis conducts a complex statistical evaluation to determine the degree
to which the items contained in two separate instruments tend to group together
along factors that mathematically indicate similarity, and thus a common meaning. In
addition, factor analysis can determine to what degree the subscales of two tests are
similar to each other, as indicated by their lining up together on factorial vectors
(Whiston, 2005). For example, subscales measuring dominance, sensitivity, or toler-
ance should line up with similar scales on another test if the evaluated scores are valid.
Meta-analysis considers the results of a number of validation studies, combining
the results to identify an overall effect, if one exists. Synthesizing the results of numer-
ous validity studies can demonstrate strong evidence for the validity of a given test.
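The text does not specify how results are combined. One common approach, shown here purely as an assumption, is to average validity coefficients after a Fisher z transformation, weighting each study by its sample size minus three. The four studies below are invented for illustration.

```python
from math import atanh, tanh

# Hypothetical validation studies: (validity coefficient, sample size).
studies = [(0.42, 120), (0.35, 80), (0.50, 200), (0.38, 60)]

def pooled_validity(studies):
    """Weighted mean of Fisher z-transformed correlations, back-transformed to r."""
    weights = [n - 3 for _, n in studies]       # inverse of the variance of Fisher z
    z_values = [atanh(r) for r, _ in studies]   # Fisher r-to-z transformation
    mean_z = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    return tanh(mean_z)                         # transform the mean back to r

pooled_r = pooled_validity(studies)
print(f"pooled validity coefficient = {pooled_r:.2f}")
```

Larger studies pull the pooled value toward their own coefficients, which is why the result sits closer to the 0.50 reported by the largest sample than a simple average would.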
Developmental changes indicate support for the construct validity of a test
when the test measures changes that are expected to occur over time. For example,
we may be interested in measuring the thinking processes of children in light of
Piaget's model of cognitive development. A valid test would discriminate between
concrete operations and formal operations and would show increased levels of for-
mal operations thought among young people as they moved from childhood to ado-
lescence, as is expected developmentally. More generally, older children would be ex-
pected to obtain higher raw scores on intelligence or achievement tests than younger
children. Note that developmental age or grade changes are necessary but not suffi-
cient conditions for establishing construct validity; that is, achievement test scores
had better become higher as children get older, or the test developer has some real ex-
plaining to do.
Distinct groups can provide evidence of construct validity if their scores are dif-
ferent in an expected direction from scores of people in other groups or the general
population. If we had a test designed to measure leadership, we would expect a group
of military officers to score higher, on average, than the general population. Because
the identified distinct group is logically assumed to possess the characteristic in ques-
tion, one expects them to score high on the test. The degree to which they do indi-
cates the extent to which the test measures leadership.
In conclusion, it is important to remember the crucial step of defining the
construct carefully before attempting to demonstrate the validity of an assessment
instrument. There are many tests that measure intelligence, self-esteem, depres-
sion, and marital compatibility, to name just a few constructs. No two tests are
necessarily measuring the same construct just because they use the same name for
that construct. Referring back to an earlier example, both the MMPI-2 and CPI
have scales measuring the personality trait of "Dominance" (Duckworth &
Anderson, 1995; Gough & Bradley, 1996). The CPI's scale defines a dominant
person as "being strong in face-to-face situations and as being able to influence
others, to gain their automatic respect, and, if necessary, to control them" (Gough
& Bradley, 1996, p. 76). The MMPI-2 identifies its Dominance scale as "a fairly
simple measure of a person's ability to take charge of his/her own life" (Duckworth
& Anderson, 1995, p. 340) and as measuring "poise, self-assurance, resourceful-
ness, efficiency, and perseverance" (p. 341). Note that the MMPI-2 Dominance
scale has no indication of a desire to influence or control others, while the CPI's
scale does. In fact, the MMPI-2 scale indicates the desire to influence others only
when other scales are elevated. Both scales carry the name "Dominance," but they
do not measure identical constructs.
Finally, keep in mind that the definitions of various constructs change as soci-
ety evolves and knowledge changes over time. Consider the emergence of emotional
intelligence (Goleman, 1995), a construct derived from research in behavior and the
processes of the brain, but not specifically measured by any of the major intelligence
tests currently in use.
THE INTERACTION OF RELIABILITY AND VALIDITY
Quite simply, a test can never be more valid than it is reliable. Recall that a reliable
test score is a mostly true estimate of a person's actual ability or characteristic, with
only a little error contained in the test. If a test score is mostly composed of testing
error, it cannot possibly be mostly composed of accurate assessment of the construct
or ability in question. Stated another way, because unreliable test scores do not meas-
ure accurately and/or consistently, it is difficult to demonstrate that they measure
any particular construct or ability accurately and consistently. It is possible to have a
reliable test without knowing exactly what it measures. Whatever it measures, a reli-
able test does so consistently. Logically, though, it is not possible to have valid test
scores that are unreliable.
The reliability of predictor and criterion measures in criterion-related assess-
ment is also an important factor in determining test score validity. Equally important
is the reliability of comparison instruments used in convergent and discriminant
construct-related validity and in factor analysis. Using instruments with low reliabil-
ity in an effort to compile validity data on a test of interest inevitably introduces
error into the resultant validity coefficients.
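Classical test theory offers one standard way to quantify this ceiling, which the text does not state explicitly: an observed validity coefficient cannot exceed the square root of the product of the predictor and criterion reliabilities. The sketch below assumes that result; the function name and the reliability values are hypothetical.

```python
from math import sqrt

def max_validity(reliability_test, reliability_criterion=1.0):
    """Upper bound on the observed validity coefficient under classical test theory."""
    return sqrt(reliability_test * reliability_criterion)

# Hypothetical reliabilities for a predictor test and a criterion measure.
ceiling = max_validity(reliability_test=0.81, reliability_criterion=0.64)
print(f"largest attainable validity coefficient = {ceiling:.2f}")
```

Lowering either reliability lowers the ceiling, which is one way to see why unreliable comparison instruments weaken the validity evidence they are used to build.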
VALIDITY AND TESTING PRACTICE
Test validity is important because decisions about which test to use and conclusions
as to what scores indicate about clients are derived from our understanding of what
the test measures. Following are some important considerations when using a partic-
ular test with clients:
■ Because a test cannot be more valid than it is reliable, always become familiar
with the reliability of test scores, including the methods by which evidence for re-
liability was established.
■ Consider the size and makeup of the samples used in reliability and validity stud-
ies. As in other forms of research, smaller samples make it less reasonable to gen-
eralize results of the study to the population (Harris, 1998). If at all possible, the
norming samples should be representative of the client(s) with whom you plan
to use the test. If they are not, use caution in interpretation, taking into
consideration the reality that factors other than those the test is designed to measure may
be affecting your client's score.
■ Examine any test you use for biased items. Items may be more familiar to some
identifiable groups of people than others. For example, a test item picturing a
winter snow scene may be perceived differently by those who grew up in tropi-
cal climates than by those whose winters were routinely snowy.
■ Language is a significant contributor to potential bias, especially if the test is
written in a language in which the test taker is not proficient. Use caution in ap-
plying scores from tests that place the client at a disadvantage due to linguistic
differences.
■ Ethnicity can be a source of response variation in testing. Cultural differences
can lead to different outcomes on a personality test, for example, even when lan-
guage difference is not an issue. One culture's definition of appropriate behavior
can be very different from another's, leading to erroneous assumptions about an
individual's personality that actually emerge from cultural norms.
■ Do not assume that the name of a test or scale accurately reflects the actual mean-
ing of the test score. Always read the test manual to determine the exact defini-
tion of the skill or construct being measured.
■ Where possible, use more than one test or scale to increase the accuracy of as-
sessment. Using more than one predictor increases the likelihood of correctly
predicting a client's outcome score. Using more than one personality assessment
provides more complete information about the trait under consideration, espe-
cially if the tests purport to measure the same construct.
■ Tests are not proven to be valid. The validity of a particular test score for use with
a particular client under the circumstances at hand is a judgment call made by
the professional counselor based on the amassed evidence supporting the test's
validity and defining its meaning. Because professional counselors should use
tests only with the intent of being helpful to the client, ask if this is the right test
for the right client for the right reasons.
THE APPLICATION OF VALIDITY:
DECISION MAKING USING TEST SCORES
The primary purpose behind administering psychological and educational tests is to
help make accurate decisions that will benefit clients and students. Psychometricians
and statisticians have developed a number of procedures for making decisions using
a single test and multiple tests.
Decision Making Using a Single Score
By definition, decision making using a single test is relegated to the realm of a screen-
ing procedure. There are three popular procedures for single-score decisions: deci-
sion theory, linear regression, and setting a cutoff score.
Decision theory
Decision theory (Anastasi & Urbina, 1997) involves the collection of a screening
test score and a criterion score, either at the same point in time (i.e., concurrent de-
cision) or at some point in the future (i.e., predictive decision). Some common ex-
amples of concurrent decisions would be virtually any clinical or diagnostic study in
which a screening test for a mental or emotional disorder (e.g., depression, anxiety,
Attention-Deficit/Hyperactivity Disorder [AD/HD], dementia) would be adminis-
tered concurrently with a clinical diagnosis from a qualified mental health profes-
sional (sometimes called diagnostic validity), or the administration of an academic
achievement test to a group of children and concurrent identification of low-per-
forming students or students "at risk" for academic failure by a teacher or diagnosti-
cian (sometimes called decision reliability). Examples of predictive decisions would
involve any of these previous examples, but with the criterion of diagnosis or deter-
mination of "at risk" status being collected months or years after the screening test
was administered. In this way, the screening test would be used to predict future
problems, usually allowing professional counselors and educators to put prevention
or early intervention programs in play to lower the incidence of future problems.
Whether used for concurrent or predictive purposes, the goal of the procedure is to
maximize the likelihood of accurate decisions (sometimes called hits) while minimiz-
ing inaccurate decisions (sometimes called misses or errors). Remember, the ultimate
purpose of a screening procedure is to identify clients or students in need of deeper-
level diagnostic assessment.
As an example of applying decision theory, assume that a professional counselor
has been asked to develop an accurate screening procedure to identify adults at risk
for depression. The professional counselor first explores the literature and selects a
published, efficient screening device for depression whose scores have previously
demonstrated sufficient reliability and validity for screening-level purposes. To deter-
mine the adequacy of the depression inventory for the requested service, the profes-
sional counselor arranges for each new adult referral to several area clinics to com-
plete the depression inventory and undergo a diagnostic evaluation with a qualified
mental health professional. Selection of the criterion is critical. It is often viewed as
the "gold standard" and should have the qualities of excellent score reliability and
validity. This diagnostic evaluation would normally serve to identify mental and
emotional disorders related to the clients' presenting problems and to aid in
establishing goals for counseling but, because of the study's focus, will also result in a
clinical determination regarding the degree of clinical depression in the clients on a
5-point scale (e.g., 1 = Absence of Depressive Symptoms, 2 = Slightly Depressed, 3 =
Mildly Depressed, 4 = Moderately Depressed, 5 = Severely Depressed). (Note:
Admittedly, the diagnosis of depressive disorders is complex; for the sake of this
example, the process has been simplified.) The professional counselor then collects two
pieces of data for each of the next 50 adult clients referred to the area clinics: (1) the
screening test score and (2) the clinical decision of the presence of clinical depression on
the 5-point scale. The results of these 50 participants are presented in Figure 4.1.
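The bookkeeping behind a decision-theory analysis can be sketched in a few lines. The cutoffs (test score of 20 or higher, clinical rating of 3 or higher) and the four quadrant counts are those shown in Figure 4.1; the function names are hypothetical.

```python
def classify(test_score, rating, test_cutoff=20, criterion_cutoff=3):
    """Place one client in a decision quadrant using the two cutoffs."""
    flagged = test_score >= test_cutoff        # screening test says "at risk"
    identified = rating >= criterion_cutoff    # clinician rating says "depressed"
    if flagged and identified:
        return "valid_acceptance"
    if flagged:
        return "false_acceptance"
    if identified:
        return "false_rejection"
    return "valid_rejection"

# Quadrant counts as labelled in Figure 4.1.
counts = {
    "valid_acceptance": 21,
    "false_rejection": 6,
    "valid_rejection": 20,
    "false_acceptance": 3,
}

def hit_rate(counts):
    """Proportion of all decisions that were accurate (hits)."""
    hits = counts["valid_acceptance"] + counts["valid_rejection"]
    return hits / sum(counts.values())

total = sum(counts.values())
rate = hit_rate(counts)
print(f"{total} clients, hit rate = {rate:.2f}")
```

Moving either cutoff changes which quadrant each client falls in, and so changes the balance between false rejections and false acceptances.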
[Figure 4.1 An application of decision theory using a criterion cutoff score of 3. Horizontal axis: score on the depression screening test (0 to 50), with the test score cutoff marked. Vertical axis: clinical rating (1 to 5), with the criterion cutoff separating Identified from Not Identified. Quadrants: I, Valid Acceptances (21); II, False Rejections (6); III, Valid Rejections (20); IV, False Acceptances (3).]

As can be seen in Figure 4.1, the distribution of scores is somewhat broad,
ranging from 0 to 50 on the depression screening test (0 indicates the Absence of
Depression; 50 is the highest score possible and indicates Severe Depression), and
from 1 to 5 on the clinical diagnostic rating (1 indicates the Absence of Depressive
Symptoms; 5 indicates Severe Depression). The professional counselor now needs to
use judgment in applying the decision-making model. How this judgment is applied
may vary and, as will be seen below, has implications for the accuracy of decisions
(i.e., the hit rate). One can see from Figure 4.1 that a criterion score cutoff line has
been placed at scores of 3 or higher, and a test score cutoff line at scores of 20 or
higher. The criterion cutoff [...]
have the teacher, mother, and father complete the respective versions
of the DBRS, then plug their scores into the regression formula. For Juanita,
assuming X₁ = 73, X₂ = 67, and X₃ = 55, the prediction formula would be: Y' =
1.21 + (0.031)(73) + (0.024)(67) + (0.017)(55) = 1.21 + 2.263 + 1.608 + 0.935 =
6.016. Thus Juanita would be identified as having fulfilled the diagnostic criteria for
AD/HD-PIT. For another, more distractible child, Nakita, presenting with scores of
X₁ = 78, X₂ = 88, and X₃ = 90, the prediction formula would be: Y' = 1.21 +
(0.031)(78) + (0.024)(88) + (0.017)(90) = 1.21 + 2.418 + 2.112 + 1.530 = 7.270.
Thus, Nakita would be identified as having fulfilled the diagnostic criteria for
AD/HD-PIT. For a third, less distractible child, Susanna, presenting with scores of
X₁ = 37, X₂ = 49, and X₃ = 40, the prediction formula would be: Y' = 1.21 +
(0.031)(37) + (0.024)(49) + (0.017)(40) = 1.21 + 1.147 + 1.176 + 0.680 = 4.213.
Thus, Susanna would not be identified as having fulfilled the diagnostic criteria for
AD/HD-PIT.
Table 4.2 T Scores on the DBRS Distractible Subscale for Three Students and Criterion Cutoff Scores
Student name            Teacher score (X₁)   Mother score (X₂)   Father score (X₃)   Decision
Juanita                 73                   67                  55*                 No
Nakita                  78                   88                  90                  Yes
Susanna                 37*                  49*                 40*                 No
Cutoff score required   65                   65                  65
Note: * designates a score falling below the required cutoff score of T = 65.
The primary advantage of the multiple regression technique is that it allows
some scores to compensate for other scores. For instance, while the results were not
in doubt in either Nakita's or Susanna's case, in Juanita's case, her father viewed her
level of distractibility to be more or less normal (T = 55), while her teacher's and
mother's scores were elevated (T = 73 and 67, respectively). These scores compen-
sated for the low score of the father and put Juanita in the "diagnose" category. A
primary disadvantage of the multiple regression technique is the necessity of labor-
intensive preliminary data collection, data analysis, and standard setting. It is a lot of
work to collect the several hundred protocols necessary to yield a reliable multiple re-
gression equation.
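The three worked predictions above can be reproduced with a few lines of Python. This is a minimal sketch: the intercept and weights are the ones used in the worked examples, while the function and variable names are illustrative assumptions. The decision cutoff applied to Y' is not stated in this passage, so the sketch reports only the predicted values.

```python
# Intercept and weights as they appear in the worked examples in the text.
INTERCEPT = 1.21
WEIGHTS = {"teacher": 0.031, "mother": 0.024, "father": 0.017}

# DBRS T scores for the three students discussed in the text.
students = {
    "Juanita": {"teacher": 73, "mother": 67, "father": 55},
    "Nakita": {"teacher": 78, "mother": 88, "father": 90},
    "Susanna": {"teacher": 37, "mother": 49, "father": 40},
}


def predict_criterion(scores):
    """Return Y' = intercept + sum of (weight * T score) over the three raters."""
    return INTERCEPT + sum(WEIGHTS[rater] * scores[rater] for rater in WEIGHTS)


# Round to three decimals to match the precision used in the text.
predictions = {name: round(predict_criterion(s), 3) for name, s in students.items()}

for name, y_prime in predictions.items():
    print(f"{name}: Y' = {y_prime:.3f}")
# Juanita: Y' = 6.016
# Nakita: Y' = 7.270
# Susanna: Y' = 4.213
```

Because every rating contributes to a single weighted sum, Juanita's elevated teacher and mother ratings can offset her father's lower rating, which is the compensatory property described above.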
Multiple cutoff method
The multiple cutoff method is far simpler to set up and implement than the multi-
ple regression procedure. Basically, multiple cutoff means that the professional coun-
selor must establish a minimally acceptable score on each measure under considera-
tion, then analyze the scores of a given client or student and determine whether each
of the scores meets the given criterion. Importantly, failure to meet even one of the
cutoff scores will eliminate an examinee from consideration. As an example, consider
the scores on the DBRS for the three girls, which are now presented in Table 4.2 for
ease of comparison.
The criterion score standard-setting decision is of critical importance in the
multiple cutoff technique because criterion scores set too low will overidentify indi-
viduals who do not have the condition, and criterion scores set too high will under-
identify individuals who do have the condition. In the context of this multiple cut-
off technique example, Nakita would be identified with AD/HD-PIT because each
of her T scores on the DBRS exceeded the minimum criterion T score of 65.
Likewise, Susanna would not be identified because none of her T scores on the
DBRS was high enough to warrant diagnosis. Interestingly, Juanita, who did qualify
under the multiple regression procedures explained in the preceding section, would
not be identified with AD/HD-PIT using these criterion scores because her father's
rating of her did not meet the specified criterion (i.e., his rating of Juanita was a T
score of 55, and a minimum score of 65 was required).
It is important to understand that multiple cutoff techniques use hard-and-fast
criteria, and violations are not allowed. Thus a low score on one test can effectively
eliminate someone from consideration; other scores are not allowed to compensate
for deficient scores, such as was the case in the multiple regression model. Therefore,
a less than optimal administration for any reason (e.g., low motivation, response bias,
faking bad or good) could result in a selection error. Because the multiple cutoff
method is easier to set up and manage than the multiple regression method, it is
more widely used. However, most clinicians use a third method, clinical judgment
and diagnosis using a test battery.
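The multiple cutoff rule is simple enough to express directly. The sketch below assumes that a score equal to the minimum T score of 65 counts as meeting the criterion; the helper names are hypothetical and the scores are those reported for the three girls.

```python
CUTOFF_T = 65  # minimum T score required on every rating (Table 4.2)

students = {
    "Juanita": {"teacher": 73, "mother": 67, "father": 55},
    "Nakita": {"teacher": 78, "mother": 88, "father": 90},
    "Susanna": {"teacher": 37, "mother": 49, "father": 40},
}


def failed_raters(scores, cutoff=CUTOFF_T):
    """List the raters whose T score falls below the cutoff."""
    return [rater for rater, t in scores.items() if t < cutoff]


def meets_all_cutoffs(scores, cutoff=CUTOFF_T):
    """A student is identified only if no rating falls below the cutoff."""
    return not failed_raters(scores, cutoff)


decisions = {name: meets_all_cutoffs(s) for name, s in students.items()}

for name, identified in decisions.items():
    missed = failed_raters(students[name])
    print(name, "Yes" if identified else "No", missed)
# Juanita No ['father']
# Nakita Yes []
# Susanna No ['teacher', 'mother', 'father']
```

Unlike the regression approach, no score can compensate for another here: a single rating below the cutoff, such as the father's rating of Juanita, is enough to change the decision.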
Think About It 4.3 How could you apply the multiple regression or
multiple cutoff models to a decision-making problem in your area of coun-
seling specialty?
Clinical judgment and diagnosis using a test battery
Clinical judgment relies on the experiences, information processing capability, the-
oretical frameworks, and reasoning ability of the professional counselor to make
sense out of sometimes-conflicting information, to arrive at a rational decision about
the disposition of a client or student. Clinical judgment is not a statistics-driven de-
cision-making method per se. Test results, interview information, behavioral obser-
vations, and other data are interpreted and integrated, leading to a reasoned judg-
ment or decision. Clinical decision making using a test battery can be a very complex
undertaking, depending on the presenting problem(s), and requires a good deal of
education, supervised training and experience, and analytical capabilities. It is also
subject to theoretical differences and examiner bias; that is, the same information
often leads to different conclusions based on a professional counselor's theoretical
orientation(s) and personal or professional biases. A clinical case of a young girl
evaluated for problems with distractibility is presented in Box 4.1 to demonstrate how
data can be interpreted and integrated so that a clinical decision can be made.
Box 4.1 Clinical Judgment Using a Battery of Tests:
Case Study of Nakita
Identifying Information
Name: Nakita
Chronological Age: 12 years, 2 months
Grade Placement: 6.6
Reason for Referral and Initial Case Conceptualization
Nakita was referred for psychoeducational evaluation by her mother. The
primary referral concerns were distractibility, difficulty understanding and/or
following directions, and poor school performance in the academic areas of
reading, science, and written expression. No significant emotional issues
were reported by the parents or school. Initially, this evaluator sought to ex-
plore the existence of a significant learning disorder in reading and writing
and significant degrees of inattention commonly associated with AD/HD. A
general emotional and behavioral screening was also undertaken to rule in or
rule out conditions that mask and mimic the symptoms of inattention, as
well as determine Nakita's general level of emotional adjustment.
Assessment Techniques
Because the referral concern was both behavior (inattention) and academic
(language arts, science), the examiner chose instruments that would be useful
in the identification of potential learning problems and behavior disorders,
such as AD/HD, and would also screen for emotional adjustment. The fol-
lowing assessments were intentionally selected at the outset of the evaluation:
■ Wechsler Intelligence Scale for Children–Fourth Edition (WISC-IV) (as an
intellectual assessment to establish an anchor score for expected achieve-
ment levels and to determine learning strengths and weaknesses)
■ Beery's Developmental Test of Visual-Motor Integration (VMI-3, Motor, and
Visual) (as a gross screen for visual perception, fine-motor coordination,
and visual-motor integration)
■ Woodcock-Johnson Tests of Achievement — Third Edition (WJ-III ACH) (to
establish achievement levels in the major academic subject areas and deter-
mine whether a learning disorder is evident)
■ Conners' Parent and Teacher Rating Scale–Revised: Long Versions (CPRS-R:L
and CTRS-R:L) (to screen for inattention and other behavioral/emotional
concerns)
■ Clinical interview (exploration of developmental history and clinical con-
ditions using structured protocols found in Appendixes A and C of Erford,
2006).
The following tests were also administered as a result of additional questions
and hypotheses that came up during the evaluation:
■ Test of Auditory Perceptual Skills-Revised (TAPS-R): Word Discrimination
and Auditory Processing subtests (to rule out auditory perceptual and pro-
cessing deficiencies)
■ Jebsen Writing Speed subtest (to assess for handwriting speed, sometimes
deficient in clients with fine-motor coordination and processing speed
difficulties)
■ Stanford-Binet Intelligence Scale–Fourth Edition: Memory for Sentences subtest
(to assess for language-loaded short-term auditory memory skills)
■ Wide Range Achievement Test — Third Revision (WRAT-3): Spelling subtest
(as a validating spelling test)
■ Slosson Written Expression Test (SWET) (for further exploration of writing
mechanics)
■ Visual Aural Digit Span Test (VADS) (for further exploration of short-term
auditory and visual memory difficulties)
Background Information
Clinical interviewing using a structured protocol and reports from the
teachers provided a wealth of helpful background information. Nakita is a
12-year, 2-month-old African American girl currently attending grade 6 at
XYZ Middle School. Her mother reports the primary concerns to be age-
inappropriate inattention and difficulty in the academic areas of language
arts and sciences. Nakita is reported to be easily distracted by the slightest
sound and easily frustrated. She is very artistic and enjoys drawing. She has
struggled with reading since the first grade. Currently, reading comprehen-
sion appears to be problematic, as well as understanding word problems in
math. Recently, Nakita has begun to struggle in science, and this difficulty
appears to result from a complex interaction of reading comprehension,
conceptual difficulties, and teaching style. Nakita also reportedly has diffi-
culty following multistep directions, although it is unclear whether this
difficulty is due to a lack of understanding or to a lack of motivation. She
has a wonderful sense of humor, but is becoming more temperamental
when it comes to academic tasks.
Previous group-administered testing indicated Average to High Average
school ability on the Otis-Lennon School Ability Test (OLSAT). Her 5th-grade
achievement testing indicated Average math achievement (46th percentile),
reading comprehension (58th percentile), and writing mechanics (30th per-
centile). Mr. Trig, Nakita's math and social studies teacher, is concerned
about Nakita's weak skill retention in math. Nakita reportedly needs a lot of
practice and relearning to keep her grades in the passing range. He also re-
ports that Nakita is very distractible and impulsive. Socially and emotionally,
Mr. Trig describes Nakita as a very pleasant and kind student who is always
smiling. Mrs. Bookworm, Nakita's language arts teacher, reports that Nakita
often becomes talkative and "clowns around" during inappropriate moments
in class — often when answering questions or presenting in front of the class.
Because of being behaviorally off-task, Nakita often misses important infor-
mation and displays inconsistent comprehension. Mrs. Bookworm also re-
ports that Nakita has a wonderful zeal for learning and a sense of humor that
often energizes classroom activities. She is a hard worker and frequently par-
ticipates in classroom discussions. She is also very loyal and supportive of
friends. Although Nakita struggles with higher-order thinking skills, compre-
hension, and writing mechanics, Mrs. Bookworm believes that she is a
bright, tenacious, and capable student.
Nakita attended XYZ Elementary from kindergarten through grade 5.
Reading has always been an area of academic difficulty. She has traditionally
displayed a poor sight-word vocabulary and reading comprehension. She has
not displayed letter-number reversals since grade 1. Nakita is currently
placed in the "low" math group, according to her mother. Her math calcula-
tion skills appear satisfactory, but Nakita is struggling with the story prob-
lems. Nakita's short-term memory (both auditory and visual) is reportedly
poor. Written language has also been an area of consistent difficulty. Her
spelling, capitalization, and punctuation skills are reportedly deficient. She
has excellent penmanship, and is a fast keyboarder. Nakita taught herself to
keyboard and is very proud of her ability in this regard.
Nakita's parents divorced five years ago. Nakita has an older sister who is
a very strong student. Nakita does engage in periodic day visits with her fa-
ther, but no overnight stays. Nakita's birth and developmental history was
normal, and she met all developmental milestones either on time or ahead of
time. Her medical history is unremarkable. Nakita is reportedly a happy, so-
ciable child. She is very outgoing and popular with peers. Her mother and
teachers report that Nakita's social and emotional development is within nor-
mal limits and not of primary concern at this time.
Maternal family history reportedly is negative for learning and emotional
problems. Her mother reports she was a straight-A student and not at all dis-
tractible. She completed one year of college and is currently employed in real
estate management. Nakita's birth father was not available for interview.
Nakita's mother reports seeing many similarities in learning styles between
Nakita and her father. She indicated that Nakita's father was a strong math
student, but struggled academically — although no specific details were pro-
vided. He did not finish high school and is currently a construction worker.
She indicated that Nakita's father enjoyed reading and was very artistic but
had poor writing skills. He reportedly had great difficulty focusing his atten-
tion on task and was easily distracted. A paternal grandmother reported that,
as a child, Nakita's father was very overactive. A paternal brother has been di-
agnosed with depression and, reportedly, is aggressive and possesses a temper.
Nakita's father also reportedly has difficulty controlling his temper.
The formal evaluation was conducted over two mornings in consecutive
weeks. Formalized evaluation centered on the areas of intellectual, percep-
tual, achievement, behavioral, and emotional development. Nakita was a
well-mannered child and was very cooperative during the evaluation.
Rapport was easily established, and she attempted all items presented to her.
Nakita displayed a quite high interest level throughout the evaluation. She
displayed no obvious physical or sensory deficits, nor did she appear anxious.
Therefore, the obtained results are considered to be an accurate representa-
tion of Nakita's current level of functioning. Her test results, briefly inter-
preted, are given in Tables 4.3 through 4.6.
Table 4.3 What Nakita's scores mean
Standard score   Scale score   T score   Interpretive range meaning
130+             16+           70+       Very Superior
120-129          14-15         63-69     Superior
110-119          12-13         57-62     High Average
90-109           9-11          43-56     Average
80-89            6-8           37-42     Low Average
70-79            4-5           30-36     Borderline
55-69            3             20-29     Mildly Deficient
40-54            2             10-19     Moderately Deficient
<40              0-1           <10       Severe and Profoundly Deficient

Wechsler Intelligence Scale for Children–Fourth Edition (WISC-IV)
Index                        IQ; Range      Percentile rank; Range   Interpretive range
Verbal Comprehension Index   119; 111-125   90; 77-95                High Average to Superior
Perceptual Reasoning Index   117; 108-123   87; 70-94                Average to Superior
Working Memory Index         74; 68-84      4; 2-14                  Mildly Deficient to Low Average
Processing Speed Index       75; 69-87      5; 2-19                  Mildly Deficient to Low Average
Full Scale IQ                100; 95-105    50; 37-63                Average

Verbal Comprehension Index subtests     Perceptual Reasoning Index subtests
Similarities                 14 S*      Block Design                 11
Vocabulary                   12         Picture Concepts             13
Comprehension                14 S       Matrix Reasoning             14 S

Working Memory Index subtests           Processing Speed Index subtests
Digit Span                   5 W*       Coding                       5 W
Letter-Number Sequencing     5 W        Symbol Search                6 W
Note: * S = Intrapersonal strength; W = Intrapersonal weakness.

Nakita was administered the Wechsler Intelligence Scale for Children–Fourth
Edition (WISC-IV) to establish a level of expectation for scholastic
achievement and identify her learning strengths and weaknesses. Nakita's
Verbal Comprehension Index (VCI) score was measured to lie in the High
Average to Superior range (percentile rank = 90; percentile rank range =
77-95), commensurate with her Perceptual Reasoning Index (PRI) score,
which fell in the Average to Superior range (percentile rank = 87; percentile
rank range = 70-94). While Nakita currently performs in the Average range
of general cognitive ability (Full Scale percentile rank = 50; percentile rank
range = 37-63), her true educational potential is probably much closer to
her VCI and PRI capabilities (standard score of approximately 118; High
Average to Superior capabilities), and it is this score that will serve as the an-
chor score for determining intrapersonal weaknesses and achievement areas
in need of improvement. Nakita's Working Memory Index (WMI) score fell
in the Mildly Deficient to Low Average range (percentile rank = 4; percentile
rank range = 2-14), as did her Processing Speed Index (percentile rank = 5;
percentile rank range = 2-19). Both the WMI and PSI were significantly
below current ability estimates and are considered significant intrapersonal
weaknesses. Subtest analysis indicates that Nakita displayed intrapersonal
strengths on tasks requiring verbal abstract reasoning (Similarities subtest
percentile rank = 90); social comprehension (Comprehension subtest per-
centile rank = 90); and visual analogical reasoning (Matrix Reasoning subtest
percentile rank = 90). Significant intrapersonal weaknesses were noted on
tasks requiring short-term auditory recall (Digit Span subtest percentile
rank = 5); recall and organization of auditory stimuli (Letter-Number
Sequencing percentile rank = 10); short-term visual recall and psychomotor
speed (Coding subtest percentile rank = 5); and speed in processing visual
information (Symbol Search subtest percentile rank = 10). Thus Nakita
presents as a bright child with potential weaknesses in processing speed and
in short-term auditory and visual memory.
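The percentile ranks reported for the WISC-IV index scores can be approximated from the standard scores themselves. The sketch below is an illustration under a stated assumption: index scores are treated as normally distributed with a mean of 100 and a standard deviation of 15. Published tests report percentile ranks from their own norm tables, so small differences from a normal-curve approximation are possible. The function name is hypothetical.

```python
from statistics import NormalDist

# Assumption: standard scores follow a normal curve with M = 100 and SD = 15.
STANDARD_SCORE_DIST = NormalDist(mu=100, sigma=15)


def percentile_rank(standard_score):
    """Approximate percentage of the norm group scoring at or below this score."""
    return 100 * STANDARD_SCORE_DIST.cdf(standard_score)


# Nakita's WISC-IV index scores as reported in the text.
wisc_indexes = {"VCI": 119, "PRI": 117, "WMI": 74, "PSI": 75, "FSIQ": 100}

ranks = {name: round(percentile_rank(ss)) for name, ss in wisc_indexes.items()}
print(ranks)
# {'VCI': 90, 'PRI': 87, 'WMI': 4, 'PSI': 5, 'FSIQ': 50}
```

Under this assumption the computed values agree with the percentile ranks reported for each index.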
Stanford-Binet Intelligence Scale–Fourth Edition: Memory for Sentences subtest
Standard Score = 92   Percentile Rank = 29
Test of Auditory Perceptual Skills-Revised (TAPS-R)
Auditory Word Discrimination subtest   Scaled Score = 11   Percentile Rank = 63
Auditory Processing subtest            Scaled Score = 12   Percentile Rank = 75
Because a presenting concern had to do with Nakita's ability to under-
stand directions, it was important to explore the possible existence of a lan-
guage processing disorder and central auditory processing disorder. The
above-mentioned WISC-IV VCI subtest results do not support the existence
of a language processing disorder because they all fell in the above-average
ranges. To rule out the existence of a central auditory processing disorder,
two subtests from the Test of Auditory Perceptual Skills-R were administered.
Nakita performed in an Average to High Average capacity on each subtest.
She scored at the 63rd percentile rank on a task requiring auditory word dis-
crimination and the 75th percentile rank on a task purporting to measure
auditory processing. Thus little support was garnered for the existence of a
central auditory processing disorder.
To further assess Nakita's short-term auditory recall, the Memory for
Sentences subtest of the Stanford-Binet Intelligence Scale — Fourth Edition was
administered. Nakita performed at the 29th percentile on this task, commen-
surate with WMI estimates and significantly below intellectual estimates.
Visual Aural Digit Span Test (VADS)
Visual Memory     10th percentile
Auditory Memory   25th percentile
Next, the VADS was administered to validate weaknesses in short-term
auditory and visual memory observed during administration of the WISC-IV.
On this administration of the VADS, Nakita scored at the 10th and 25th
percentiles on the visual and auditory memory components, respectively.
Both performances were significantly below expected levels and validate the
weaknesses observed during administration of the WISC-IV. Thus the exis-
tence of significant distractibility in the auditory and visual channels remains
as a primary explanation for Nakita's difficulty in successfully performing in
class and carrying out multistep directions.
Notice that each "hypothesis" generated from the presenting problem is
being systematically explored through clinical interviewing and results from
selected tests.
Jebsen Writing Speed Subtest
Trial 1 = 22 seconds (approximately the 15th percentile)
Trial 2 = 23 seconds (approximately the 15th percentile)
To validate the apparent weakness in processing speed, the Jebsen Writing
Speed subtest was administered and resulted in deficient writing speed
performances. The 15th percentile is approximately one standard deviation below the
mean, indicating that about 85 percent of same-aged girls can write faster than
Nakita. This slow motor speed was commensurate with the deficient
Processing Speed Index scores reported above. These results are extraordinar-
ily important when trying to understand the academic difficulties that
Nakita is currently facing. These results indicate that Nakita's processing and
writing speed are substantially slower than expected for a child of her ability.
This is likely to be evidenced in the classroom through slower writing, note-
taking, and task completion speeds.
Test of Visual Motor Integration
VMI      Standard Score = 120   Percentile Rank = 91
Visual   Standard Score = 122   Percentile Rank = 93
Motor    Standard Score = 90    Percentile Rank = 25
Nakita's performance on Beery's Developmental Test of Visual-Motor
Integration — Third Edition (VMI-3) exceeded that of 91% of other children
her age, falling in the High Average to Superior range of performance. This
edition of the Beery also allows exploration of visual-perceptual and motor ca-
pabilities. Nakita's fine-motor coordination performance exceeded only 25%
of age-mates (Low Average to Average), while her performance on the visual-
perceptual task of the VMI-3 was High Average to Superior (93rd percentile
rank). Altogether, Nakita's visual-motor and visual discrimination capabilities
appear well developed at this time, actually exceeding current intellectual abil-
ity estimates. However, her fine-motor coordination is poorly developed.
In an effort to explore Nakita's current educational achievement and de-
termine whether significant learning problems are occurring in the areas of
reading and writing, selected subtests of the Woodcock-Johnson: Tests of
Achievement — Third Edition (WJ-III), the Wide-Range Achievement Test —
Third Edition (WRAT-3), and the Slosson Written Expression Test (SWET)
were administered.
Table 4.4 Woodcock-Johnson Tests of Achievement-Third Edition (WJ-III) (Conversions based on age norms)
Subtest                 Standard score; Range   Percentile rank; Range   Interpretive range
Word identification     105; 97-113             64; 41-80                Average to High Average
Passage comprehension   114; 102-126            83; 56-96                Average to Superior
Reading fluency         90; 85-95               25; 16-37                Low Average to Average
Math calculation        103; 93-113             59; 32-80                Average to High Average
Applied problems        96; 84-108              39; 15-70                Low Average to Average
Math fluency            78; 74-82               8; 4-12                  Borderline to Low Average
Spelling                86; 76-96               18; 6-39                 Borderline to Average
Writing samples         111; 97-125             77; 43-95                Average to Superior
Writing fluency         88; 83-93               21; 17-32                Low Average to Average
Table 4.5 Slosson Written Expression Test (SWET)
Subscale/Scale                 Scaled/Standard score   Percentile rank   Interpretive range
Writing maturity               100; 90-110             50; 25-75         Average
  Type-token Ratio             11; 9-13                63; 37-84         Average to Above Average
  Av. Sentence Length          9; 6-12                 37; 10-75         Below Average to Average
Writing mechanics              81; 76-90               10; 5-25          Deficient to Average
  Spelling                     7; 5-9                  16; 5-37          Deficient to Average
  Capitalization               6; 4-8                  10; 1-25          Very Deficient to Average
  Punctuation                  8; 6-10                 25; 10-50         Below Average to Average
Written expression total SS*   89; 83-97               23; 13-42         Below Average to Average
Note: * SS = Standard Score (M = 100; SD = 15)
Wide Range Achievement Test–Third Revision
Spelling Subtest Standard Score = 88 Percentile Rank = 21
Nakita was administered the Woodcock-Johnson Tests of Achievement–Third
Edition (WJ-III) to explore reported weaknesses in language arts content
areas. On the tests of reading, some task variability was noted as her passage
comprehension skills (percentile rank range = 56-96; Average to
Superior) were slightly better developed than her sight-word vocabulary
(percentile rank range = 41-80; Average to High Average). Both of these areas
were commensurate with current ability estimates. However, her reading
fluency was significantly below expected levels given current ability estimates
(percentile rank range = 16-37; Low Average to Average). Reading fluency is
a function of processing speed, reading speed, and attentional control, and
this performance represented a 28-point discrepancy below ability.
In mathematics, Nakita's calculation skills were Average to High Average,
exceeding approximately 59% of age-mates (percentile rank range = 32-80),
while her problem-solving capabilities were Low Average to Average,
exceeding approximately 39% of age-mates (percentile rank range = 15-70). Her
math problem-solving skills were slightly to significantly below current
ability estimates (a 22-standard-score-point discrepancy). However, her math
fluency score was very significantly below expected levels given current
ability estimates (percentile rank range = 4-12; Borderline to Low Average).
Math fluency is a function of processing speed, computational speed, and at-
tentional control, and this performance represented a 40-standard-score-
point discrepancy below ability.
Nakita's written expression in context (Writing Samples subtest) was
significantly better developed (percentile rank range = 45-95; Average to
Superior) than her spelling skills in isolation (Spelling subtest percentile rank
range = 6-39; Borderline to Average). Her written expression was commen-
surate with ability estimates, while her spelling skills were significantly defi-
cient. The WRAT-3 Spelling subtest was administered to further explore
Nakita's spelling skills and she performed at the 21st percentile (Low Average
to Average), confirming deficient spelling skills. She appears to struggle sub-
stantially with nonconventional spelling patterns. Interestingly, her Writing
Samples responses were frequently inappropriately punctuated and capitalized
and consisted of simple vocabulary and sentence structure. To
further explore the nature of suggested writing difficulties, the Slosson
Written Expression Test (SWET) (Hofler, Erford, & Amoriell, 2001) was
administered. The SWET requires the student to compose a story about a
picture cue, and the product is scored for writing maturity and mechanics. On
this administration of the SWET, Nakita's Writing Maturity Index was
slightly below expected levels, but her Writing Mechanics Index was 37 stan-
dard-score points below current ability estimates. Importantly, her capitaliza-
tion, punctuation, and spelling were consistently poorly developed. Thus, a
Disorder of Written Expression (mechanics) is evident to a significant de-
gree. In addition, Nakita's Writing Fluency subtest score was very signifi-
cantly deficient in comparison with current ability estimates (percentile rank
range = 17-32; Low Average to Average). Writing fluency is a function of
processing speed and attentional control, and this performance represented a
30-point discrepancy below ability.
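The discrepancies cited in the preceding paragraphs all follow from one subtraction: the anchor score of 118 minus the obtained standard score. A minimal sketch, using the standard scores reported in Tables 4.4 and 4.5 (the variable names are illustrative):

```python
ANCHOR_SCORE = 118  # ability estimate derived from the VCI and PRI, per the text

# Standard scores from the WJ-III (Table 4.4) and the SWET (Table 4.5).
achievement_scores = {
    "WJ-III Reading fluency": 90,
    "WJ-III Applied problems": 96,
    "WJ-III Math fluency": 78,
    "WJ-III Writing fluency": 88,
    "SWET Writing mechanics": 81,
}

# Discrepancy = how many standard-score points achievement falls below ability.
discrepancies = {
    name: ANCHOR_SCORE - score for name, score in achievement_scores.items()
}

for name, points in discrepancies.items():
    print(f"{name}: {points} points below the anchor score")
# WJ-III Reading fluency: 28 points below the anchor score
# WJ-III Applied problems: 22 points below the anchor score
# WJ-III Math fluency: 40 points below the anchor score
# WJ-III Writing fluency: 30 points below the anchor score
# SWET Writing mechanics: 37 points below the anchor score
```

How large a discrepancy must be to count as significant is a clinical and psychometric judgment that this sketch does not attempt to model.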
Because a referral question was whether Nakita possessed significant
problems with inattention, clinical and behavioral assessments focused on
the presence of age- and ability-inappropriate levels of distractibility, primary
symptoms of an Attention-Deficit/Hyperactivity Disorder (AD/HD).
Mr. Trig and Mrs. Bookworm, teachers who have instructed Nakita and
who are well acquainted with her academic and behavioral performance,
completed the Conners' Teacher Rating Scale–Revised, Long Version
(CTRS-R:L). Nakita's mother completed the Conners' Parent Rating Scale–Revised,
Long Version (CPRS-R:L). All respondents indicated substantial concerns
regarding Nakita's inattentive behaviors, reporting that Nakita frequently
avoids engaging in tasks requiring sustained mental effort; fails to give close
attention to details; has difficulty sustaining attention on tasks; is easily
distracted by sights and sounds; loses things needed for tasks; and has difficulty
concentrating. Each of these items loads heavily on inattention, a core
component of AD/HD, Predominantly Inattentive Type. All other personality
and behavioral functioning was reported to be well within normal limits.

Table 4.6 Nakita's results from the Conners' Rating Scales–Revised
                              CPRS-R:L          CTRS-R:L   CTRS-R:L
Scale (T scores)              Nakita's mother   Mr. Trig   Mrs. Bookworm
A. Oppositional               61                46         50
B. Cognitive Problems         67*               76*        74*
C. Hyperactivity              49                54         48
D. Anxious/shy                49                46         46
E. Perfectionism              42                41         49
F. Social Problems            45                46         46
G. Psychosomatic              51
L. DSM-IV: Inattentive        67*               76*        68*
M. DSM-IV: Hyper-Impulsive    47                55         46
Note: * designates a score falling above the required cutoff score of T = 65.
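The agreement among the three respondents in Table 4.6 can be checked mechanically. The sketch below assumes that a T score at or above 65 counts as elevated (no reported score equals 65 exactly, so this choice does not affect the result); the data structure and function names are hypothetical.

```python
CUTOFF_T = 65  # cutoff from the note to Table 4.6

# T scores from Table 4.6; the Psychosomatic scale appears only for the parent form.
ratings = {
    "Mother (CPRS-R:L)": {
        "A. Oppositional": 61, "B. Cognitive Problems": 67, "C. Hyperactivity": 49,
        "D. Anxious/shy": 49, "E. Perfectionism": 42, "F. Social Problems": 45,
        "G. Psychosomatic": 51, "L. DSM-IV: Inattentive": 67,
        "M. DSM-IV: Hyper-Impulsive": 47,
    },
    "Mr. Trig (CTRS-R:L)": {
        "A. Oppositional": 46, "B. Cognitive Problems": 76, "C. Hyperactivity": 54,
        "D. Anxious/shy": 46, "E. Perfectionism": 41, "F. Social Problems": 46,
        "L. DSM-IV: Inattentive": 76, "M. DSM-IV: Hyper-Impulsive": 55,
    },
    "Mrs. Bookworm (CTRS-R:L)": {
        "A. Oppositional": 50, "B. Cognitive Problems": 74, "C. Hyperactivity": 48,
        "D. Anxious/shy": 46, "E. Perfectionism": 49, "F. Social Problems": 46,
        "L. DSM-IV: Inattentive": 68, "M. DSM-IV: Hyper-Impulsive": 46,
    },
}


def elevated_scales(scale_scores, cutoff=CUTOFF_T):
    """Return the set of scales whose T score reaches the cutoff."""
    return {scale for scale, t in scale_scores.items() if t >= cutoff}


# Scales that every respondent rated at or above the cutoff.
flagged_by_all = set.intersection(*(elevated_scales(s) for s in ratings.values()))
print(sorted(flagged_by_all))
# ['B. Cognitive Problems', 'L. DSM-IV: Inattentive']
```

Only the two inattention-related scales are elevated for every respondent, which matches the narrative summary that all other functioning was rated within normal limits.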
A clinical interview involving both Nakita and her mother confirmed
much of the evidence substantiating a mild to moderate attentional defi-
ciency without the associated hyperactive features. However, because it has
been well documented in research literature that myriad conditions exist that
mask and/or mimic the symptoms associated with AD/HD, an exhaustive
interview was conducted to rule out more than two dozen clinical and cogni-
tive disorders that often lead to misdiagnosis (see Appendix C of Erford,
2006). Upon concluding this interview, Nakita was determined to not dis-
play substantial symptoms associated with disruptive behavior, anxiety, or
depressive disorders. No medical history of lead poisoning, hyperthyroidism,
or allergies in Nakita or family members was reported. Nakita does not ex-
hibit a visual or auditory processing disorder, and her history is reportedly
negative for physical or sexual abuse and abuse of alcohol or other drugs. No
tic or seizure disorders, hallucinations, or delusions were reported or evi-
denced, and Nakita displayed a history of positive social relationships and in-
teractions. Thus myriad conditions shown to mask and/or mimic AD/HD
were ruled out.
In conclusion, behavior rating scales, cognitive-perceptual information,
and clinical interview confirm that Nakita lullills the diagnostic criteria for
Table 4.7 DSM-IV-TR diagnostic summary for Nakita
Axis I — 314.00 AD/HD — Predominantly Inattentive Type
315.4 — Developmental Coordination Disorder (fine-motor)
315.2 — Disorder of Written Expression (mechanics, spelling)
315.9 — Learning Disorder — NOS (processing speed)
Axis II — None
Axis III — None
Axis IV — Academic/testing problems
Axis V — Global Assessment of Functioning (GAF) (current) = 69
AD/HD — Predominantly Inattentive Type, Developmental Coordination
Disorder, and Disorder of Written Expression. These conditions are
presently mild to moderate in severity and are affecting her schoolwork
production and performance. Also of concern is a deficiency in processing
speed [Learning Disorder — Not Otherwise Specified (NOS)] that ad-
versely impacts her motivation to engage in written expression and other
academic activities and affects the quality of written expression and other
academic output.
Final Conceptualization and Recommendations
Nakita is a 12-year-old girl currently attending grade 6 at XYZ Middle
School. She currently performs in the Average range of general intellectual
ability, but her VCI and PRI index scores indicate her intellectual capabilities
are much higher (deviation IQ estimate = 118). Deficiencies in processing
speed and short-term auditory and visual recall were noted. A significant
achievement deficiency was noted in written expression and spelling
(Disorder of Written Expression). This inconsistency is often apparent in
children with deficient processing speed because the speed of their written
expression cannot keep up with the flow of ideas they are trying to commu-
nicate. Frequently inattentive and disorganized, Nakita fulfills the diagnostic
criteria for Attention-Deficit/Hyperactivity Disorder — Predominantly
Inattentive Type. In addition, Nakita displays a Developmental
Coordination Disorder (fine motor). At this time, the extent of these disor-
ders appears mild to moderate in severity and affects Nakita's schoolwork
production and performance.
The following recommendations are offered:
1. Nakita's mother is encouraged to share the results of this evaluation with
Nakita's physician and to seek the physician's guidance in developing a
treatment plan that addresses Nakita's inattentiveness and disorganization.
2. Nakita may benefit from short-term remedial tutoring in written expres-
sion and mechanics. In particular, this course of action should address a re-
view of written-language mechanics rules (punctuation, capitalization, and
grammar in context), as well as composition construction strategies and
skills.
3. Nakita can be helped to better understand task directions when she and
her teachers and parents break down multistep directions into a sequence
of ordered steps. It will help to:
■ Write them down and number the steps so Nakita can complete the
steps one at a time.
■ Have Nakita check with an adult after completing each step and be-
fore moving on to the next step. She currently is experiencing a good
deal of frustration by making mistakes and misunderstanding direc-
tions in the early steps of a multistep task. Having an adult check her
progress at each step before moving on will help eliminate some of this
frustration.
■ Be sure Nakita is on the right track when beginning the assignment.
■ Give an example of what she is to do.
■ Check her progress frequently.
■ Have Nakita rephrase directions in her own words to be sure she un-
derstands them.
■ Have a well-organized student help Nakita transition from step to step.
■ Have Nakita do two or three examples under the supervision of a
teacher, parent, or student helper to be sure she understands the
process before beginning to complete items independently.
■ Make sure multistep directions are written down, whether on the paper,
a chalkboard, or an index card.
4. Classroom and home-study modifications that may facilitate Nakita's aca-
demic performance include:
■ Consider creating compositions with a written outline and verbally
constructing the composition on audiotape. A transcription of the au-
diotape can then be made, embellished on, and proofread. This proce-
dure will capitalize on Nakita's verbal strengths and minimize the frus-
tration that ensues by her forgetting good ideas when trying to
construct compositions from memory.
■ Encourage Nakita to further develop keyboarding skills to facilitate her
typing. She should strive to type at a rate of greater than 40 words per
minute by the beginning of her 9th-grade year.
■ Allow Nakita to compose compositions and other written work on a
word processor. She should immediately begin to type and edit her
written work using the word processor.
■ Cut back repetitive homework assignments beyond the point of
mastery.
■ Give Nakita preferential seating near the primary area of instruction,
with her back facing any distracting students or stimuli.
■ Surround her with focused role models who will not distract her and
who will not allow Nakita to distract them.
5. Classroom and home-study modifications that may facilitate Nakita's
behavioral and work habit adjustments include:
■ A daily assignment notebook that allows daily or at least weekly com-
munication between the parents and teachers.
■ Praise and encouragement that emphasize Nakita's accomplishments
and successes (no matter how small).
■ Brief verbal reprimands addressing behaviors, not perceived motiva-
tions, followed by praise and encouragement for successes.
■ Behavioral contracts that identify specific academic and behavioral
goals.
■ The use of a timer to break assignments into smaller time units of more
intense focus. For a preteenager with a short attention span, timed
units should not generally exceed 15 to 30 minutes. After a short break
with plenty of performance feedback and encouragement, as well as
some physical movement or exercise, the next timed task can ensue.
■ Appropriate home and school study spaces, with set times, no distrac-
tions, and a recognized routine.
6. Treatment of AD/HD can be addressed best through a combination of:
■ Parent, teacher, and student education on the nature and treatment of
AD/HD.
■ Behavior modification to address educational and behavioral issues.
■ Educational modifications to make Nakita more successful in the
classroom.
■ Medical intervention as determined by Nakita's attending physician.
7. Because of Nakita's slow processing speed, she will benefit from extra time
given to complete standardized tests, particularly timed, group-adminis-
tered tests of achievement. Extra time should also be given, as needed, on
in-school tests so that Nakita's grades will reflect mastery of content, rather
than suppression due to time constraints.
A primary strength of using clinical judgment is the flexibility it affords the de-
cision maker. A seasoned examiner will be quick to admit that the various data ac-
cumulated during an evaluation do not always agree. There are times when two tests
purporting to measure a similar construct may yield dissimilar results. There are
times when teachers, spouses, mothers, and fathers who are asked the same set of
questions about the same client will vary widely in their responses, sometimes due to
varying perceptions, response bias, or the intent to deceive. In fact, it is more often
the case that some data do conflict, thus requiring great skill and judgment on the
part of the examiner to realize what to focus on and what not to focus on. In these
instances, clinical judgment is indispensable as a tool for reconciling divergent
information from diverse data sources. However, this same flexibility can also lead to
examiner bias and a decision-making process that lacks reliability (i.e., consistency) and
usefulness. Indeed, some have demonstrated that statistical models, compared to
clinical judgment models, lead to more reliable and valid decisions.
Table 4.8 Example of a multiple regression/multiple cutoff hybrid decision-making model
In the following scenario, a decision must be made to select the three most qualified applicants.
$X_1$, $X_2$, and $X_3$ are the scores on the selection tests. $Y'$ is the predicted criterion score
based on the multiple regression equation $Y' = a + b_1X_1 + b_2X_2 + b_3X_3$. The minimum cutoff
scores for each selection variable are $X_1 = 20$, $X_2 = 15$, and $X_3 = 25$. The "All met" column
indicates whether the client's scores on each of the selection tests ($X_1$, $X_2$, and $X_3$) met or
exceeded the minimum cut score and, therefore, can be considered for final selection. "Final rank"
indicates the final ranked position of the "surviving" candidates. The top three ranked candidates
(marked by an asterisk) will be deemed most qualified and offered the positions. (Note that
Candidate H was selected even though Candidate J had a higher $Y'$, because $X_2$ for Candidate J was
below the minimum criterion, effectively eliminating Candidate J from consideration.)

Participant   $X_1$   $X_2$   $X_3$   $Y'$     All met   Final rank
A             22      20      20      22.65    Yes       5
B             18      19      25      23.17    No        X
C             14      12      21      18.74    No        X
D             20      15      25      22.99    Yes       4
E             24      19      30      27.20    Yes       1*
F             17      16      25      22.21    No        X
G             22      19      28      25.72    Yes       2*
H             21      18      25      23.95    Yes       3*
I             11      11      12      13.85    No        X
J             23      14      28      25.00    No        X

Note: * = the top three ranked candidates.
Combining decision-making models
Sometimes a combination of these three methods can lead to greater accuracy. For
example, strict adherence to a multiple cutoff method may at times be softened by
clinical judgment that takes into account a client's or student's extenuating circum-
stances — circumstances not accounted for by the multiple cutoff method, but
nonetheless important. This happens frequently with educational decisions (e.g.,
grade retentions, exceptions to course requirements, college admission and scholarship
applications) and clinical decisions (e.g., use of the designation "Not Otherwise
Specified"). Alternatively, multiple regression and multiple cutoff methods can be
used in conjunction to select the "cream of the crop" in a two-stage process. Stage 1
involves applying the multiple regression equation to client scores and rank ordering
the client's scores according to the magnitude of the client's predicted criterion score
($Y'$). Stage 2 involves standard setting to determine the minimal cutoff for each
selection test score and then applying these multiple cutoff criteria to the same scores
analyzed in stage 1. This process, an example of which is provided in Table 4.8,
may eliminate from final selection some individuals who benefited from the multiple
regression process, which allows high scores on some tests to compensate for low scores
on others. Such a procedure is particularly helpful when the cost of selecting an
unqualified person may be too prohibitive or the risk of failure too detrimental. Of
course, any decision-making method has strengths and weaknesses, and will virtually
never be foolproof. Selection of an appropriate decision-making model must be un-
dertaken with great care to ensure the rights and protection of clients and students.
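The two-stage process described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the class and function names (Candidate, meets_all_cutoffs, select) are hypothetical, the predicted scores are taken as given rather than computed from a regression equation, and the data are a subset of the rows in Table 4.8.

```python
from dataclasses import dataclass

# Minimum cutoff scores for each selection test, as stated for Table 4.8.
CUTOFFS = {"x1": 20, "x2": 15, "x3": 25}


@dataclass(frozen=True)
class Candidate:
    name: str
    x1: float
    x2: float
    x3: float
    y_pred: float  # predicted criterion score Y' from the regression equation


def meets_all_cutoffs(c: Candidate) -> bool:
    """Stage 2 check: every test score must meet or exceed its cutoff."""
    return (
        c.x1 >= CUTOFFS["x1"]
        and c.x2 >= CUTOFFS["x2"]
        and c.x3 >= CUTOFFS["x3"]
    )


def select(candidates, n_positions):
    """Rank by Y' (stage 1), drop anyone below a cutoff (stage 2), keep the top n."""
    ranked = sorted(candidates, key=lambda c: c.y_pred, reverse=True)
    survivors = [c for c in ranked if meets_all_cutoffs(c)]
    return survivors[:n_positions]


# A subset of the rows in Table 4.8 (scores and Y' copied from the table).
candidates = [
    Candidate("B", 18, 19, 25, 23.17),
    Candidate("D", 20, 15, 25, 22.99),
    Candidate("E", 24, 19, 30, 27.20),
    Candidate("G", 22, 19, 28, 25.72),
    Candidate("H", 21, 18, 25, 23.95),
    Candidate("J", 23, 14, 28, 25.00),
]

selected = [c.name for c in select(candidates, 3)]
print(selected)  # ['E', 'G', 'H']
```

Candidate J ranks third on the predicted score alone but is removed in stage 2 because the second test score (14) falls below its cutoff of 15, so Candidate H takes the third position.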
SUMMARY/CONCLUSION
Test validity is about whether or not (and to what degree) a test score measures what
it claims to measure. Validity is closely related to, and dependent on, test reliability.
Evidence for test score validity is determined in several ways. Content validity
considers the degree to which a test adequately represents the breadth of content of the
domain being examined. Criterion-related validity correlates scores on the predictor
variable (test score) with those on the criterion or outcome measure. Criterion-related
validity may be predictive, in which the predictor and criterion measures are
gathered at different times, or concurrent, in which both are gathered at the same
time. A prediction equation can be used to predict a person's score on the outcome
measure from the individual's score on the predictor variable. The standard error of
estimate indicates the degree of accuracy of predictions. Construct validity uses
convergent and discriminant forms of validity assessment. Convergent construct validity
is established by showing high correlations between the new test and other established
measures of the same or similar constructs. Discriminant construct validity is
evidenced by low correlations between the new test and measures of unrelated
constructs. It is important to establish a clear definition of the construct in order to
know what the test score means. Finally, the use of a given test is based on informed
judgment made by a competent counselor for the benefit of the client.
Decision making using a single test score is generally done through one of three
processes: setting a cutoff score, linear regression, and application of decision theory.
Decision making using multiple tests frequently makes use of clinical judgment,
multiple cutoff, or multiple regression methods. Each of these methods has strengths
and weaknesses, and each requires varying degrees of expertise and sophistication.
Most professional counselors use clinical judgment methods based on a theoretical
framework and previous experience.
KEY TERMS
clinical judgment
concurrent criterion-related validity
construct
construct validity
content-related validity
convergent validity
criterion
criterion-related validity
cutoff scores
decision theory
discriminant validity
domain
face validity
false acceptance
false rejection
intercept
linear regression
multiple cutoff method
multiple regression
negative predictive power
positive predictive power
predictive criterion-related validity
restricted range
sensitivity
slope
specificity
standard error of estimate
total predictive value
valid acceptance
valid rejection
validity
CHAPTER 5
Selecting, Administering, Scoring, and Interpreting
Assessment Instruments and Techniques
by R. Anthony Doggett, Carl J. Sheperis, Susan Eaves,
Michael D. Mong, and Bradley T. Erford
This chapter begins with issues related to proper test selection, administration,
and scoring, followed by discussion of proper interpretation of test scores from
both norm-referenced and criterion-referenced tests. A section regarding the
appropriate sources for obtaining information about assessment instruments has
been included to assist the reader in proper test selection. Finally, common errors
committed during the assessment process are discussed, along with
recommendations for addressing these issues.
TEST SELECTION
Appropriate test selection is crucial in the assessment process. Before selecting in-
struments, the professional counselor must first determine the purpose for engaging
in assessment activities. As discussed in Chapter 1, sometimes clinicians administer
different tests to determine if the individual meets criteria for a particular diagnosis,
to develop interventions or treatments for clients, to evaluate the integrity of services,
or to evaluate the outcome of receiving treatment. In any of these cases, the profes-
sional counselor must ensure that the instrument being used is adequate for the
stated purpose of the assessment. As such, the instrument must be normed (or
criterion-referenced) on a representative population, contain items that are appro-
priate for evaluating the current referral concern, have adequate psychometric prop-
erties, and provide scores that lend themselves to appropriate outcome comparisons.
Choosing instruments that are not linked to the original purpose of the assessment,
lack technical adequacy, or are not appropriate for the referred problem or individ-
ual will reduce the professional counselor's ability to meet the client's needs and
could potentially expose the client to harmful and unwarranted experiences. Table
5.1 offers summary suggestions that professional counselors should consider when
selecting an instrument.
TEST ADMINISTRATION
After determining the purpose of the assessment, the professional counselor must
determine the best way to obtain the information needed from the client. While as-
sessments are designed to yield meaningful information about a client, the quality of
the information obtained is closely linked to the skills and abilities of the clinician
administering the test.
Administrator Requirements
It is important to mention that each assessment instrument requires a certain level
of training and/or education by the administrator. In other words, legally and ethi-
cally, professional counselors can select instruments only from the category of instru-
ments available for use according to their level of training. In clinical practice, these
requirements are often determined by state licensure laws. In schools and agency
work, state certifications or exemptions often exist that allow examiners to adminis-
ter tests they would otherwise be unable to use in the private sector. For example,
unlicensed professional counselors working in a correctional institution or for a
nonprofit agency may be allowed to administer clinical or intelligence tests as a
condition of employment. But, because they are not licensed by a state counseling board,
they may not be able to administer those same tests to the public for a fee in a private
practice. The same is often true for professional school counselors and school
psychologists. They may be able to administer the Wechsler Intelligence Scale for
Children — Fourth Edition (WISC-IV) or Woodcock-Johnson Tests of Achievement —
Third Edition (WJ-III ACH) during school hours to students as a condition of
employment, but be prohibited from administering these same tests in private practice
for a fee. While some instruments can be used with knowledge gained from the man-
ual, others require in-depth supervised training. To assist in the process of delineat-
ing which instruments require which level of training, a majority of publishers use a
level system similar (albeit not identical) to that described below.
Level A
Level A instruments can be administered, scored, and interpreted after studying the
manual, with no additional training or education required. However, employment
or affiliation with an institution or organization is sometimes required before the
publisher will agree to send the instrument. The Self-Directed Search (Holland,
Fritzsche, & Powell, 1994) is a Level A test.
Table 5.1 Guide to proper test selection
Test information
■ What is the name of the test?
■ Who are the test authors?
■ What company published the test?
■ When was the test published?
■ Are alternative forms of the test available?
■ How much does the test cost?
■ How long does it take to administer the test?
■ Is the test manual comprehensive (i.e., includes information on psychometrics, norms, item
development, etc.)?
■ Does the test have current norms and items?
■ Who is included in the standardization sample?
Test interpretation aids
■ Does the manual provide clear descriptions of the purposes and applications of the test?
■ Does the manual provide clear information regarding the training and qualifications needed
to administer the test?
■ Does the manual include example cases to aid in interpretation of the results?
Examinee considerations
■ What skills are needed by the examinee to take the test?
■ In what language are the test items written?
■ What is the reading/vocabulary level of the test items?
■ How are the test items presented?
■ How is the examinee expected to respond to the test items?
■ What adaptations can be made to the test items or test presentation to accommodate any
examinee disabilities?
■ Is the test free from bias?
■ Is the test administered to individuals or groups?
Technical adequacy
■ What types of reliability studies have been performed on the test scores?
■ What types of validity studies have been performed on the test scores?
■ Are the reliability and validity estimates adequate for the intended purpose?
Administration and scoring
■ Are the directions for administering the test appropriate and clear?
■ Are the directions for scoring the test appropriate and clear?
■ What options are available for scoring the test?
Interpretive scores and norms
■ Are the scales used for reporting test scores adequately presented and described?
■ Are the normative scores presented in an appropriate format (e.g., standard scores, percentile
ranks)?
■ Is the standardization sample appropriate and clearly described?
■ If more than one form of the test is available, are equivalent scores on the different forms
provided?
■ Does the test manual provide guidance on establishing local norms?
Level B
Level B instruments require specialized knowledge of psychometric issues and test
score properties, usually obtained by taking a graduate-level course in assessment. To
qualify at this level, the professional counselor administering the test must
have a master's degree in counseling, psychology, or a related field. In addition, the
professional counselor must have specific training and/or licensure or certification
recognized by the test publisher. The Reynolds Adolescent Depression Scale — Second
Edition (RADS-2), Woodcock-Johnson Tests of Achievement — Third Edition (WJ-III ACH),
and Slosson Intelligence Test — Revision 3 (SIT-R3) are examples of Level B tests.
Level C
Level C instruments require substantial knowledge about the construct being meas-
ured and about the instrument being used. Often, a doctorate in counseling, psy-
chology, or a related field and/or appropriate licensure or certification is required. In
addition, the professional counselor should have specific coursework or training re-
lated to assessment (generally) and to the instrument (specifically) or class of instru-
ments (e.g., intelligence, personality, projectives). Test publishers commonly use the
general levels described, although the designations sometimes vary. In addition, there
are often exceptions and variations due to state laws or regulations that the profes-
sional counselor should check prior to selecting an instrument. The Rorschach
Inkblot Test, the Wechsler Adult Intelligence Scale — Third Edition (WAIS-III), and the
Minnesota Multiphasic Personality Inventory — Second Edition (MMPI-2) are examples
of Level C tests.
Finally, it is good practice to administer, score, and interpret a test
under the supervision of a highly trained practitioner a number of times and on
volunteer participants prior to using the test for decision-making purposes with clients.
The number of practice administrations needed depends largely on the complexity of
the test. Practice administrations allow professional counselors to hone their skills on a
new instrument under competent supervision and in no-risk situations, enhancing
the ultimate competence of the examiner.
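The level system described above can be pictured as a simple lookup. The Python sketch below is only an illustration of the general A/B/C scheme as summarized in this section: the Examiner fields, the function names, and the exact decision rules are simplifying assumptions, and actual publisher requirements and state regulations vary and must be checked for each instrument.

```python
from dataclasses import dataclass

# Levels ordered from least to most restrictive.
LEVEL_ORDER = {"A": 0, "B": 1, "C": 2}
DEGREE_RANK = {"bachelor": 0, "master": 1, "doctorate": 2}

# Example instruments and levels named in the text.
TEST_LEVELS = {"Self-Directed Search": "A", "RADS-2": "B", "MMPI-2": "C"}


@dataclass(frozen=True)
class Examiner:
    highest_degree: str          # "bachelor", "master", or "doctorate"
    assessment_coursework: bool  # graduate-level course in assessment
    licensed_or_certified: bool  # credential recognized by the publisher
    instrument_training: bool    # training specific to the instrument or its class


def highest_level(e: Examiner) -> str:
    """Return the most restrictive level this examiner plausibly qualifies for.

    Simplified, hypothetical rules loosely following the descriptions above.
    """
    degree = DEGREE_RANK[e.highest_degree]
    advanced = degree >= 2 or (degree >= 1 and e.licensed_or_certified)
    if advanced and e.assessment_coursework and e.instrument_training:
        return "C"
    if degree >= 1 and e.assessment_coursework and (
        e.instrument_training or e.licensed_or_certified
    ):
        return "B"
    return "A"


def may_administer(e: Examiner, test_name: str) -> bool:
    """True if the test's level does not exceed the examiner's highest level."""
    return LEVEL_ORDER[TEST_LEVELS[test_name]] <= LEVEL_ORDER[highest_level(e)]


intern = Examiner("bachelor", False, False, False)
counselor = Examiner("master", True, True, False)
psychologist = Examiner("doctorate", True, True, True)

print(highest_level(intern), highest_level(counselor), highest_level(psychologist))
# A B C
```

Under these assumed rules, the master's-level counselor could use the RADS-2 but not the MMPI-2 until instrument-specific training is completed.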
Examinee Preparation
The first step in any assessment is to prepare the test takers for the test they are about
to take. Because many standardized tests are administered in school settings or to
school-aged youth in agency or private settings, professional counselors who work
with children and adolescents must be able to adapt their assessment skills and
knowledge toward younger age groups.
The professional counselor's job is to familiarize clients and students with the
type of assessment they will be taking. This may seem like common sense, but many
professional counselors fail to take into account the client's familiarity with the test
and the testing procedure. People should be informed of the type of test (e.g., math
achievement, career interest inventory, personality inventory), whether the test is
timed, and what the test is designed to measure. Professional counselors should ap-
proach the assessment positively while helping clients and students to ease and man-
age their test anxiety.
Environmental Concerns
Another important aspect of preparing students for assessments is preparing and
maintaining the proper testing environment. The proper assessment environment is
one that is distraction-free, provides proper space for working, and discourages
cheating. While the ideal assessment environment can be difficult to provide, test
administrators should strive to ensure there are relatively few distractions during the
testing process. Minimizing distractions can be accomplished by not allowing
examinees to wander around the room, make unnecessary noise, or have materials
unrelated to the test with them in the testing session. Likewise, the examiner should
ensure the testing environment provides adequate lighting, a comfortable temperature,
and sufficient work space for the task at hand.
Testing Procedures
The actual process of test administration is often very straightforward. Standardized
test administration is a very rigid and scripted process. The primary requirement for
administering a published test is that the examiner strictly follows testing procedures
described in the test's manual. Due to the many variations in testing procedures
found in different published tests, it is critically important that the test administra-
tor be familiar with the specific test procedures and materials used. Testing proce-
dures can include, but are not limited to, test directions, time limits, and registration
and identification procedures.
The majority of test manuals stress the importance of the manner in which test
directions are given. In most cases, test directions are to be read word for word fol-
lowing a script that is laid out in the manual. Any deviation from the protocol may
result in invalid test results. The primary function of verbatim instructions is to en-
sure that uniform testing conditions are present for all test takers. Whenever possi-
ble, professional counselors should memorize directions for administering and scor-
ing test items. Even though the test manual is still referred to, memorization tends
to help the administration flow more seamlessly, significantly reducing pauses by the
administrator to locate a needed passage or judge the accuracy of a client response.
Thus the professional counselor's demonstrated knowledge of and comfort with the
test helps to establish a relational rapport and projects administrator competence and
confidence.
While not all tests employ time limits, time limits are frequently a vital part of
the testing procedure. Test administrators should be familiar with the time limits for
different items or subtests. Administrators should also carry some sort of timing
device (e.g., stopwatch, wristwatch, clock, egg timer) with them so that they are
aware of the time limits at all times. Ending a testing session too early or ending late
may result in invalid test results.
Many published tests also have specific procedures for examinee registration and
identification, particularly high-stakes aptitude or achievement tests (e.g., SATs,
graduate record examinations, advanced placement examinations). Sometimes ex-
aminees must identify themselves through means such as their names or Social
Security numbers. In an attempt to discourage cheating, some tests also require ex-
aminees to present one or more forms of identification, both before and after a test-
ing session. Professional counselors conducting assessments for employers, govern-
ment program eligibility, or even community mental health services must be equally
vigilant to ensure that client results are accurate and legitimate.
Despite the test publisher's vigorous attempts to provide a uniform testing ex-
perience for all examinees, there are sometimes deviations. According to the
Responsibilities of Users of Standardized Tests (RUST-3) statement (AACE, 2003a) and
Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999), any
deviations from the test procedure should be documented by the test administrator.
Many test protocols contain a section in which the examiner may record and de-
scribe problems or unusual circumstances that may occur during the testing session.
The professional counselor should take any irregularities under consideration when
interpreting test results.
While deviations from standardized testing procedures are not required for the
average examinee, test administrators should be aware of the special considerations
given to examinees with disabilities or to very young examinees. The majority of
published group-administered tests (particularly those administered by schools, in-
stitutions of higher education, or licensure or other professional boards) require that
an examinee show proof of a disability before being given special accommodations
under the Individuals with Disabilities Education Improvement Act of 2004
(IDEIA), the Americans With Disabilities Act of 1990 (ADA), or Section 504 of the
U.S. Rehabilitation Act of 1973. While there is no set standard on the requirement
of proof of disability, many institutions or test publishers require that the examinee
in question have a written report on file that documents the disability. The reports
must come from a legitimate source, usually a licensed specialist, and must be
current (usually less than three years old, depending on the test). Common considerations
for individuals with disabilities include extended time on tests, longer breaks, Braille
tests, oral instructions, dictated responses, and computer-assisted technology.
Factors Affecting Test Scores
During the process of test administration, the test administrator should be aware of
the many factors that can affect test scores. While the administrator should strive to
maintain these variables at a minimum level, not all variables can be controlled.
Table 7.1 (see Chapter 7) contains a summary of important test-related factors. A
comprehensive treatment of these factors affecting client and student responses is
provided by Erford (2006).
TEST SCORING
By definition, test scores are simply the numerical result of testing. Test scores sum-
marize the information obtained through the testing process by using numbers that
the test administrator may interpret. The use of numbers allows test administrators
to describe and quantify examinee performance in a standardized manner.
Assessment instruments may be scored by a wide array of people. For example,
some instruments are designed to be self-scored (i.e., Level A tests). This type of scor-
ing usually consists of adding columns of scores or counting the number of items re-
sponded to. Some tests can also be scored by persons other than the client or exam-
iner (e.g., clerical staff, interns). While having others score assessments may save the
test administrator time on the front end, the test administrator should always
recheck the test scores to minimize the chance of error. Under most circumstances
in clinical practice, the professional counselor will score the protocols for Level B
and Level C tests. As stated above, the reason for this practice is that the use of Level
B and C tests requires advanced education and training.
While most assessment instruments may be scored by hand, computer-assisted
scoring programs are becoming increasingly common. Some tests may also provide
templates to aid the examiner in scoring the test by hand. Despite the aid offered by
test templates, hand scoring for many tests is tedious for even the most experienced
examiner. Because hand scoring takes more time, many
examiners prefer to use computerized scoring programs and services for the longer or
more complicated tests. For example, for the MMPI-2, computer scoring time is
virtually instantaneous after the items are entered into the scoring program (which
usually requires 5 to 10 minutes). Depending on the scoring program used, many, if not
most, of the MMPI-2 scales listed in Table 7.10 (see Chapter 7) can be obtained in a
matter of seconds. In contrast, using the scoring stencils may require well more than
an hour to obtain the same set of scores. Of course, both methods have risks of in-
accuracies due to human error. Thus, when using computerized scoring programs, it
is essential to double-check all score entries; when using scoring forms or stencils, it
is equally important to double-check the derived scores.
Wise and Plake (1990) conducted a study in which computer scoring was com-
pared to hand scoring. The researchers concluded that computer scoring is more ac-
curate, faster, and more thorough than hand scoring. An added advantage of com-
puter scoring is the fact that computers are completely unbiased. Unless modified
by the examiner, computers will not discriminate against examinees on the basis of
individual differences such as sex, religion, race, sexual preference, or socioeconomic
status. Computers can also aid examiners in complex test interpretations that can
take human interpreters days. Of course, this does not mean that the interpretations
derived by the computer are more accurate than those of a skilled clinician.
While computerized scoring procedures are a useful aid to clinicians, they are
not infallible due to their reliance on human programmers. In an attempt to mini-
mize computer scoring errors, the Standards for Educational and Psychological Testing
(AERA et al., 1999) requires test scoring services to provide documentation of their
programming procedures.
1 66 Chapter 5
Despite the increasing availability of computer scoring programs, some types of
tests require human interpretation. For example, projective personality tests usually
require a professional counselor to interpret information that computers are unable
to interpret, although recent efforts have resulted in attempts to standardize scoring
and interpretations of some techniques (Exner, 2002; McArthur & Roberts, 2005).
Professional judgment may also be required for some individually administered in-
telligence, aptitude, achievement, personality, and clinical tests.
It is always important for the professional counselor to remember that test scores
serve a wide variety of functions in a variety of different settings. School personnel
can use test scores to determine student placement. Teachers use test scores to ana-
lyze their lesson plans and teaching methods. Professional counselors can use test
scores to communicate examinee performance to clients, parents, or other stakehold-
ers. The common link among all the above examples is that professionals use test
scores to guide them in their decision-making responsibilities.
Professional Standards in Testing
Although each test publisher generally includes a set of minimum standards for the
examiner to follow, several professional organizations provide additional ethical
guidelines or standards for proper test administration and scoring. For example, the
American Counseling Association's Code of Ethics (ACA, 2005a), the RUST-3 state-
ment (AACE, 2003a), the Standards for Educational and Psychological Testing (AERA
et al., 1999), and the National Board of Certified Counselors' Code of Ethics (1989)
all encourage professional counselors administering tests to use appropriate proce-
dures, techniques, and strategies related to the consideration of individual differences
in sex, gender, ethnicity, and socioeconomic status of the examinee. Table 5.2 is of-
fered as an amalgamated guide for the proper administration and scoring of tests.
Table 5.2 Summary guidelines for administering and scoring tests
Examiner preparation
1. Administer only tests for which you have been thoroughly trained.
2. Read and learn all instructions.
3. Adhere to standardization procedures.
a. Cite instructions to examinees exactly as the test manual prescribes.
b. Present test items according to prescribed time limits.
c. Follow scoring guidelines rigidly.
d. Document any deviations from standardized procedures or testing irregularities.
4. Administer the test in an objective manner.
a. Reinforce participation but give no indication of accuracy or inaccuracy of examinee's
responses (e.g., "You're doing fine. Keep trying your best.").
b. Remember that you are testing, not teaching. Pay close attention to verbal (e.g.,
intonation of voice) and nonverbal cues (e.g., eye glances, head nods).
5. Administer the test in a natural manner.
a. Achieve rapport with the examinee before administering any test items.
b. Use standardized wording in a normal and nonthreatening manner.
6. Prepare the testing environment by removing distractions and avoiding clutter.
a. Have the examinee face away from doors, windows, or other areas that may distract
attention from the test.
b. Have the examinee complete the test in a quiet area.
c. When possible, avoid testing the examinee when he or she presents as hurried, worried,
or ill (unless these are the conditions that prompted the evaluation or the client's
normal state).
7. Provide optimum testing conditions.
a. Provide the examinee with comfortable seating and make sure he or she can see the test
materials clearly.
b. Provide a well-lit room with a comfortable temperature.
c. Provide instructions in a clear, audible voice at a moderate rate of speed.
d. Help the examinee maintain interest through enthusiastic presentation of the items and
attention for effort.
e. Provide social attention and encouragement for general performance, not for specific
items.
f. For maximum performance tests, let the examinee know that you want to see how well
he or she can do on this test administration.
Test administration
1. Administer the test in an efficient manner. Have an efficient system for
a. Recording answers.
b. Viewing the manual without distracting the examinee.
c. Bringing out test materials and storing them away after use.
d. Avoiding delays.
2. Make smooth transitions from (sub)test to (sub)test.
3. Know test administration guidelines and test materials well enough to avoid overextending
the test experience for the examinee.
a. Always begin at designated starting points.
b. Score each item correctly and efficiently.
4. Learn how to appropriately handle distractions from the examinee.
a. Avoid attending to inappropriate remarks.
b. Ignore inappropriate movements if they are not distracting to the examinee's test
performance.
c. Redirect the examinee to the task at hand if remarks or movements become too
distracting.
Scoring the test items
1. Know the scoring standards well, so you thoroughly understand the intent behind each
item.
2. Remember that scoring standards provide guidelines for scoring items. When in doubt,
score examinee answers in relation to the intent behind the item.
3. Review the guidelines in the manual to verify any unclear answers provided by the
examinee.
4. Check and recheck every step of the scoring procedure.
5. Check and recheck all figures and calculations.
Test storage and care of materials
1. Place all examinee protocols and other information in client folders in a proper storage (i.e.,
locked) cabinet to protect the confidentiality of the responses and personal information.
2. Store all materials in a safe, secure place to prevent unwarranted wear and exposure to
untrained personnel.
3. Replace any materials that are worn so that these materials do not become distracting to the
examinee.
4. Point to pictures with a finger or eraser of the pencil to avoid placing marks on the page.
5. Replace any materials that are lost or damaged with objects identical to the original from the
testing company.
NORM-REFERENCED INTERPRETATION
Tests are usually administered to assess important domains in the examinee's life. For
example, intelligence tests evaluate cognitive functioning; achievement tests evaluate
academic functioning; adaptive behavior measures evaluate important daily living
skills (e.g., communication, motor skills, social functioning); career inventories
measure interests, skills, and values; and clinical or personality measures evaluate
inter- and intrapersonal functioning. When these large domains of functioning are
assessed, the examinee's raw score is usually transformed and then compared to the
performance of other individuals with similar characteristics (e.g., age, gender, eth-
nicity). For a norm-referenced test, this population of individuals is referred to as the
standardization sample, normative sample, or the norm group. The comparison
scores are called derived scores and are placed into two groups: developmental scores
and scores of relative standing (Salvia & Ysseldyke, 2004).
Developmental Equivalents
One type of transformed or derived score is called a developmental equivalent. The
two most common types of developmental equivalents are age equivalents and grade
equivalents. Both of these equivalent scores are obtained by determining the average
score obtained on a test by different groups of examinees who vary in age or grade
placement. Specifically, an age equivalent means that the examinee's raw score is the
average (mean or median) performance for a particular age group. For example, if
the average raw score for 11-year-old children (11 years, 0 months) on a particular
test is 15 items correct out of a 30-item test, then any examinee obtaining a score of
15 would receive an age-equivalent score of 11-0 (11 years, 0 months). Therefore,
the age-equivalent score is obtained by computing the mean or median raw score on
a test for a group of children of a specific age. It is also important to note that
age-equivalent scores are expressed in years and months with a hyphen between the year
and the month (i.e., 11 years, 2 months is expressed as 11-2).
A grade-equivalent score is obtained by computing the average (mean or me-
dian) raw score on a test obtained by examinees in a specified grade. For example, if
the average score of 6th-graders on a mathematics test is 25, then any examinee ob-
taining a score of 25 is reported to have math knowledge at the 6th-grade level.
Grade-equivalent scores are expressed in grades and tenths with a decimal between
the two numbers (i.e., 6.5 refers to the average performance of children at the
middle of the 6th grade; 2.1 refers to the average performance of children during the first
month of the 2nd grade).
Salvia and Ysseldyke (2004, pp. 92-93) appropriately pointed out five concerns
when using age- and grade-equivalent scores. These are:
1. Systematic misinterpretation. Examinees who earn an age-equivalent score of 11-0
have answered as many questions correctly as the average for examinees who
are 11 years of age. Obtaining this score does not mean that the examinee
performed on the test in the same manner that an 11-year-old student would have
performed. In a similar fashion, a 2nd-grader and a 6th-grader may have both
earned a grade equivalent of 3.0; however, it is very probable that they did not
attack the items on the test in the same manner. Developmentally, their thought
processes may be quite different. In other words, it is essential to communicate
to clients, teachers, and parents that just because a 4th-grader receives a grade
equivalent (GE) of 8.5, this does not mean the student is as "smart" as an
8th-grader.
2. Interpolation and extrapolation. It is important to remember that average age- and
grade-equivalent scores are only estimates of functioning and represent groups
of examinees that were not actually tested. Loosely defined, interpolation means
guessing within the bounds of what is known. Thus, if one knows that a raw
score (RS) of 25 yields a grade equivalent of 2.5 (GE = 2.5) and a raw score of
35 (RS = 35) yields a grade equivalent of 3.5 (GE = 3.5), it is reasonable to con-
clude that each raw score point between 25 and 35 raises the grade equivalent by
0.1. Thus, a RS of 27 would be a GE = 2.7, and a RS = 33 would be a GE = 3.3.
Whether this has been demonstrated empirically or not, such interpolations
make sense because some empirical results do exist upon which to base a conclu-
sion. Interpolation, while often somewhat inaccurate, is quite benign in compar-
ison to extrapolation. Extrapolation involves guessing outside the bounds of what
is known. Following with the example above, what grade equivalents might one
assign to raw scores of less than 25, particularly if no one younger than a grade
level of mid-2nd grade actually made up the norm group? Extrapolation provides
these estimations. A test developer may extrapolate that the linear relationship
noted between GEs of 2.5 and 3.5 continues in the downward direction. Thus
the author assumes that a RS = 15 would yield a GE = 1.5, and a RS = 19 would
yield a GE = 1.9, etc. Of course, such guesswork without the benefit of empiri-
cal support is shoddy at best, dangerous at worst. This is just one reason why de-
velopmental equivalents should be avoided.
3. Typological thinking. Examinees are always being compared to an average that
does not actually exist. For example, the average American family may be re-
ported to have 1.7 cars, with a 2.5-bedroom house, and 2.4 children. However,
it is simply impossible to have 0.4 of a child. Therefore, the average score simply
represents a statistical abstraction.
4. False standards of performance. Students are expected to perform at their age and
grade levels. Eleven-year-olds are expected to perform at the 11-0 level on a test,
and 6th-graders are expected to perform at the 6.0 level. However, equivalent
scores are constructed in such a manner that at least half (50%) of any age group
or grade group will perform at or below the age or grade level, because half of
the group always earns scores at or below the median. This means that a princi-
pal who insists that all 2nd-graders complete the year reading at a GE = 2.9 or
higher is being statistically naive. The professional counselor should explain that
in the average classroom, only 50% of 2nd-graders can be expected to be at
GE = 2.9 or higher.
5. Scales are ordinal, not equal-interval. The scales often used to obtain age and
grade equivalents are ordinal; therefore, the intervals are not equal. As a result,
the scores on these scales cannot be added, subtracted, or multiplied. Thus
school systems that determine student eligibility for remedial services by requir-
ing the student's reading or math achievement to be "two grade levels below cur-
rent grade placement" are being statistically inappropriate. A two-grade-level dif-
ference yields very different results at different grade levels.
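The interpolation and extrapolation described in concern 2 can be made concrete with a short sketch. It assumes a strictly linear relationship between the two empirically known points from the example (RS = 25 at GE = 2.5 and RS = 35 at GE = 3.5). The function name and the extrapolation flag are illustrative assumptions, not part of any published scoring procedure.

```python
def grade_equivalent(raw_score, low_point, high_point):
    """Estimate a grade equivalent (GE) on the line through two known points.

    low_point and high_point are (raw_score, grade_equivalent) pairs that were
    actually observed in the norm group. Returns (ge, is_extrapolated).
    """
    (rs_lo, ge_lo), (rs_hi, ge_hi) = low_point, high_point
    slope = (ge_hi - ge_lo) / (rs_hi - rs_lo)      # GE change per raw-score point
    ge = ge_lo + slope * (raw_score - rs_lo)
    # Anything outside the observed raw-score range is a guess, not a finding.
    is_extrapolated = raw_score < rs_lo or raw_score > rs_hi
    return round(ge, 1), is_extrapolated


known_low, known_high = (25, 2.5), (35, 3.5)
for rs in (27, 33, 15, 19):
    ge, guessed = grade_equivalent(rs, known_low, known_high)
    label = "extrapolated" if guessed else "interpolated"
    print(f"RS = {rs}: GE = {ge} ({label})")
```

The arithmetic is identical for raw scores of 27 and 15; only the flag shows that the second estimate rests on no tested examinees, which is the danger the text describes.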
It is essential to note that developmental equivalents are frequently misunder-
stood, miscommunicated, and misused. While professional counselors should be
aware of the existence of developmental equivalents and be prepared to explain
them, professional counselors should avoid using developmental equivalents when
explaining client or student scores.
Scores of Relative Standing
Unlike developmental scores, scores of relative standing have equal units of meas-
urement. As such, scores on the same test for several different examinees of different
ages can be compared. Additionally, different scores on several different instruments
can be compared for the same person. The major types of scores of relative standing
used in norm-referenced measurement include standard scores and percentile ranks.
Figure 5.1 demonstrates the relationship between these scores.
Standard scores
Standard scores are raw scores that have been mathematically transformed to have a
designated mean and standard deviation. A standard score expresses how far an
examinee's score lies from the mean of the norm group, in standard deviation units. Five
commonly used standard-score distributions are z-scores, T scores, deviation IQs,
normal-curve equivalents, and stanines.
Z-scores
A z-score has a mean of 0 and a standard deviation of 1. As such, a z-score simply
indicates how many standard deviations above or below the mean a given score falls.
A z-score is obtained by subtracting the mean of the norm group ($M_X$) from the
examinee's raw score ($X$) and then dividing by the standard deviation ($SD_X$) of the
norm group: $z = (X - M_X) / SD_X$. Almost all z-scores (99.7%) lie between -3.0 and +3.0.
If an examinee obtains a z-score of 2.0, the examinee has performed 2.0 standard
deviations above the mean of the group. A z-score of -1.5 is 1.5 standard deviations
below the mean of the group. A z-score of 0 is at the mean performance of the group.
Z-scores are commonly used in empirical research studies.
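As a brief illustration of the z-score computation, the sketch below uses a hypothetical norm group with a mean of 20 and a standard deviation of 4. These values are assumed for the example and are not taken from any instrument.

```python
def z_score(raw_score, norm_mean, norm_sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the norm-group mean."""
    if norm_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw_score - norm_mean) / norm_sd


# Hypothetical norm group: M = 20, SD = 4
for raw in (28, 14, 20):
    print(raw, z_score(raw, norm_mean=20, norm_sd=4))
```

Raw scores of 28, 14, and 20 correspond to the z-scores of 2.0, -1.5, and 0 discussed in the text.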
T scores
In order to remove the - and + signs, z-scores are often transformed into other
scores, such as T scores. A T score has a mean of 50 and a standard deviation of
10. Many test manuals transform raw scores directly into T scores, but a z-score
can be transformed into a T score using the following formula: T = 10(z) + 50.
Using the examples above, a z-score of 2.0 would be transformed into a T score of
70 (T = 10(2.0) + 50 = 70). A z-score of -1.5 would be transformed into a T score
of 35 (T = 10(-1.5) + 50 = 35). A z-score of 0 would be transformed into a T score
of 50 (T = 10(0) + 50 = 50). T scores are commonly reported in behavioral,
personality, and clinical inventories.
Figure 5.1 The normal curve and related standardized scores. Vertical axis: number of scores. Horizontal axis: score on Wechsler Adult Intelligence Scale (55, 70, 85, 100, 115, 130, 145). Labeled areas under the curve: 0.1%, 2%, 14%, 34%, 34%, 14%, 2%, 0.1%, with brackets marked 68% and 96%.
Deviation IQs
Deviation IQs have a mean of 100 and a standard deviation of 15 or 16, depending
on the instrument used (nearly all currently use SD = 15). All of the Wechsler Scales
have a standard deviation of 15; however, the Slosson Intelligence Test-Revised (SIT-R3)
(Nicholson & Hipshman, 1990) uses a standard deviation of 16. While most
test manuals transform raw scores directly into deviation IQs (M = 100, SD = 15), a
z-score can be transformed into a deviation IQ score using the following formula:
Dev. IQ = 15(z) + 100. Therefore, an examinee with a z-score of 2.0 would have a
deviation IQ of 130 (Dev. IQ = 15(2.0) + 100 = 130). A z-score of -1.5 would be
transformed into a deviation IQ of 78 (Dev. IQ = 15(-1.5) + 100 = 77.5, which rounds
to 78). A z-score of 0 would be transformed into a deviation IQ of 100 (15(0) + 100 = 100).
It is important to note that the formula would change if the instrument has a standard
deviation of 16. For example, a z-score of 2.0 would be transformed into a deviation IQ
of 132 (16(2.0) + 100 = 132). Deviation IQ scores are frequently reported for tests
of intelligence, achievement, and perceptual skills.
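The T score and deviation IQ transformations above are both linear rescalings of z, which the following sketch makes explicit. The helper names are illustrative, and rounding halves upward to a whole number is an assumption chosen to match the worked example (77.5 reported as 78); individual test manuals may round differently.

```python
import math


def round_half_up(value):
    """Round to the nearest whole number, with .5 rounding upward (assumed convention)."""
    return math.floor(value + 0.5)


def t_score(z):
    """T = 10(z) + 50."""
    return 10 * z + 50


def deviation_iq(z, sd=15):
    """Dev. IQ = SD(z) + 100, where SD is 15 for most instruments and 16 for a few."""
    return round_half_up(sd * z + 100)


for z in (2.0, -1.5, 0.0):
    print(f"z = {z:+.1f}: T = {t_score(z):.0f}, Dev. IQ = {deviation_iq(z)}")
print("z = +2.0 with SD = 16:", deviation_iq(2.0, sd=16))
```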
Normal-curve equivalents
Normal-curve equivalents (NCEs) are standard scores with a mean of 50 and a stan-
dard deviation of 21.06. The standard deviation is set at 21.06 because this transfor-
mation divides the normal curve into 100 equal units or intervals.
Stanines
Stanines is shortened from the term "standard nines." Stanines are standard-score
bands that divide a distribution into nine parts with a mean of 5 and a standard de-
viation of 2. These scores are expressed as whole numbers from 1 to 9. When scores
are converted to stanines, the shape of the original distribution changes into a nor-
mal curve. Stanines are frequently provided by publishers of large-scale testing pro-
grams. Their use should be limited, and caution in interpretation is warranted be-
cause educators and parents often express concern that a client's score has dropped
from, say, the fifth to the fourth stanine. In actuality, this "drop" could be a differ-
ence of a single raw score point.
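A minimal sketch of these two transformations follows. The NCE formula uses the mean and standard deviation given above. The stanine rule (half-standard-deviation bands centered on the mean, truncated to the range 1 to 9) is a common approximation assumed here for illustration; published conversion tables may place the band boundaries slightly differently.

```python
import math


def nce(z):
    """Normal-curve equivalent: M = 50, SD = 21.06."""
    return 21.06 * z + 50


def stanine(z):
    """Approximate stanine: M = 5, SD = 2, reported as a whole number from 1 to 9."""
    band = math.floor(2 * z + 5.5)   # nearest whole number to 2z + 5
    return max(1, min(9, band))


print(nce(0.0), stanine(0.0))            # a score at the mean
print(stanine(-0.24), stanine(-0.26))    # two nearly identical scores
```

The last line illustrates the caution in the text: z-scores of -0.24 and -0.26 are practically indistinguishable, yet they fall on opposite sides of the boundary between the fifth and fourth stanines.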
Percentile Ranks
Percentile ranks, also referred to as percentiles, are derived scores indicating the per-
centage of individuals whose scores fall at or below a given raw score. It is impor-
tant to note that the terms percentile rank and percentage correct are not the same.
For example, a percentage score of 50 means 50% of the items were correct (a pro-
portion of correct to total points), while an examinee who obtains a percentile rank
of 50 on a standardized test has scored the same as or better than 50% of the examinees
in the norm group. Percentiles allow comparison of a client's score with other
scores. Percentages only allow comparisons with some standards. Although per-
centile ranks are fairly easy to understand, their psychometric properties limit their
usefulness. Still, percentile ranks are essential staples in test interpretation because
of their ease of understanding. Unlike z-scores or T scores, percentile ranks are not
evenly distributed across the normal curve. In fact, raw score differences between
percentile ranks are smaller near the mean of the distribution and larger at the ex-
tremes of the distribution.
It is also essential to understand that small differences in a client's raw score
around the mean can lead to large changes in percentile rank. It is often helpful to
explain percentile ranks using a visualization of a line of 100 individuals of the same
age (or grade), with the 1st individual in the line being the lowest performer (e.g.,
poorest math student, least depressed, least hyperactive) and the 100th person in the
line being the highest performer (e.g., best math student, most depressed, most hy-
peractive). Thus an individual scoring at the 95th percentile rank exceeded the per-
formance of 95% of same-aged peers. A person scoring at the 5th percentile rank
outperformed only 5% of same-aged peers. Importantly, because the normal curve
theoretically runs in each direction to infinity, it is theoretically impossible to achieve
the percentile rank end points of 0 or 100.
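The uneven spacing of percentile ranks can be checked numerically. The sketch below assumes scores are normally distributed and computes the percentage of the distribution at or below a given z-score from the normal curve using Python's standard `math.erf`; it is an approximation for illustration, not a substitute for a test's own norm tables.

```python
import math


def percentile_from_z(z):
    """Percentage of a normal distribution falling at or below z."""
    return 50 * (1 + math.erf(z / math.sqrt(2)))


# The same half-standard-deviation gain, near the mean and in the upper tail.
near_mean = percentile_from_z(0.5) - percentile_from_z(0.0)
in_tail = percentile_from_z(2.5) - percentile_from_z(2.0)
print(f"z 0.0 -> 0.5 gains {near_mean:.1f} percentile points")
print(f"z 2.0 -> 2.5 gains {in_tail:.1f} percentile points")
```

Under this assumption a half-standard-deviation improvement moves a client about 19 percentile points near the mean but fewer than 2 points in the tail, and even extreme z-scores never reach exactly 0 or 100.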
Think About It 5.1 How would you interpret a percentile rank score of
84 to a client being assessed for depression? Be sure to include a good
explanation of what percentile ranks are.
Table 5.3 SEM at a given age level
Age (yr-mth)     Reliability   68% LOC (± 1 SEM)   95% LOC (± 2 SEM)   99% LOC (± 2.58 SEM)
12-0 to 12-11    0.80          ±6.7                ±13.4               ±17.3
13-0 to 13-11    0.83          ±6.2                ±12.4               ±16.0
14-0 to 14-11    0.86          ±5.6                ±11.2               ±14.5
15-0 to 15-11    0.90          ±4.7                ±9.5                ±12.2
16-0 to 16-11    0.93          ±4.0                ±7.9                ±10.2
17-0 to 17-11    0.96          ±3.0                ±6.0                ±7.7
Note: Ages presented in years and months. Confidence intervals are reported in standard scores (M = 100; SD = 15).
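The values in Table 5.3 can be reproduced with a short sketch. It assumes the conventional formula SEM = SD × √(1 − reliability); the text defers the computation to Chapter 3, so the formula is stated here as an assumption, and the function names are illustrative.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_half_widths(sd, reliability):
    """Half-widths of the 68%, 95%, and 99% bands (1, 2, and 2.58 SEM), rounded to one decimal."""
    one_sem = sem(sd, reliability)
    return tuple(round(multiplier * one_sem, 1) for multiplier in (1, 2, 2.58))


for reliability in (0.80, 0.83, 0.86, 0.90, 0.93, 0.96):
    print(reliability, confidence_half_widths(15, reliability))
```

With SD = 15, a reliability of 0.80 gives half-widths of 6.7, 13.4, and 17.3 standard-score points, while a reliability of 0.96 gives 3.0, 6.0, and 7.7, showing how quickly the band narrows as reliability rises.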
Quartiles
Percentile ranks that divide a distribution into four equal parts are called quartiles.
With quartiles, each part contains 25% of the norm group. The first quartile (Q1)
contains percentile ranks of ≤25; Q2 contains percentile ranks of 26-50; Q3 contains
percentile ranks of 51-75; and Q4 contains percentile ranks of >75.
Applying Standard Error of Measurement (SEM) to Test Scores
The score that a client or student obtains on a given test is called the observed score.
Recall from the discussion of reliability and standard error of measurement (SEM)
in Chapter 3 that all test scores have some measurement error, and this score error
can be expressed using a band of confidence around the observed score to indicate
the likely presence of the true score (i.e., the client's actual score if no measurement
error was present). This confidence band reflects the test's standard error of measure-
ment, which is influenced by a test's reliability (see Chapter 3 for an explanation of
how SEM is computed). SEM is essential to test score interpretation because it is
misleading to report a score as if it is "the truth, the whole truth, and nothing but
the truth." Realistically, the score a client receives on a test may vary up or down on
readministration of that test — and this is normal. The more reliable a test score is,
the less variability will be expected upon retest; conversely, the lower the reliability,
the greater the variability.
Most test manuals and computer scoring programs provide SEM for standard
scores (SS) obtained by students and clients. Sometimes this information is included
in a table that indicates the SEM at a given age level and for a certain level of confi-
dence, such as provided in Table 5.3. In these cases, the confidence interval (CI) is
computed as CI = SS ± SEM. For example, if the observed score is a standard score
of 105 and the SEM equals 5 standard-score points, then the confidence interval is
105 ± 5 or a range of 100-110. However, an important consideration in determining
confidence intervals is the level of confidence to display. Recall from Chapter 3
that ± 1 SEM is the 68% level of confidence, ± 2 SEM is the 95% level of confi-
dence, and ± 2.58 SEM is the 99% level of confidence. Under normal circumstances,
professional counselors should interpret scores at the 95% level of confidence (± 2
SEM), meaning the client's true score will probably lie within the given range 95
times out of 100 (alternate-form administrations of the test).
Table 5.4 Observed scores with ranges of standard scores
Test                       Standard score; range   Percentile rank; range   Interpretive range
WISC-IV IQ                 111; 101-121            77; 53-92                Average to Superior
WJ-III Math Calculation    92; 82-102              29; 12-55                Low Average to Average
WJ-III Applied Problems    77; 67-87               6; 1-19                  Deficient to Low Average
VMI-4                      95; 85-105              37; 16-63                Low Average to Average
Note: For the purpose of this example, it is assumed that 1 SEM = 5 standard score points for all four measures. Note that this is not usually the
case. All scores are interpreted at the 95% level of confidence (i.e., ±10 standard score points).
Table 5.4 presents an example of several observed scores with ranges of standard
scores determined at the 95% level of confidence. Note that these scores have also
been converted into percentile ranks and interpretive ranges.
Notice that the observed WISC-IV IQ score is 111. Interpreting this score at the
95% LOC (level of confidence) with 1 SEM equal to 5 standard score points means
the range of scores surrounding the score is 111 ± 10, or 101-121 (i.e., if 1 SEM = 5
SS points, then 2 SEM = (2 × 5) = 10 SS points; thus 111 - 10 = 101 and 111 + 10 =
121). Next, these standard scores should be converted to percentile ranks to make
them easier to explain to clients, students, parents, teachers, or other stakeholders.
This can be easily accomplished by using Table 5.5. In this case, a deviation IQ of 111
converts to a percentile rank of 77. Also, a SS of 101 is a percentile rank of 53, and a SS
of 121 is a percentile rank of 92. Finally, the standard score range of 101-121 is
converted to the appropriate interpretive ranges (i.e., brief verbal descriptors), which can
also be found in Table 5.5. In this case, a SS of 101 is in the Average range, and a SS
of 121 is in the Superior range. Thus the interpretive range is Average to Superior.
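The three steps just described (build the band, convert to percentile ranks, attach interpretive ranges) can be sketched as follows. Two assumptions apply. First, percentile ranks are approximated from the normal curve rather than looked up in Table 5.5, so results may occasionally differ from the table by a point. Second, the range cutoffs are read from the interpretive-range column of Table 5.5, and all function and variable names are illustrative.

```python
import math


def confidence_band(observed, sem, n_sem=2):
    """Return (low, high) for observed ± n_sem * SEM; n_sem = 2 is the 95% level."""
    half_width = n_sem * sem
    return observed - half_width, observed + half_width


def percentile_rank(standard_score, mean=100, sd=15):
    """Approximate percentile rank from the normal curve (not a table lookup)."""
    z = (standard_score - mean) / sd
    return round(50 * (1 + math.erf(z / math.sqrt(2))))


# Lower bounds of the interpretive ranges as they appear in Table 5.5.
RANGES = [(130, "Very Superior"), (120, "Superior"), (110, "High Average"),
          (90, "Average"), (80, "Low Average"), (70, "Borderline")]


def interpretive_range(standard_score):
    """Verbal descriptor for a deviation IQ or standard score (M = 100, SD = 15)."""
    for lower_bound, label in RANGES:
        if standard_score >= lower_bound:
            return label
    return "Very Deficient"


low, high = confidence_band(111, sem=5)
percentiles = tuple(percentile_rank(s) for s in (low, 111, high))
ranges = (interpretive_range(low), interpretive_range(high))
print((low, high), percentiles, ranges)
```

For the WISC-IV example this yields a band of 101 to 121, percentile ranks of 53, 77, and 92, and an interpretive range of Average to Superior, matching the values worked out in the text.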
A professional counselor's interpretation of the scores in Table 5.4 when presenting
them to clients, teachers, parents, guardians, or other stakeholders might go like
this:
Juan's performance on the WISC-IV exceeded that of 77% of other children his
age. His true score probably falls in the percentile rank range of 53 to 92. This
performance is Average to Superior. His score on the WJ-III ACH Math
Calculation subtest exceeded the performance of 29% of other children his age.
His true score probably falls in the percentile rank range of 12 to 55. This per-
formance is Low Average to Average. Juan's performance on the WJ-III ACH
Applied Problems subtest, a measure of math problem-solving abilities,
exceeded that of only 6% of other children his age. His true score probably falls
in the percentile rank range of 1 to 19. This performance is Deficient to Low
Average. Finally, his score on the Developmental Test of Visual-Motor
Integration (VMI-4) exceeded the performance of 37% of other children his
Selecting, Administering, Scoring, and Interpreting Assessment Instruments and Techniques 1 75
Table 5.5 Score conversion table
IQ
Percentile rank
Scaled score S
99.99
19
9
99.98
19
9
99.98
19
9
99.97
19
9
99.97
19
9
99.96
19
9
99.95
19
9
99.93
19
9
99.91
19
9
99.89
19
9
99.87
19
9
99.83
19
9
99.79
19
9
99.74
18
9
99.69
18
9
99.62
18
9
99.53
18
9
99
17
9
99
17
9
99
17
9
99
17
9
99
17
9
99
17
9
98
16
9
98
16
9
98
16
9
97
16
9
97
16
9
96
15
9
96
15
9
95
15
8
95
15
8
94
15
8
93
14
8
92
14
8
91
14
8
90
14
8
88
14
8
87
13
7
86
13
7
84
13
7
82
13
7
81
13
7
79
12
7
77
12
7
75
12
6
Stanine
Z-score
T score
NCE
Interpretive range
155
154
153
152
151
150
149
148
147
146
145
144
143
142
141
140
139
138
137
136
135
134
133
132
131
130
129
128
127
126
125
124
123
122
121
120
119
118
117
116
115
114
113
112
111
110
+3.67
87
+3.60
86
+3.53
85
+3.47
85
+3.40
84
+3.33
83
+3.27
83
+3.20
82
+3.13
81
+3.07
81
+3.00
80
+2.93
79
+2.87
79
+2.80
78
+2.73
77
+2.67
77
+2.60
76
+2.53
75
+2.47
75
+2.40
74
+2.33
73
+2.27
73
+2.20
72
+2.13
71
+2.07
71
+2.00
70
+ 1.93
69
+ 1.87
69
+ 1.80
68
+ 1.73
67
+ 1.67
67
+1.60
66
+ 1.53
65
+ 1.47
65
+ 1.40
64
+ 1.33
63
+ 1.27
63
+ 1.20
62
+ 1.13
61
+ 1.07
61
+1.00
60
+0.93
59
+0.87
59
+0.80
58
+0.73
57
+0.67
57
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
99
Very Superior
93
Very Superior
93
Very Superior
93
Very Superior
90
Superior
90
Superior
87
Superior
87
Superior
85
Superior
85
Superior
83
Superior
81
Superior
80
Superior
78
Superior
77
High Average
75
High Average
74
High Average
73
High Average
71
High Average
59
High Average
68
High Average
67
High Average
66
High Average
64
High Average
continued
176 Chapter 5
Table 5.5 continued
IQ
Percentile rank
Scaled score Stanine
Z-score
T score
NCE
Interpretive range
109
108
107
106
105
104
103
102
101
100
99
98
97
96
95
94
93
92
91
90
89
88
87
86
85
84
83
82
81
80
79
78
77
76
75
74
73
72
71
70
69
68
67
66
65
64
63
62
73
70
68
66
63
61
58
55
53
50
47
45
42
39
37
34
32
30
27
25
23
21
19
18
16
14
13
12
10
9
8
7
6
5
5
4
4
3
3
2
2
2
12
12
1
1
1
1
1
10
10
10
10
10
9
9
9
9
9
8
8
8
8
8
7
7
7
7
7
6
6
6
6
6
5
5
5
5
5
4
4
4
4
4
3
3
3
3
3
2
+0.60
+0.53
+0.47
+0.40
+0.33
+0.27
+0.20
+0.13
+0.07
0.00
-0.07
-0.13
-0.20
-0.27
-0.33
-0.40
-0.47
-0.53
-0.60
-0.67
-0.73
-0.80
-0.87
-0.93
-1.00
-1.07
-1.13
-1.20
-1.27
-1.33
-1.40
-1.47
-1.53
-1.60
-1.67
-1.73
-1.80
-1.87
-1.93
-2.00
-2.07
-2.13
-2.20
-2.27
-2.33
-2.40
-2.47
-2.53
56
55
55
54
53
53
52
51
51
50
49
49
48
47
A7
46
45
45
44
43
43
42
41
41
40
39
39
38
37
37
36
35
35
34
33
33
32
31
31
30
29
29
28
27
27
26
25
25
63
Average
61
Average
60
Average
59
Average
57
Average
56
Average
54
Average
53
Average
52
Average
50
Average
48
Average
47
Average
46
Average
44
Average
43
Average
41
Average
40
Average
39
Average
37
Average
36
Average
34
Low Average
33
Low Average
32
Low Average
31
Low Average
29
Low Average
27
Low Average
26
Low Average
25
Low Average
23
Low Average
22
Low Average
20
Borderline
19
Borderline
17
Borderline
15
Borderline
15
Borderline
13
Borderline
13
Borderline
10
Borderline
10
Borderline
7
Borderline
7
Very Deficient
7
Very Deficient
Very Deficient
Very Deficient
Very 1 )efii ient
Very Deficient
Vcr\ 1 >< tii ient
Very 1 )efi< ient
Selecting, Administering, Scoring, and Interpreting Assessment Instruments and Techniques 1 77
Table 5.5 continued
IQ    Percentile rank  Scaled score  Stanine  z-score  T score  NCE  Interpretive range
61    0.47             2                      -2.60    24            Very Deficient
60    0.38             2                      -2.67    23            Very Deficient
59    0.31             2                      -2.73    23            Very Deficient
58    0.26             2                      -2.80    22            Very Deficient
57    0.21                                    -2.87    21            Very Deficient
56    0.17                                    -2.93    21            Very Deficient
55    0.13                                    -3.00    20            Very Deficient
54    0.11                                    -3.07    19            Very Deficient
53    0.09                                    -3.13    19            Very Deficient
52    0.07                                    -3.20    18            Very Deficient
51    0.05                                    -3.27    17            Very Deficient
50    0.04                                    -3.33    17            Very Deficient
49    0.03                                    -3.40    16            Very Deficient
48    0.03                                    -3.47    15            Very Deficient
47    0.02                                    -3.53    15            Very Deficient
46    0.02                                    -3.60    14            Very Deficient
45    0.01                                    -3.67    13            Very Deficient
Note: IQ means deviation IQ, or standard score (SS) (M = 100; SD = 15); %ile rank is a Percentile Rank (P); scaled score means (M = 10; SD =
3); stanine means (M = 5; SD = 2); z-score means (M = 0; SD = 1); T score means (M = 50; SD = 10); NCE means normal-curve equivalent (M =
50; SD = 21.06).
age. His true score probably falls in the percentile rank range of 16 to 63. This
performance is Low Average to Average.
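The conversions tabulated in Table 5.5 follow from the means and standard deviations listed in the table note. The following is a minimal sketch, not the procedure used to build the table: the function name convert_deviation_iq is hypothetical, percentile ranks assume a normal distribution, and stanines are left out. Rounded results can differ by a point from a published table.

```python
import math


def convert_deviation_iq(iq, mean=100.0, sd=15.0):
    """Convert a deviation IQ to other standard scores (illustrative helper)."""
    z = (iq - mean) / sd  # z-score: M = 0, SD = 1
    # Percentile rank from the normal cumulative distribution (assumption).
    percentile = 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return {
        "z": round(z, 2),
        "T": round(50 + 10 * z),        # T score: M = 50, SD = 10
        "scaled": round(10 + 3 * z),    # scaled score: M = 10, SD = 3
        "NCE": round(50 + 21.06 * z),   # normal-curve equivalent: M = 50, SD = 21.06
        "percentile": round(percentile),
    }


print(convert_deviation_iq(109))
```

For an IQ of 109 this sketch gives z = 0.6, T = 56, scaled score = 12, NCE = 63, and a percentile rank of 73, which agrees with the corresponding row of Table 5.5.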
It is important to note that the interpretations offered above are statistical inter-
pretations. Statistical interpretation gives meaning and context to quantitative scores.
Another type of interpretation that is equally valuable is called qualitative or contex-
tual interpretation. In this type of interpretation, the professional counselor describes
what tasks the client can and cannot do, or provides rich content descriptions to help
the reader understand the nature of client developmental and clinical issues. The
quality of contextual interpretations is determined primarily by the level of expertise
and the theoretical or practical orientations of the professional counselor. For exam-
ple, professional counselors who are expert in describing the characteristics of per-
sonality disorders and the behaviors observed in a client with such a condition may
be able to provide a rich contextual description of the client's current circumstances
and how the personality disorder is expressed and affects the client.
Often statistical and contextual interpretations are combined in evaluation re-
ports. For example, when interpreting the results of a WAIS-III protocol, a more sta-
tistically oriented interpretation may be appropriate, supplemented by contextual
comments, as in the following example:
Intellectually, Jaime currently performs in the Average to High Average range of
general cognitive ability (Full Scale percentile rank = 82; percentile rank range =
75-89), as measured on the Wechsler Adult Intelligence Scale-Third Edition
(WAIS-III). Her Verbal Comprehension skills were measured to lie in the Average
to High Average range (percentile rank = 82; percentile rank range = 70-90),
commensurate with her Perceptual Organizational skills, which also fell in the
Average to High Average range (percentile rank = 73; percentile rank range =
53-86). Because of these results, Jaime's Full Scale IQ is the best choice of anchor
scores to represent her educational and intellectual potential and to determine
strengths and weaknesses.
On the Verbal Comprehension subtests from the WAIS-III, Jaime displayed
an intrapersonal strength on a task requiring social comprehension and problem
solving (Comprehension subtest percentile rank = 99; Very Superior). No intra-
personal weaknesses were noted as her profile of verbal cognitive performance
was well balanced. She performed in an Average to High Average capacity on
tasks requiring verbal abstract reasoning (Similarities subtest percentile rank =
75), and general information (Information subtest percentile rank = 63). Her
word knowledge and facility performance (Vocabulary subtest) exceeded 95% of
age-mates, falling in the Superior range of performance.
On the Perceptual-Organizational subtests of the WAIS-III, Jaime dis-
played no significant strengths, but did display a significant intrapersonal
weakness on a task requiring nonverbal spatial reasoning (Block Design sub-
test percentile rank = 25; Low Average to Average). Nonverbal spatial reason-
ing is usually associated with math problem solving and advanced mathemat-
ical reasoning, an area that Jaime has claimed as a challenging academic subject
since her elementary years. Jaime's performance on a task of logical reasoning
(Matrix Reasoning subtest percentile rank = 95) fell into the High Average to
Very Superior range, while her ability to sequence socially meaningful stimuli
(Picture Arrangement percentile rank = 75) and to attend to visually detailed
missing elements (Picture Completion percentile rank = 75) both revealed an
Average to High Average capacity.
Jaime's Working Memory Index score from the WAIS-III fell into the
Average to High Average range (percentile rank = 73; percentile rank range =
55-84), commensurate with current ability estimates. Her performance on the
Letter-Number Sequencing subtest (percentile rank = 63; Average to High
Average range) was slightly less developed than her performance on the Digit
Span subtest, which fell in the High Average to Very Superior range (percentile
rank = 95). Both areas were better developed than her Arithmetic subtest (per-
centile rank = 37; Average), a traditionally poor area of achievement for Jaime.
Overall, little distractibility in the auditory channel appears to exist.
Jaime's Processing Speed Index score from the WAIS-III fell in the
Borderline to Average range (percentile rank = 18, percentile rank range =
8-42), very significantly below current ability estimates, given Jaime's Average
to High Average intellectual capabilities — a 32-point discrepancy. Jaime's psy-
chomotor speed and short-term visual memory (Coding subtest) and her speed
in processing visual information (Symbol Search subtest, which does not have
a memory component) both fell into the Borderline to Average range. These
results are important, because distractibility frequently shows up in a client's
cognitive profile as a short-term memory deficiency. As will be seen in the WJ-
III fluency testing that follows below, Jaime displays a processing speed defi-
ciency. In addition, these and previous assessment results documented an intra-
personal weakness in short-term visual memory.
As a second example, a more contextual description can sometimes help those
who will work with the client better understand the client's current situation:
Because a question arose regarding whether Ben possessed significant prob-
lems with inattention, clinical and behavioral assessments focused on the pres-
ence of age- and ability-inappropriate levels of distractibility, a primary symp-
tom of an Attention-Deficit/Hyperactivity Disorder (AD/HD). Miss Wallace
(2nd-grade teacher), and Mrs. Davis (reading teacher), educators who have in-
structed Ben and who are well acquainted with his academic and behavioral
performance, completed the Conners' Teacher Rating Scale-Revised, Long
Version (CTRS-R:L). Miss Wallace also completed the Achenbach System of
Empirically Based Assessment (ASEBA) Teacher Rating Form (TRF). Mr. and
Mrs. Smith completed the Conners' Parent Rating Scale-Revised, Long Version
(CPRS-R:L). Mr. and Mrs. Smith and Mrs. Davis reported substantial concerns
related to inattention and disorganization; Miss Wallace did not. All were
in agreement that Ben displayed the following behaviors associated with inat-
tention to a significant degree: forgets things he has already learned and has
difficulty engaging in tasks requiring sustained mental effort. In addition, Mr.
and Mrs. Smith and Mrs. Davis agreed that Ben frequently fails to give close
attention to details and makes careless mistakes, has difficulty organizing tasks
and activities, and is easily distracted by extraneous stimuli. Mrs. Smith and
Mrs. Davis also agreed that Benjamin frequently does not seem to listen to
what is said and has difficulty sustaining attention on tasks. Finally, Mr. and
Mrs. Smith agreed that Ben does not follow through on instructions, fails to
finish assigned work, and loses things necessary for tasks and activities. Each
of these behaviors is a criterion for diagnosis of AD/HD — Predominantly
Inattentive Type, and Ben fulfills the diagnostic criteria for this condition. All
other behavioral and personality characteristics were reported to be within nor-
mal limits, although some concern over social relationships and development
was expressed by Miss Wallace.
Such descriptions not only provide contextual understanding, but can aid treat-
ment planning and outcomes evaluation.
Think About It 5.2 Why is it important to demonstrate a client's scores
as a range instead of as an individual score? How would you explain this
process to a client?
CRITERION-REFERENCED INTERPRETATION
As mentioned previously, norm-referenced scores compare an examinee's performance
to other individuals in the norm group who share similar characteristics.
Criterion-referenced scores, on the other hand, compare the examinee's scores against
an absolute standard (i.e., criterion) of performance. In other words, this form of
testing measures levels of mastery. As such, performance on criterion-referenced testing
is often helpful in making important instructional decisions regarding the mastery
of specified curriculum goals and objectives or diagnostic decisions when a certain
number of criteria or level of severity is required. Criterion-referenced
interpretation is often divided into two categories: single-skill scores and multiple-skill
scores.
Single-Skill Scores
Single-skill scores can be obtained for almost any target measured against an estab-
lished criterion. However, most single-skill targets are related to academic, occupa-
tional, or social domains. For example, an educator may score a math problem
worked by a student. A vocational rehabilitation counselor may evaluate the feeding
ability of an individual who recently experienced a stroke. An observer may note the
number of adult instructions with which a referred child complies. Scoring can be
dichotomous (e.g., pass-fail, right-wrong) or continuous (i.e., allowing partial credit
for the item). In this case, each point on the continuum (e.g., never, seldom, often,
always) would have to be carefully defined. In single-skill probes, raw scores are often
transferred into a ratio. For example, an examinee may correctly complete 40 of 50
items on a test. Therefore, the score would be represented as 40/50.
Multiple-Skill Scores
Many activities are not composed of single-skill units but contain multiple skills.
For example, measures of oral reading involve decoding of words, fluency, knowledge
of grammatical rules, and, often, comprehension of material read. Additionally,
educators often obtain answers to several questions on a mathematics exam com-
posed of varying calculations (e.g., addition, subtraction, multiplication, division)
rather than an answer to one problem. Multiple-skill scores are often divided into
three areas of reporting: accuracy, retention, and verbal labels for percentages (Salvia &
Ysseldyke, 2004).
Accuracy
An accuracy percentage is obtained by dividing the number of correct responses pro-
vided by the examinee by the total number of items and then multiplying by 100.
For example, a student who correctly responded to 9 out of 10 items on a test would
receive a percent correct score of 90 (9/10 x 100 = 90%). Although educators often
convert raw scores into this format to report student outcomes, remember that such
scores are not equivalent in the same way as standard scores. A score of 90% on a
mathematics test is not the same as a score of 90% on a spelling test, because the
subject content is completely different and the items are presented in different for-
mats as well. Note that the score of 90% does not allow for comparison of scores.
The 90% could be the highest or lowest score in a distribution, and without access
to the distribution of scores, further comparative analysis is hindered.
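The accuracy calculation described above can be sketched in a few lines of Python. The function name accuracy_percent and its input checks are illustrative assumptions, not part of any scoring standard.

```python
def accuracy_percent(num_correct, num_items):
    """Percent correct: correct responses divided by total items, times 100."""
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must be between 0 and num_items")
    return 100 * num_correct / num_items


# The example from the text: 9 of 10 items answered correctly.
print(accuracy_percent(9, 10))
```

As the text cautions, the resulting percentage describes performance on one set of items only; it says nothing about standing relative to other examinees.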
Retention
Retention refers to the percentage of information previously learned that is remem-
bered at a later date. It also has been referred to as recall, memory, or maintenance.
Retention is calculated by dividing the number of items remembered at the later
date by the total number of items initially learned and then multiplying by 100. For example, an
examinee may have learned 50 new words and recalled 40 of them two weeks later.
This examinee's retention would be 80% (40/50 x 100 = 80%).
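Retention uses the same arithmetic as accuracy but with a different denominator: the items originally learned rather than the items presented. A minimal sketch follows; the function name retention_percent is a hypothetical label.

```python
def retention_percent(num_recalled, num_learned):
    """Percent of initially learned items that are recalled at a later date."""
    if num_learned <= 0:
        raise ValueError("num_learned must be positive")
    if not 0 <= num_recalled <= num_learned:
        raise ValueError("num_recalled must be between 0 and num_learned")
    return 100 * num_recalled / num_learned


# The example from the text: 50 words learned, 40 recalled two weeks later.
print(retention_percent(40, 50))
```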
Percentages expressed as verbal labels
Sometimes percentages are expressed as labels. Two methods in which percentages
are expressed as labels include level of performance and grades. Level of performance
is often divided into two levels: mastery level and instructional level (Salvia &
Ysseldyke, 2004). In many educational contexts, mastery is set at 90% or above, and
nonmastery is set at any percentage below 90%.
Instructional level is further divided into frustrational, instructional, and inde-
pendent levels of performance. Frustrational-level performance is usually defined as
less than 85% correct. Instructional-level performance is defined as 85-95% correct.
Independent-level performance is defined as above 95% correct.
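The cutoffs above translate directly into a lookup. The sketch below assumes the commonly cited values given in the text (mastery at 90% or above; instructional level at 85-95%); the function names are illustrative, and real programs may set different cutoffs.

```python
def mastery_label(percent_correct, mastery_cutoff=90.0):
    """Return 'mastery' at or above the cutoff, otherwise 'nonmastery'."""
    return "mastery" if percent_correct >= mastery_cutoff else "nonmastery"


def instructional_level(percent_correct):
    """Map percent correct to the three levels described in the text."""
    if percent_correct < 85:
        return "frustrational"   # less than 85% correct
    if percent_correct <= 95:
        return "instructional"   # 85-95% correct
    return "independent"         # above 95% correct


for score in (80, 85, 90, 95, 96):
    print(score, mastery_label(score), instructional_level(score))
```

Note that the two labeling schemes overlap: a score of 90% counts as mastery under the first scheme but is still instructional-level under the second.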
Grades have also been used as verbal labels for percentages. For example, many
college professors use a grading scale in which anyone scoring 90-100% correct
would receive a grade of an "A," anyone scoring 80-89% correct would receive a
grade of a "B," and anyone scoring 59% or below would receive a grade of "F."
SOURCES OF INFORMATION ABOUT TESTS
Selection of an assessment tool is an important clinical decision and a vital part of
the counseling process. The information gained from the assessment itself often
serves as the foundation of counseling as it gives the professional counselor much
necessary information that will aid in determining therapeutic goals and interven-
tions and which will be a great asset in measuring progress and outcomes.
Professional counselors must carefully choose instruments that are designed to
address the referral questions and which are appropriate to their levels of education
and training. However, the number of assessments available today can prove overwhelming,
and many professional counselors find themselves relying on less than appropriate
tools simply out of habit or lack of information. This kind of choice is not
necessary given the resources available to help make an informed decision regarding
assessment selection. Although there is no one source that contains every assessment
tool developed, there are a variety of sources that professional counselors should
search on a regular basis. To help in this process, several sources are listed below (and
in Table 5.6) that provide assistance in selecting and evaluating the most suitable
and technically sound tests.
Table 5.6 Evaluation of sources of information about tests
Source type: Test manuals
Advantages: Usually contain much information about theoretical basis, item development, reliability, validity, standardization, and norms. Are often the best single source.
Disadvantages: Test authors vary in comprehensiveness and psychometric sophistication. External empirical validation of results is not available for years after publication of the manual.
Source type: Publisher catalogs
Advantages: Provide current information on tests, even on new tests not found elsewhere. Give costs and ordering information.
Disadvantages: Information may be biased. Necessary basic information is often lacking.
Source type: Test review volumes
Advantages: Offer critical reviews by experts, with evaluation of weaknesses and strengths of each test.
Disadvantages: Information is often dated. Reviews often do not include a thorough discussion of purposes of test.
Source type: Journals
Advantages: Give research on issues in testing. Often show application of test. Contain validity and reliability studies.
Disadvantages: Information is often theoretical and technical and may be dated due to publication backlog.
Source type: Textbooks
Advantages: Give in-depth information on certain tests. Provide an overview of tests in general.
Disadvantages: Information may be biased, dated, oversimplified, or technical.
Source type: Electronic sources
Advantages: Give easy access to current information. Are easy to search by subject matter. Provide links to other sources.
Disadvantages: Information may be biased or incongruent in presentation. Access to information may be difficult for some.
Think About It 5.3 When deciding which test to administer to a client,
why would it be important to thoroughly research the test using several of
the resources described in Table 5.6?
Published Resources
One of the most basic and essential assessment resources is the Mental Measurements
Yearbook (MMY). First published in 1938 by Oscar K. Buros, this series of yearbooks
is currently published by the Buros Institute of Mental Measurements of the
University of Nebraska-Lincoln. This series of yearbooks contains thorough critiques
of many commercially available instruments. Each MMY includes descriptive
information about each test, including the purpose of the instrument, for whom the
instrument is appropriate, cost, and the publisher. Additionally, the yearbooks con-
tain critical reviews of each instrument, written by knowledgeable professionals.
These reviews contain the strengths and weaknesses of the instrument.
Another resource published by the Buros Institute is Tests in Print, which is especially
useful for quickly identifying which instruments are most appropriate for a
particular content area or for a particular use. Once a test is located, the professional
counselor can then cross-reference it with the more thorough descriptions found in
the MMY. For information relevant to critiquing tests, see Table 5.7.
Table 5.7 What to include in a test critique
1. Exact name of the instrument (or technique)
2. Author (person, organization, or company)
3. Publisher
4. Copyright date(s)
a. Date first published
b. Date(s) of revision(s)
c. Date of version being reviewed
5. Purpose and recommended use
6. Appropriate respondent characteristics (e.g., age, grade, reading level, mental abilities,
physical characteristics)
7. Available forms
8. Current cost information
9. Content
a. Categories assessed or measured
b. Types of items used
c. Type(s) of responses required
10. Administration procedures and requirements
11. Time factors and considerations
12. Administrator qualifications
13. Interpreter or user qualifications
14. Scoring options and procedures
15. Type(s) of scores derived or reported
16. Normative data
17. Validity information
18. Reliability information
19. Statistical information other than validity or reliability
20. Multicultural issues
21. Evaluation
a. Limitations for use in counseling or student development
b. Advantages for use in counseling or student development
PRO-ED
While the Buros Institute has several prominent resources, PRO-ED, Inc., has use-
ful sources for locating and evaluating tests. One, Tests: A Comprehensive Reference
for Assessments in Psychology, Education, and Business (PRO-ED, 2003), contains
more than 3,000 published tests. Each test listed includes a brief description, a state-
ment of its purpose, and information regarding cost, scoring, and the publisher.
Tests, though not reviewed, are easily accessed through the classifications and cate-
gories used to organize the resource.
For reviews and evaluations of tests, PRO-ED provides Test Critiques, a series of
volumes containing test critiques written by measurement and assessment experts.
Each critique includes emphasis on information that will aid the professional counselor
using the test, such as guidelines for administration, scoring, and interpretation.
Especially helpful are the explanations of technical terms that will make the information
more understandable, even to those with little testing experience.
Publisher Catalogs
Some test publishers will send catalogs upon request; others are available online.
Catalogs can be especially useful for locating new tests and recent editions of previ-
ously published tests — information that sometimes cannot be found in the sources
discussed above. These catalogs provide information regarding uses of the test, cost,
administration time, and other brief descriptions.
Professional Journals and Textbooks
Other sources of information include professional journals and some textbooks.
Journal articles often contain test reviews and may discuss the nature and use of particular
tests. These articles can be most easily located through electronic databases.
The professional counselor will find many journals very helpful in finding current
research on commonly used assessments, including Measurement and
Evaluation in Counseling and Development, Educational and Psychological
Measurement, Psychological Reports, Psychological Assessment, and Assessment for
Effective Intervention. Recently, desk references of different types of tests (e.g.,
Achievement Test Desk Reference [Flanagan, Ortiz, Alfonso, & Mascolo, 2002],
Intelligence Test Desk Reference [McGrew & Flanagan, 1998]) have been published to
assist examiners in selecting appropriate instruments. In addition, some textbooks
contain appendices with lists of widely used testing instruments. However, texts such
as this one mainly supply a brief overview of available instruments.
Several resources exist to help the professional counselor identify and locate
appropriate assessments in an efficient manner. When searching for an instrument
that will test a specific content area, Tests or Tests in Print will provide quick information
that can then be further explored in the Mental Measurements Yearbook.
When additional information is needed to help in understanding the specific mechanics
of a test, Test Critiques may prove beneficial. Both of the latter resources
provide sufficient information to weed out tests that are inappropriate or which
have obvious weaknesses. Although all of the above publications provide comprehensive
coverage of available tests, catalogs, journals, and textbooks can also prove
useful.
Electronic Resources
Changes in technology have greatly improved access to possible assessment instru-
ments. The Buros Institute of Mental Measurements provides test information
through an electronic source, in addition to its printed version. Various search engines
are available that allow viewing of a large amount of information on tests and
testing. "Test Reviews Online" is a web-based service of the Buros Institute of Mental
Measurements, available at www.unl.edu/buros, and makes test reviews available to
individual users exactly as they appear in the Ninth through Fifteenth Mental
Measurements Yearbook series. In addition, monthly updates are provided from the
institute's latest test review database. For a small fee, users may download reviews for
over 2,000 tests that include specifics on test purpose, population, publication date,
administration time, and descriptive test evaluations.
Another service of the Buros Institute is Tests in Print. Tests in Print (TIP) can
be accessed through the above website and serves as a comprehensive bibliography to
all known commercially available tests that are currently in print in the English lan-
guage. Now in its sixth edition, TIP provides vital information to users, including
test purpose, test publisher, in-print status, price, test acronym, intended test population,
administration times, publication date(s), and test author(s). TIP also guides
readers to critical, candid test reviews published in the Mental Measurements Yearbook
series.
The Educational Testing Service (ETS) offers an electronic source for its test col-
lection as well. The ETS Test Collection includes an extensive library of 20,000 tests
and other measurement devices from the early 1900s to the present. The collection
is advertised as the largest in the world and was established to make information on
standardized tests and research instruments available to researchers, graduate stu-
dents, and teachers. The ETS database can be accessed at www.ets.org/testcoll. From
there, one can search by topic for instruments with each result providing descriptive
information. Orders can also be placed at this site.
PRO-ED has a useful source for locating and evaluating tests at
www.proedinc.com/store/index.php. Through PRO-ED's online catalog products,
available assessments can be located by a topic, title, or author name search. This
search will give results with brief descriptions of each test, including price of test,
materials included in each testing kit, and an option to place an order.
Finally, another valuable electronic source can be found at http://aace.ncat.edu.
This website is the home page of the Association for Assessment in Counseling and
Education, a division of the American Counseling Association. Through AACE's
"resources" option, professional counselors can find invaluable links to the ERIC test
locator, some test reviews, assessment journals, and key documents such as Ethics in
Assessment, Standards for Qualifications of Test Users, and Rights and Responsibilities of
Test Takers: Guidelines and Expectations.
COMMON ERRORS
Regardless of level of training or expertise, professional counselors are human and
are therefore susceptible to committing errors during the testing process. In inter-
preting assessment instruments, professional counselors sometimes may commit in-
ference and attribution errors. Although the assessments provide basic information
about the client, the professional counselor must then sort the information and for-
mulate overall conclusions and implications. While much is known about how to
develop and evaluate psychological tests, much less is known about how to use the
information generated. By familiarizing the professional counselor with common
errors, it is hoped that these errors will be minimized in test interpretation and de-
cision making.
The tendency to seek confirmatory evidence (confirmatory bias) is one of the most
common mistakes in test interpretation. Humans are prone to self-confirmation and
often search for confirmatory information. In other words, one often believes what
one wants to believe. Research supports this claim and shows that the human ten-
dency is to search out and attend only to evidence that conforms to one's hypothesis.
Though professional counselors have been trained to attend to all information in clin-
ical decision making, they are just as prone to attend to narrow paths of evidence.
Because of this tendency, professional counselors often conclude what they already
suspect. This process of searching for confirmation can lead to inaccurate conclusions
and may lead to an increased confidence in one's conclusions and abilities. Some evi-
dence suggests that beginning counselors are particularly subject to confirmatory bias,
thinking they understand the problem before they really do and, thus, working on
the wrong problem.
A second error commonly made is the tendency to see patterns where no patterns
actually exist. Because humans strive for predictability in life, we are prone to attrib-
ute order to ambiguous information. This tendency can have implications in test in-
terpretation, as themes and patterns may be said to exist where none have actually
emerged.
Finally, the use of preconceived biases is a form of error commonly found in test
interpretation. Primarily, there is a tendency to overpathologize clients. Professional
counselors are prone to search for information indicative of pathology and then in-
terpret this information in a way that indicates more pathology than may actually
exist. This tendency is exaggerated when the client is from a lower social class, non-
white, disabled, or female.
Professional counselors must be aware of these common errors throughout the
assessment process, as inaccurate decisions regarding clients can be easily made. The
use of quality information provided in psychological assessments is not enough to
remedy the errors involved in the interpretation process. Given these concerns, the
following recommendations are provided:
■ Do not confuse the ability to explain current data with the ability to predict fu-
ture performance.
■ Continue to assess skills over time instead of relying on one evaluation of the ex-
aminee's performance.
■ Collect data from multiple sources. Do not rely solely on self-report or observa-
tion of one informant.
■ Consider all other possibilities, and rule out alternative hypotheses.
■ Choose the highest quality and most appropriate assessment instruments.
■ Recognize personal biases, especially those pertaining to age, gender, class, and
ethnicity.
■ Be aware of the norms used during test construction, as well as the differences be-
tween the client and the norm group used.
As humans, professional counselors must continually strive to overcome any po-
tential biases or attribution errors that may affect their decisions regarding client per-
formance. While psychological tests improve the accuracy of decision making, care
must still be taken in their interpretation and application.
SUMMARY/CONCLUSION
Testing involves administering questions to an individual or individuals in order to
obtain a score. Assessment differs from testing in that it includes such processes as in-
terviewing, records review, observations, rating scales, standardized testing, and
many other provisions that create a larger process. When administering tests, take
into account the test qualifications specified by the test's manual. Also, professional
counselors should ensure that examinees are prepared and familiar with the proce-
dures and process of testing before beginning testing.
The testing environment is a very important aspect to consider when one wishes
to obtain accurate results. One should always try to strictly follow time specifica-
tions, directions, registration and identification procedures, and any other proce-
dural guidelines laid out by the test's manual. If any deviation from the specified pro-
cedures occurs during testing, it should be thoroughly documented by the examiner.
Many factors can affect test scores. First, the examiner-examinee relationship
should be one that is neutral. Reinforcement or negativity during testing can greatly
affect scores. Professional counselors should always take into account individual dif-
ferences when administering tests and interpreting scores. Furthermore, expectancy
of the examiner can affect test scores.
Scoring a test can allow quantification of scores and aid in interpretation.
Several formats for scoring tests exist. Tests can be self-scored, scored by others, or
scored by computers. While computer scoring is the most accurate form of scoring,
computers are often incapable of making the judgments required for test interpreta-
tion.
Norm-referenced interpretation involves comparing the obtained score of the
examinee to the norm group. These scores can be expressed in developmental equiv-
alents such as age equivalents, which compare the examinee to others of the same
age, and grade equivalents, which compare the examinee to others of the same grade
level. There are many problems with making comparisons like those made in devel-
opmental equivalents, and interpretation should be done with care. In order to in-
terpret developmental equivalents, the test interpreter compares the examinee's
chronological and mental age to obtain a developmental quotient.
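The summary does not restate how the quotient is computed. One commonly used ratio form, offered here as an assumption rather than as this chapter's definition, divides the developmental (mental) age by the chronological age and multiplies by 100. The function name developmental_quotient is hypothetical, and ages are taken in months to avoid fractional years.

```python
def developmental_quotient(developmental_age_months, chronological_age_months):
    """Ratio-style quotient: (developmental age / chronological age) x 100.

    Assumed formula for illustration; both ages must use the same unit.
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    if developmental_age_months < 0:
        raise ValueError("developmental age cannot be negative")
    return 100 * developmental_age_months / chronological_age_months


# A 10-year-old (120 months) performing like a typical 8-year-old (96 months).
print(developmental_quotient(96, 120))
```

A quotient of 100 under this form would indicate that developmental and chronological age coincide; as the summary notes, such comparisons should be interpreted with care.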
Scores on the same test for several different examinees of different ages can be
compared by using scores of relative standing. Common types of scores of relative
standing are standard scores, which have a designated mean and standard deviation.
T scores, z-scores, deviation IQs, normal-curve equivalents, and stanines are com-
mon types of standard scores.
Criterion-referenced scores compare the examinee's scores against an absolute
standard (i.e., criterion) of performance. Types of criterion-referenced scores include
single-skill scores, which assess a solitary academic, occupational, or social domain,
and multiple-skill scores, which measure any area that is composed of several skills.
Multiple-skill scores can be reported by expressing accuracy, retention, verbal labels
for percentages, and instructional-level scores.
There are many available sources of information on tests, test administration,
test scoring, and psychometric properties of tests. The chapter covers in detail the
many published sources of this information and electronic information.
Lastly, professional counselors should take into account sources of error in test-
ing and assessment. First, although professional counselors are trained to take all
sources of information into account, they often make mistakes. Such mistakes in-
clude overpathologizing clients, seeking to confirm their hypotheses with more evi-
dence, and recognizing patterns that may not actually be present. The chapter in-
cludes a series of steps and precautions to avoid this type and other types of error.
KEY TERMS
developmental equivalent
deviation IQ
normal-curve equivalent
percentage
percentile rank
raw score
scores of relative standing
standardization sample
standard score
stanine
test score
T score
z-score
CHAPTER 6
How Tests Are Constructed
by Carl J. Sheperis, Carey Davis, and R. Anthony Doggett
This chapter provides readers with preliminary information related to the con-
struction and evaluation of psychological and educational tests, including: the
purposes of tests; observables; item generation (multiple-choice, essay, true-
false); technical analysis (item difficulty, item discrimination); and norms. The chap-
ter also addresses the process of building quality tests that are aimed toward promot-
ing valid score interpretation, and how to evaluate the use of a specific test for a
specific purpose. Finally, the chapter reviews the fundamentals of test development,
how to choose among already existing tests for a specific purpose, how to use the re-
sults of standardized tests to help make decisions about individuals, and how to iden-
tify flaws in assessment instruments and procedures.
Many of you reading this book may be highlighting or underlining certain
words or phrases to help yourself remember key information you might encounter
on the next exam. As you study for that exam, you might also want to ask the in-
structor some questions to help yourself prepare. First, you might ask the purpose of
the test (e.g., the objectives of the test, the way it will be scored, and how the results
will affect your final grade). Next, you might ask what content the test will cover
(e.g., the chapters to be covered on the test and whether the questions will require
memorization of facts or application of knowledge). Finally, you might ask what the
format of the test items will be (e.g., multiple-choice, short-answer, essay).
When instructors are constructing a test, they, consciously or unconsciously,
will be asking and answering similar questions: "What is the purpose of the test?"
"How do I assess the content to be covered by the test?" and "How should I write
the items on the test?" Similarly, identifying the purpose of a test, observables related
to the test, item generation procedures, and test format are critical components of any
test construction process, whether the test is a simple one to be used in an elemen-
tary school classroom, an examination for a graduate-level course, or a published
psychological assessment instrument. However, appropriate test construction does
not stop when items are developed. Development of a quality test requires appro-
priate statistical analyses to determine item difficulty and item discrimination. Some
tests, such as published psychological instruments, also use norms to help test users
interpret test results. Each of the above concepts related to test development is dis-
cussed throughout this chapter. The brief introduction to this material given in
this chapter, however, will not provide adequate guidance to become a seasoned
test developer; those interested in learning more about test construction should see
Crocker and Algina (1986).
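As a preview of the two item statistics named above, the following minimal sketch uses their conventional classical definitions: item difficulty as the proportion of examinees who answer an item correctly, and item discrimination as the difference in that proportion between high-scoring and low-scoring examinees. The function names, the 27% grouping fraction, and the data are illustrative assumptions, not procedures taken from this chapter.

```python
def item_difficulty(item_responses):
    """Proportion of examinees answering the item correctly (1 = correct, 0 = not)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(item_responses, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct on one item.

    Examinees are ranked by total test score; the top and bottom `fraction`
    form the comparison groups (27% is a common convention, assumed here).
    """
    n = len(item_responses)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_responses[i] for i in upper) / k
    p_lower = sum(item_responses[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten examinees, one item, plus each examinee's total score.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [18, 17, 16, 15, 14, 9, 8, 7, 6, 5]
print(item_difficulty(item))              # 0.5
print(item_discrimination(item, totals))  # 1.0
```

In this made-up example, half of the examinees answered the item correctly, and every examinee in the top group answered correctly while no one in the bottom group did.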
PURPOSE OF THE TEST
The first step in test construction is to define the general purpose of the test. The in-
structor probably defined the general purpose of your next test on your syllabus (e.g.,
the test may assess class members' knowledge of the information from Chapters 1
through 6 of this textbook and be worth 40% of your final grade in the course).
Although the general purpose of a published test must be more formally defined
than that of your next classroom exam, the basic principles are the same. Test devel-
opment addresses the population taking the instrument (i.e., the members of the
class) and the content of the test (i.e., knowledge of Chapters 1 through 6).
Although course-related tests provide a very basic example of test construction,
for a standardized test, the content of the test and the theory on which the test is
based may be considerably more complex. There are many questions related to test
purpose that the instructor does not necessarily need to consider when writing a
course-related test — questions that are, however, crucial in constructing many other
types of standardized tests. Test developers must consider such issues as whether a
test will be norm referenced or criterion referenced, what objectives will be meas-
ured, how items and scores will be scaled, and what approach to test construction
will be used. Cohen and Swerdlik (1999, pp. 216-218) suggested that test develop-
ers need to consider at least the following 14 questions prior to developing a test:
1. What is the test designed to measure?
2. What is the objective of the test?
3. Is there a need for this test?
4. Who will use this test?
5. Who will take this test?
6. What content will the test cover?
7. How will the test be administered?
8. What is the ideal format of the test?
9. Should more than one form of the test be developed?
10. What special training will be required of test users for administering or inter-
preting the test?
11. What types of responses will be required by test takers?
12. Who benefits from the results of this test?
13. Is there any potential for harm from administration of this test?
14. How will meaning be attributed to scores on this test?
In addition, the question "How does the test address multicultural/diverse popula-
tions?" must be asked.
Examinees
For several reasons, it is important to define who will be in the normative sample
when constructing a test. First, the age range of the test takers will be a factor in de-
termining the content and how that content will be assessed. Also, the reading abil-
ity of the test takers will affect the way the items are written and whether the test
will be presented in written or oral form. Additionally, the cultural backgrounds of
examinees may influence the items that are included on the test and the way items
are presented. Finally, it is important to identify who needs to take the test and/or
who would want to take it (Cohen & Swerdlik, 1999).
Goals and Theory
The goals of any test are inherently based on a theory. For example, a typical class-
room test is probably based on the theory that if the examinee is able to answer a
certain percentage of questions correctly, the examinee is competent in knowledge of
the course content. In this case, knowledge of course content is theoretically related
to test performance. Standardized tests are often more complex. For example, test developers
writing an intelligence test would first have to choose a theory of intelligence
on which to base the instrument. Likewise, test developers writing a personality test
would have to define the aspects of personality the test would purport to measure.
The theory on which a test is based links the content of the test to the constructs,
characteristics, or attributes that the test is designed to measure.
Norm Referenced or Criterion Referenced
Once the theory that underlies the purpose of the test has been clarified, the next step
in the test construction process is to decide whether a test should be norm referenced
or criterion referenced. A norm-referenced test is one in which an individual's score is
interpreted by comparing it with other individuals' scores (i.e., a normative sample);
a criterion-referenced test is one in which an individual's score is interpreted in terms of
a predetermined criterion of demonstrated skills (i.e., objectives) (Mehrens &
Lehmann, 1991). A test developer's decision about whether a test should be norm ref-
erenced or criterion referenced must be based on the purpose or goal of the test
(Hopkins, 1996). For example, if a test is designed to assist employers in choosing
from a large pool of potential employees, its goal should be to make comparisons
among the candidates; therefore, a norm-referenced test would be appropriate. On
the other hand, if the purpose of a test is to help a teacher determine whether individ-
ual students have mastered certain instructional objectives in order to identify the ones
who need additional tutoring in specific areas, a criterion-referenced test would be
beneficial because it would yield information about the areas in which the students
needed help instead of just comparing the students to one another (as a norm-refer-
enced test would do). There are times when a test may be both norm referenced and
criterion referenced. When you take your next test, your instructor will probably give
you a grade based on a predetermined criterion, such as the number of questions you
must answer correctly in order to pass the test. Such a grade would indicate that the
test is criterion referenced. However, if your instructor gives you information
about the class average on the test, enabling you to compare your score to the scores
of your classmates, the test could become not only a criterion-referenced test but also
a very simple norm-referenced test.
Objectives
Test developers who write criterion-referenced tests must carefully consider objec-
tives when writing their tests. The terms objectives and goals may easily be confused,
but in this discussion, the objectives refer specifically to instructional objectives meas-
ured by criterion-referenced tests, whereas goals have a broader reference, applying to
many types of tests. For example, when instructors write a class test (which is very
likely to be an informal criterion-referenced test), they look at the objectives listed on
the syllabus and write the test so that it measures those objectives; the goals of the
test are much broader — primarily to determine whether students have mastered
course content well enough to pass the course. When considering the objectives to
be tested, test developers must take several factors into account. First, the specificity
of the objectives will affect the way the test items are written (Hopkins, 1996). Also,
Hopkins contended that tests that measure educational objectives must define these
objectives in terms of "Bloom's taxonomy," which categorizes objectives into six
hierarchical levels: knowledge, comprehension, application, analysis, synthesis, and
evaluation. Objectives are important to consider in criterion-referenced tests, but not
all tests measure objectives. For example, a personality test does not measure whether
an individual has attained mastery of a certain personality type; instead, it measures
a person's personality type. Many norm-referenced tests do not measure whether in-
dividuals meet certain objectives.
Scaling
Another issue that test developers must consider is scaling, which is "the process by
which a measuring device is designed and calibrated, and the way numbers (or other
indices) — scale values — are assigned to different amounts of the trait, attribute, or
characteristic being measured" (Cohen & Swerdlik, 1999, p. 219). In other words,
scaling is basically attaching numbers to the construct that the test is theorized to
measure. There are cases in which scaling is fairly simple. On your next test, each
question will probably be assigned a point value, and your score will reflect the num-
ber of questions you answer correctly. The example of your next test represents a
summative scale, in which correct responses are added together (summed) to calculate
the final score.
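A summative scale of this kind can be sketched in a few lines. The item names, answer key, and point values below are hypothetical; items without a listed point value are assumed to be worth one point.

```python
def summative_score(responses, answer_key, points=None):
    """Sum the point values of all items answered correctly.

    responses:  dict mapping item name -> the examinee's answer
    answer_key: dict mapping item name -> the correct answer
    points:     optional dict mapping item name -> point value (default 1)
    """
    points = points or {}
    return sum(
        points.get(item, 1)
        for item, correct in answer_key.items()
        if responses.get(item) == correct
    )


# Hypothetical three-item quiz in which the third item is worth two points.
answer_key = {"q1": "b", "q2": "d", "q3": "a"}
points = {"q3": 2}
responses = {"q1": "b", "q2": "c", "q3": "a"}
print(summative_score(responses, answer_key, points))  # 3
```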
The example of the scaling for your next test is fairly straightforward; however,
scaling can be an extremely complicated process. Scales may be defined in several
different ways. For example, scales may be defined by whether they are nominal, or-
dinal, interval, or ratio. Scales may also be defined by whether they are rating scales
or comparative scales or by whether they are unidimensional or multidimensional. For
example, some tests use rating scales, which require examinees to rate test items (e.g.,
"On a scale of 1 to 10, with 1 being poor and 10 being excellent, rate the service you
received from your waiter"). On some tests that use such rating scales, the ratings are
summed for the final score; therefore, they are summative tests (Cohen & Swerdlik,
1999). Rating scales may take many forms. In some instances, true-false tests may be
considered rating scales (e.g., "I felt depressed this morning." Circle one: True/False),
or rating scales may be written as a series of faces — such as a sad face, a medium face,
and a happy face — that examinees should circle. A very popular type of rating scale
is the Likert scale, which allows examinees to choose from a continuum of five re-
sponses, usually with Agree or Approve on one end of the continuum and Disagree
or Disapprove on the other end. Comparative scales are somewhat similar to rating
scales. When comparative scales are used, an examinee might be given items to sort
or rank in a certain order (e.g., from most to least appealing, or from worst to best).
Another way of defining a scale is whether it is unidimensional or multidimen-
sional. Unidimensional scales are those in which numbers are assigned only to one di-
mension; multidimensional scales are those in which several different dimensions may
underlie the examinee's responses (Cohen & Swerdlik, 1999). For example, if a re-
sponse to a test item may be interpreted in many different ways, it is likely that the
item is part of a multidimensional scale. All of the scales mentioned to this point
yield ordinal scores.
Two other types of scales are the Guttman scale and the Thurstone scale. The
Guttman scale is an ordinal scaling method in which items are arranged to form a hi-
erarchy, so that an examinee who agrees with or confirms one item on the hierarchy
also agrees with or confirms the items lower than that item on the hierarchy but dis-
agrees with or disconfirms the items higher than that item on the hierarchy. The
Guttman scale is also called the deterministic or monotone model. Thorndike (2005,
p. 393) gave the following example of a Guttman scale:
1. Abortion should be available to any woman who wishes one.
2. Abortion should be legal if a doctor recommends it.
3. Abortions should be legal whenever the pregnancy is the result of rape or incest.
4. Abortion should be legal whenever the health or well-being of the mother is en-
dangered.
5. Abortion should be legal only when the life of the mother is endangered.
Such a graduated scale presumes that a respondent selecting response choice 1
also agrees with the conditions listed in choices 2 through 5. Conversely, an individ-
ual selecting choice 5 would be presumed to not agree with choices 1 through 4.
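The hierarchy presumed above can be checked mechanically. In the sketch below, a response pattern is a list of agree/disagree values ordered as in the example (item 1 first), and a pattern is treated as consistent with a Guttman scale when agreement with any item implies agreement with every item listed after it. The function name and the example patterns are hypothetical.

```python
def is_guttman_consistent(endorsed):
    """Return True if the pattern fits a perfect cumulative (Guttman) hierarchy.

    endorsed: list of booleans ordered from the hardest-to-endorse item to the
    easiest. Once an item is endorsed, every later item must be endorsed too.
    """
    return all(earlier <= later for earlier, later in zip(endorsed, endorsed[1:]))


# Agrees with items 3 through 5 only: consistent with the hierarchy.
print(is_guttman_consistent([False, False, True, True, True]))  # True

# Agrees with item 1 but not item 2: violates the presumed hierarchy.
print(is_guttman_consistent([True, False, True, True, True]))   # False
```

Real response data rarely fit the hierarchy perfectly, so a check like this is best read as flagging patterns that depart from the presumed ordering.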
The Thurstone scale is a scaling method that yields interval data (Cohen &
Swerdlik, 1999). In this method, items are rated by a group of judges, and means
and standard deviations of the judges' ratings are calculated for all of the items.
Then, items on which most judges agreed (or items with low standard deviations) are
included in the test. Finally, the examinee rates the items, and the examinee's score
is determined by the judges' ratings of the items the individual selects. The
Thurstone scale is also called the probability or nonmonotone model or the equal-
appearing interval model. The type of scale that is used in a test should be selected
according to the variables being measured and the examinees for whom the test is
intended.
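The Thurstone procedure described above can be sketched as follows. Several details are assumptions made for illustration: the judges' rating scale, the standard deviation cutoff used to decide that judges "agreed," the use of the population standard deviation, and scoring the examinee as the mean of the scale values of the selected items. All names and data are hypothetical.

```python
from statistics import mean, pstdev


def thurstone_scale_values(judge_ratings, max_sd):
    """Keep items on which judges agreed and return each item's scale value.

    judge_ratings: dict mapping item name -> list of judges' ratings
    max_sd:        largest standard deviation still counted as agreement
                   (an assumed cutoff, not a value from the text)
    """
    return {
        item: mean(ratings)
        for item, ratings in judge_ratings.items()
        if pstdev(ratings) <= max_sd
    }


def thurstone_score(selected_items, scale_values):
    """Score an examinee as the mean scale value of the items they selected."""
    values = [scale_values[item] for item in selected_items if item in scale_values]
    if not values:
        raise ValueError("examinee selected no retained items")
    return mean(values)


# Hypothetical ratings from four judges.
judge_ratings = {
    "item_a": [2, 3, 2, 3],
    "item_b": [9, 8, 9, 10],
    "item_c": [1, 11, 6, 3],   # judges disagree widely, so this item is dropped
}
scale_values = thurstone_scale_values(judge_ratings, max_sd=1.5)
print(sorted(scale_values))                                   # ['item_a', 'item_b']
print(thurstone_score(["item_a", "item_b"], scale_values))    # 5.75
```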
Approaches to Test Construction
After a test developer has defined the general purpose of the test, identified the ex-
aminees who are to take the test, described the theory on which the test is based, de-
cided whether the test will be norm referenced or criterion referenced, outlined the
objectives that will be measured, and selected a scaling method, the developer must
choose an approach to test construction. Approaches to test construction can be di-
vided into three basic categories: the rational approach, the empirical approach, and
the bootstrap approach (Janda, 1998).
Test developers who choose the rational approach rely on reason and logic to
create items instead of relying on collecting data for statistical analysis when con-
structing items (Janda, 1998). The rational approach is also called the theoretical ap-
proach because the test developers are theorizing that the items are related to the con-
structs they are attempting to measure (Hansen, 1999). Your instructor will probably
use the rational approach when constructing your next test. In contrast, test devel-
opers who choose the empirical approach rely on data collection to identify items
that relate to the construct they are attempting to measure. In this approach, items
are developed randomly, and whether items are used is based on the data gathered
when the items are administered to a pool of examinees participating in the test con-
struction process (Janda, 1998). Two different methods used in the empirical ap-
proach are the method of contrast groups (in which items are examined based on the
different responses of two or more groups of people who are selected because of cer-
tain characteristics that each group has in common) and the method of item cluster-
ing (in which factor analysis is used to identify which items correlate with one an-
other) (Lichtenberg, 1999). The bootstrap approach is a combination of the rational
approach and the empirical approach in that items are written based on a theory (in-
stead of randomly), and then empirical procedures are used to verify that the items
actually measure the construct they are theorized to measure (Janda, 1998). Another
name for the bootstrap approach is the sequential method (Lichtenberg, 1999).
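To make the method of contrast groups more concrete, the sketch below compares how often each item is endorsed in two groups and keeps the items whose endorsement rates differ by at least a chosen amount. The cutoff, the group data, and the function names are hypothetical; an actual test developer would rely on formal statistical tests and much larger samples.

```python
def endorsement_rate(responses):
    """Proportion of a group endorsing an item (1 = endorsed, 0 = not)."""
    return sum(responses) / len(responses)


def contrast_group_items(group_a, group_b, min_difference):
    """Return items whose endorsement rates differ between two groups.

    group_a, group_b: dicts mapping item name -> list of 0/1 responses
    min_difference:   smallest absolute rate difference worth keeping
                      (an assumed, illustrative cutoff)
    """
    kept = {}
    for item in group_a:
        diff = endorsement_rate(group_a[item]) - endorsement_rate(group_b[item])
        if abs(diff) >= min_difference:
            kept[item] = diff
    return kept


# Hypothetical responses from two contrasting groups of four people each.
group_a = {"item_1": [1, 1, 1, 0], "item_2": [1, 0, 1, 0]}
group_b = {"item_1": [0, 0, 1, 0], "item_2": [1, 0, 0, 1]}
print(contrast_group_items(group_a, group_b, min_difference=0.3))
# {'item_1': 0.5}
```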
A Test Development Example
The reader now has a basic understanding of many of the decisions that a test devel-
oper must consider in order to thoroughly delineate the purpose of the test. General
examples of the concepts have been provided, but a more specific example may give
a clearer picture of this crucial step in the test construction process. The Black
Adolescent Racial Identity Scale (BARIS) (see Figure 6.1) constructed by Sheperis
(2001) serves as an example demonstrating the development of a test purpose.
BARIS
Instructions: Each item may or may not be true for you. To the right of each item is a set of choices that
describes how you think about the item. Select one of the choices by circling the number below it:
    Strongly Agree    Agree    Disagree    Strongly Disagree
          4             3         2                1
Please answer every item, and make only one choice per item. There are no right or wrong answers.
If a question does not seem to apply to you, imagine a time that it might and answer the question
based on your thought.

Sample Question:                                        Strongly Agree   Agree   Disagree   Strongly Disagree
A. I like pizza.                                              4            3        2               1

Question:                                               Strongly Agree   Agree   Disagree   Strongly Disagree
1. It is important to take part in Black activities.          4            3        2               1
2. Whites get more chances in life.                           4            3        2               1
3. It is good to be around Blacks and other races.            4            3        2               1
4. Whites are more trustworthy than Blacks.                   4            3        2               1
5. It is easier to get along with Black people.               4            3        2               1
6. People should be proud of their race.                      4            3        2               1
7. Teenagers should only date people from the
   same race.                                                 4            3        2               1
8. People from all races have good things about
   them.                                                      4            3        2               1
9. It is good to get along with all kinds of people.          4            3        2               1
10. Children should know what it means to be
    Black.                                                    4            3        2               1
11. White counselors are better than Black
    counselors.                                               4            3        2               1
12. It is good to do things with people from all
    types of backgrounds.                                     4            3        2               1
13. It is OK to date somebody from another race.              4            3        2               1
14. White friends are better than Black friends.              4            3        2               1
15. People from all races should get along.                   4            3        2               1
16. It's OK for Whites and Blacks to mix.                     4            3        2               1
17. Black counselors understand kids better than
    White counselors.                                         4            3        2               1
18. It is better to have lighter skin.                        4            3        2               1
19. Whites have nicer hair than Blacks.                       4            3        2               1
20. It is important to belong to a Black church.              4            3        2               1
21. It is good to learn about the race and
    background of others.                                     4            3        2               1
22. It is better to be more like Whites.                      4            3        2               1
Figure 6.1 The Black Adolescent Racial Identity Scale (BARIS)
Sheperis (2001) created the BARIS "to measure racial identity development
(RID) in Black adolescent males" (p. vii). This statement outlines the general purpose
of the test, including the theoretical basis for the goals of the test and the examinees
for whom the test is designed. Rather than simply creating a test to measure racial
identity development, Sheperis constructed the test for the ultimate goal of using the
information from the test to provide effective counseling programs for Black adoles-
cent males who are involved in the juvenile justice system. The implicit theory that
the test is based upon is twofold. First, the theory is that racial identity development
occurs in measurable statuses (defined by Sheperis for the purposes of the test as
assimilation, self-segregation, and universal acceptance). Additionally, the theory is that
knowledge of the racial identity development of Black adolescent males would lead
to more effective counseling programs. As noted previously, the examinees are iden-
tified as Black adolescent males.
The next step that Sheperis (2001) had to consider when constructing the BARIS
was whether the test would be criterion referenced or norm referenced. Because the
purpose of the test is to compare characteristics of individuals (characteristics indicat-
ing individuals' status of racial identity development) within a specified group (Black
adolescent males), a norm-referenced test was an appropriate choice for the BARIS. As
such, Sheperis did not need to consider specific criteria or objectives that the test
would measure. However, he did need to consider the way he would go about meas-
uring the different statuses of racial identity development, but this is somewhat dif-
ferent from defining objectives and is discussed in the next section.
The next question that Sheperis (2001) had to consider was the question of the
scaling method he would use for the BARIS. He selected a 4-point scale. Individual
items were designed to reflect the different statuses of racial identity, and response
scores were summed to yield raw scores for each of the three statuses. Thus the scal-
ing method was a summative rating scale.
The final consideration that Sheperis (2001) had to take into account when
defining the purpose of the BARIS was the approach to test construction that he
would use. He used the bootstrap approach, or sequential model, which is a combi-
nation of the rational approach and the empirical approach. He wrote items based
on the theory of racial identity development after careful study of other measures of
racial identity development and then identified the items to include in the test
through empirical methods. An overview of the BARIS is provided in Box 6.1.
OBSERVABLES
Now that the purpose of the next course exam is known (including more informa-
tion than you ever expected to be related to the purpose of any test), you may won-
der what content the test will cover. Of course, you know the goals of the test and
the instructional objectives that need to be mastered, but to really prepare for the
test, you need to know exactly how the instructor is going to go about measuring
whether students have met the objectives — for example, whether the test questions
will require application of knowledge through scenarios or simply straightforward
answers directly from this textbook.
The instructor's decision about how to assess the content to be covered by the
course exam is a question of observables. Observables are the specific variables and
behaviors that are observable aspects of the construct stemming from the implicit
theory. In terms of the course exam, the implicit theory is that test performance is
related to knowledge of course content. The answers given to the questions that the
instructor chooses to ask on the test are the specific behaviors the instructor will ob-
serve to determine whether students have mastered the course content.
Box 6.1 Overview of the BARIS
The Black Adolescent Racial Identity Scale (BARIS) was developed in several
phases. Initial items for the BARIS were generated through a review of existing
racial identity development (RID) scales and with attention to the tri-status
model of racial identity development. The initial version of the BARIS, which
was subjected to expert review, contained 59 items related to three RID sta-
tuses: assimilation, self-segregation, and universal acceptance. In the initial
phase of this study, 327 participants from Mississippi school districts com-
pleted the BARIS and a feedback form. A factor analysis was used to identify
the initial factor structure of the initial BARIS version. Based on the respective
factor loadings on the three BARIS factors (i.e., assimilation, self-segregation,
and universal acceptance), 37 items were eliminated from the initial instru-
ment, leaving the 22 items comprising the final version of the BARIS.
In an attempt to establish the concurrent and divergent validity (dis-
cussed in Chapter 4) of the BARIS, a second phase of the study was con-
ducted in which the BARIS was administered to 126 Black adolescent males
from juvenile offender programs in Mississippi, Florida, and Pennsylvania.
One of three additional RID instruments was administered to subgroups of
25 participants along with the BARIS. The instruments included in this
phase of the study were the Racial Identity Attitude Scale, the Multigroup
Ethnic Identity Measure (MEIM), and the Adolescent Survey of Black Life.
In order to establish a reliability estimate, Cronbach's alpha (discussed in
Chapter 3) was computed for BARIS scores from the second phase of the
study. Demographic information related to age, racial designation, socioeco-
nomic status (SES), arrests, and involvement in the juvenile justice system
was collected from participants in the second phase of the study. The results
of this study showed statistically significant differences in scores based on de-
mographic characteristics. With regard to concurrent validity, two statisti-
cally significant correlations emerged from the analysis. Evidence of diver-
gent validity was demonstrated by the lack of statistically significant
correlations between the BARIS Assimilation and Universal factor scores and
all scales of the MEIM.
Defining Observables
Test developers should use several steps to specify observables. First, they must define
the content and skills to be measured by the test. This step is similar to defining objectives
for a criterion-referenced test; however, it applies to other types of tests as well.
In a criterion-referenced test, the objectives may also serve as the content of the test.
In other types of tests, the content or skills to be measured are more difficult to de-
fine and are usually guided by the theory on which the test is based. Next, test de-
velopers must describe traits or characteristics related to the content domain in behav-
ioral terms. That is, they must decide what behaviors indicate that a person has
certain traits or characteristics and describe the way in which they will measure those
behaviors. For example, when constructing a course exam, the instructor will prob-
ably identify the behavior of answering questions as an indicator that students have
the trait of being knowledgeable of the course content; however, answering questions
is only one example of a behavior that a test developer can choose to measure. A
physical education instructor would probably not choose answering questions as the
behavior to measure whether the students were physically fit. Instead, the instructor
might choose and describe several physical tasks for the students to perform to indi-
cate their level of physical fitness. Finally, the test developer may need to perform a
job analysis, breaking the behavior chosen for observation into its smaller required
tasks and skills. For example, the instructor should recognize the tasks students must
complete to answer the questions on the next course exam (i.e., comprehending each
question, recalling the information gained in class and from the textbook, synthesiz-
ing that information to decide on a response, planning the response, and writing a
response using correct grammar and readable handwriting). By breaking the job of
answering the questions into its smaller parts, the instructor can better understand
student responses and how they reflect knowledge of the course content.
An Example of Observables
Using the BARIS as an example, Sheperis (2001) defined the observables of the test
through the following steps: First, he identified the content domain through consid-
eration of the theory of racial identity development and a thorough review of other
tests that have purported to measure racial identity development. The content areas
he chose to measure were assimilation, self-segregation, and universal acceptance.
Next, he defined the traits associated with the identified content areas in behavioral
terms. In this step, Sheperis (2001) classified statements of beliefs about race into
the different categories that were defined by the content areas. He identified exami-
nee behaviors as agreeing or disagreeing with the belief statements through their re-
sponses on a Likert scale. Thus, responding to the test items became the observable
behavior Sheperis used to measure examinees' status in racial identity development.
Because of the nature of the BARIS, Sheperis did not conduct a job analysis of the
test items but did conduct a factor analysis.
ITEM GENERATION
Now you know that the questions your instructor is going to ask you on your next
test are essentially observables. So, if the test items themselves are really small observ-
able behaviors that the instructor is choosing to determine whether students have
adequate knowledge of the course content, it follows that the instructor will proba-
bly give a great deal of attention to writing the items themselves. Likewise, students
will have many questions about the test items when preparing to study for the test.
Students will probably ask how many items will be on the test and what percentages
of the test will cover the different content areas included on the test. Students may
also ask what the item format will be. These are questions that all test developers
must answer when generating test items. They must give special consideration to the
number of items to devote to certain topics or areas and the format of the test items.
Allocating Proportionate Numbers of Items
As you know, answers to test items are samples of behavior. It is important to keep
the word samples in mind. In most instances, it would be virtually impossible for a
test to thoroughly measure all aspects of a content area or construct for the simple
reason that it would be far too time consuming. Therefore, items must be chosen to
provide a representative sample of the behaviors that are included in the content area
or construct that the test purports to measure (Hopkins, 1996). Furthermore, it is
crucial that the proportion of test items devoted to each topic or area covered by the
test reflects the importance of each of the individual areas being measured.
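The idea of proportionate allocation can be illustrated with a short sketch. The function name, the content areas, and the importance weights below are hypothetical and serve only as an illustration; the largest-remainder rounding rule is one reasonable choice among several, not a procedure prescribed by the sources cited here.

```python
def allocate_items(total_items, weights):
    """Split total_items across content areas in proportion to their weights.

    Uses the largest-remainder method so the counts always sum to total_items.
    """
    total_weight = sum(weights.values())
    # Exact (possibly fractional) share of items for each content area.
    exact = {area: total_items * w / total_weight for area, w in weights.items()}
    counts = {area: int(share) for area, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give any leftover items to the areas with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda a: exact[a] - counts[a], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


# Hypothetical 40-item course exam; the weights are illustrative judgments
# of importance, not values taken from the text.
blueprint = allocate_items(40, {"reliability": 3, "validity": 3, "test construction": 2})
print(blueprint)
```

In this sketch, an area judged to be more important simply receives a larger weight and therefore a larger share of the items.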
Selecting an Item Format
After test developers have decided what proportions of the test will be devoted to
different topics or areas, they must select the format of the items. There are many
item formats from which to choose, including the free-response format, the multi-
ple-choice format, the true-false format, the Likert scale format, and many others.
The format selected depends on what the examiner wants to know and provides a
useful method for getting that information. If the test itself is well constructed, there
is no technical advantage in using any one particular format for the items; however,
test developers should choose an item format based on their own preferences, the
setting in which the test will be used (Janda, 1998), and the type of information
needed. Additionally, when choosing a format, test developers should be aware of
the advantages and disadvantages associated with different item formats. For exam-
ple, although in some instances multiple-choice formats may not be well suited to
measure a broad cognitive range, multiple-choice tests are easy to score and quick to
administer. Free-response formats may provide test administrators with more infor-
mation about the examinees' thought processes, but tests using this format are more
difficult to score and more expensive to administer (Martinez, 1999).
Descriptions of Item Formats
Item formats may be very simple, or they may be quite complex. The simplest for-
mat is the dichotomous format, in which examinees are given two alternatives they
must choose between in order to respond to each item. (Note: A true-false item is a
dichotomous test item because the examinee must choose from two possible re-
sponses — true or false.) Dichotomous formats are used not only for achievement
tests but also for personality tests (Whiston, 2005). Some advantages of the dichoto-
mous format are the ease with which tests in this format can be administered and
scored and the fact that the examinees must use absolute judgment or decisiveness
200 Chapter 6
in choosing between the responses rather than being uncertain or vague. A major
disadvantage of the dichotomous format when applied to an educational achieve-
ment test is that examinees have a 50% chance of getting an item correct, and it may
be difficult to determine whether examinees are merely guessing.
Another relatively simple item format is the polytomous format. The polyto-
mous format is much like the dichotomous format except that the examinee is given
more than two response choices. (Multiple-choice items and matching items are
items written in a polytomous format.) Advantages of tests that use the polytomous
format include ease of administering and scoring. Also, compared with the dichoto-
mous format, it is less likely that an examinee will get a correct answer by guessing
on an item written in the polytomous format. The polytomous and dichotomous
formats are used for all types of tests and are sometimes referred to collectively as the
selected-response format (Cohen & Swerdlik, 1999).
Both the dichotomous format and the polytomous format are item formats that
an instructor may use on your next test because they are well suited to achievement
tests. An item format that the instructor is not likely to use is the Likert format, de-
scribed earlier in this chapter, because it also represents a scaling method. As you re-
member, the Likert format requires examinees to indicate whether or not they agree
with a statement or question by selecting from five choices that represent a contin-
uum from Agree to Disagree. The Likert format is often used for personality, atti-
tude, career, and aptitude tests (Whiston, 2005).
Another item format available to test developers is the category format. This for-
mat is very similar to the Likert format in that examinees are asked to rate items;
however, examinees are given more choices for an item written in the category for-
mat than they are given for an item written in the Likert format. For example, in-
stead of having 5 choices representing the continuum, examinees may have 10
choices (give or take a few). Giving examinees more choices along a continuum al-
lows them to make finer distinctions in their ratings of the items (Whiston, 2005).
Two other item formats that are sometimes used in personality tests are the
checklist format and the Q-sort format. The checklist format requires examinees to
read through a list of words or statements and check the ones that describe them-
selves or their opinions, beliefs, or attitudes. Effectively, there are two possible re-
sponses an examinee may choose for each item: checked (applies to examinee) or
not-checked (does not apply) (Whiston, 2005). The Q-sort format allows examinees
to describe themselves or others. Examinees are given statements and asked to sort
them into a specified number of piles (e.g., nine) to indicate the degree to which they
apply to the person they are describing. Examinees would place statements that did
not apply in pile 1 and statements that definitely applied in pile 9.
A final item format test developers may choose to use is the constructed-response
format (also called the free-response format) (Janda, 1998), which requires examinees to
construct their own responses instead of choosing from a selection of responses. There
are three types of constructed-response items: the completion item, the short-answer
question, and the essay question (Cohen & Swerdlik, 1999). The completion item re-
quires an examinee to respond by supplying a word or phrase to complete a sentence.
You may know completion items as fill-in-the-blank items. The short-answer question
requires examinees to respond by writing a short answer to a question (probably no
longer than a paragraph and possibly as short as a single word). The essay question also
requires an examinee to write an answer to a question; however, in most cases, the an-
swer should be longer than a paragraph (Cohen & Swerdlik, 1999). The constructed-
response format is often used for items on tests like a course exam. The advantages of
using this type of format include the possibility of assessing examinees' understanding
of course content on a deeper level than the level that may be assessed by other item
formats. Disadvantages include difficulty in scoring and the length of time examinees
may take to answer short-answer and essay questions.
Think About It 6.1 What type of test item format would be the most
effective to measure your ability to understand the information in this chapter?
What types of item formats do you prefer? What types do you dislike?
Why?
An Example of Item Generation
When Sheperis (2001) was generating the items for the BARIS, he first had to deter-
mine how many items to devote to each of the three statuses of racial identity devel-
opment that the test was intended to measure (assimilation, self-segregation, and
universal acceptance). He chose the proportion of items that would apply to each
status. The number of items applying to each status is roughly equivalent, and any
differences in proportion are accounted for in the scoring procedures.
The next decision Sheperis (2001) had to make was which item format he
would use. Although the dichotomous format is often used in personality and atti-
tude assessments, Sheperis chose the Likert format, which gave examinees more lat-
itude to describe their beliefs than the dichotomous format would have. The di-
chotomous format would have allowed examinees only to agree or disagree.
TECHNICAL ANALYSES
Many counseling students will take a comprehensive exam prior to graduation.
Today many counseling programs use a standardized exam developed by the Center
for Credentialing and Education (CCE; www.cce-global.org), called the Counselor
Preparation Comprehensive Examination (CPCE). Part of the reason for adopting a
standardized exam is the difficulty involved in developing appropriate items from se-
mester to semester. It is much easier and less expensive for university counseling pro-
gram faculty to use a published instrument than to develop a quality comprehensive
exam on their own. Developing good items for a test requires the test author to eval-
uate each item in a number of ways. This process of evaluation is typically referred
to as item analysis and involves an examination of item difficulty and item discrim-
ination. Item analysis involves a variety of statistical techniques, and the process can
be quite complex. Only a cursory overview of the process is presented here. Readers
interested in a more in-depth discussion of item analysis are referred to Anastasi &
Urbina (1997).
Item Difficulty
When preparing for a "comprehensive exam," it is important to recognize that stu-
dents probably won't answer all of the items correctly. These types of exams are usu-
ally criterion exams and are based on an examination of minimal competency in re-
lation to a criterion rather than on competition among examinees. Some of the items
will be difficult for most examinees to answer. So why not make the questions eas-
ier? Let's assume that all examinees pass the comprehensive exam with flying colors.
This would indicate that each student has met the minimum criterion for knowl-
edge of practice in counseling. However, because the test items did not discriminate
among examinees, it would be difficult, if not impossible, to make this assertion.
Thus some students who did not possess adequate knowledge of the profession
would be granted degrees. Because a main ethical principle is to "do no harm,"
creating a test that everyone could pass would be highly unethical. Conversely, if one
created a comprehensive exam that no one could pass, then it would still fail to dis-
criminate among students. Professors would also have a large number of disgruntled
students to manage. Thus the task of item development is complex.
Item difficulty is a central issue in the technical analysis of a test, especially
measures of achievement or ability. Item difficulty is defined in terms of the proportion
of examinees who answer an item correctly. Thus, if 50% of the participants answer a
particular item correctly, that item has an item difficulty index of 0.50. Would this be a
good item? Is it difficult enough? The essence of item difficulty analysis is to deter-
mine the degree to which an examinee could correctly answer an item by chance
alone. If the item with a 0.50 difficulty index is a true-false question, the examinee
would have a 50% likelihood of getting the right answer by chance. Although as a
student you might like these odds, the truth is that the item would not discriminate
adequately between those who truly knew the answer and those who did not.
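As a minimal sketch of the definition above, the item difficulty index can be computed as the proportion of examinees who answer each item correctly. The function name and the small response matrix are hypothetical; each row stands for one examinee and each column for one item scored 1 (correct) or 0 (incorrect).

```python
def item_difficulty(responses):
    """Return the proportion of examinees answering each item correctly.

    responses: list of rows, one per examinee; each row holds 1 (correct)
    or 0 (incorrect) for every item on the test.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]


# Hypothetical data: four examinees, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
print(item_difficulty(responses))
```

Here the first item was answered correctly by everyone, the second by half of the examinees, and the third by one examinee in four.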
So how does one set an appropriate discrimination index and make sure that
each item meets this index? The first step is to determine the percentage of correct
responses related to chance. To illustrate, let's continue with the true-false item and
the 50% rate due to chance. To establish the usefulness of this true-false item, we
must seek a discrimination index that is higher than 50%. Based on best practices in
the field, we usually set the difficulty level halfway between a difficulty level of 100%
(i.e., everyone getting the item right) and the rate of chance (i.e., 50%). To calculate
the optimum difficulty level for our sample item, we subtract the chance level (50%)
from the 100% success level and then divide the result by 2. The last step is to add
the result of our division to the chance rate, thus providing an optimum difficulty
level. In this case,
\[ \frac{1.00 - 0.50}{2} = \frac{0.50}{2} = 0.25, \qquad 0.25 + 0.50 = 0.75 \ \text{(optimal item difficulty level)} \]
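The same calculation can be written as a small function. This is a sketch only: the function name is hypothetical, and treating the chance rate as 1 divided by the number of response choices is an assumption that extends the true-false example to other selected-response items.

```python
def optimal_difficulty(n_choices):
    """Halfway point between the chance rate and a perfect success rate.

    Assumes every response choice is equally likely to be picked by a
    guessing examinee, so the chance rate is 1 / n_choices.
    """
    chance = 1 / n_choices
    return chance + (1.0 - chance) / 2


print(optimal_difficulty(2))  # true-false item
print(optimal_difficulty(4))  # four-option multiple-choice item (assumed example)
```

Under this assumption, adding response choices lowers the chance rate and therefore lowers the optimum difficulty level as well.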
Thus it would be expected that 75% of individuals attempting this item would
answer it correctly. Considering the purpose of comprehensive exams (i.e., minimum
competency), this might be an appropriate difficulty level. However, it is important
to vary the difficulty level of items throughout the exam. Most people have taken a
test in which the first item completely stumped them and the resulting performance
suffered whether one knew the remaining answers or not. For this reason, a good
approach to test construction is to place easier items at the beginning of a test and to
increase item difficulty as the test progresses. This allows examinees a chance to build
confidence in their performance and may reduce anxiety surrounding the test
situation. In some cases, test authors may even provide items at the beginning of a test
that have a 1.0 item difficulty index to increase the positive psychological state of
examinees. However, it should be noted that items that approach 1.0 or 0.0 are typically
discarded because of their inability to discriminate among respondents. The typical
item difficulty index ranges between 0.30 and 0.70 for most tests in which responses
are marked right or wrong. However, some test authors seeking greater scrutiny of
test-taker knowledge may employ a sample of more difficult items. For example,
some states are now employing a clinical exam for licensure as a professional
counselor. This type of exam is usually related to practice knowledge as opposed to the
theory knowledge inherent in the "comprehensive exam" example. Because the
public welfare is at stake with regard to a licensure exam, it would make sense to have
greater scrutiny of applicants through the use of more difficult items.
Item Discrimination
In theory, the purpose of an item discrimination index is to help assess the quality of
a particular item. This task is achieved by examining the relationship between total
test performance and performance on each individual item. By determining this re-
lationship, we can decide if an item discriminates positively, discriminates negatively,
or does not discriminate at all. A positively discriminating item is one that is answered
correctly more often by those who perform well on the test. In contrast, a negatively
discriminating item is one that is answered correctly by those who perform poorly on
the test. A nondiscriminating item fails to indicate a relationship between correct re-
sponse and test performance. There are numerous statistically derived, computer-
generated, item discrimination indices, and the reader is referred to an SPSS manual
or statistics text for in-depth study.
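One of the simpler indices can be sketched by comparing the highest and lowest scorers on the total test. This upper-lower index is offered only as an illustration; it is not the specific index used by any instrument discussed here. The function name, the data, and the default of comparing the top and bottom 27% of examinees (a common convention) are assumptions.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index for a single item.

    item_scores: 1 (correct) or 0 (incorrect) for each examinee.
    total_scores: each examinee's total test score, in the same order.
    Returns the proportion correct in the top group minus the proportion
    correct in the bottom group (ranges from -1.0 to 1.0).
    """
    ranked = sorted(zip(total_scores, item_scores),
                    key=lambda pair: pair[0], reverse=True)
    n_group = max(1, round(len(ranked) * fraction))
    upper = [item for _, item in ranked[:n_group]]
    lower = [item for _, item in ranked[-n_group:]]
    return sum(upper) / n_group - sum(lower) / n_group


# Hypothetical total scores for eight examinees and their results on three items.
totals = [38, 35, 33, 30, 22, 20, 18, 15]
item_a = [1, 1, 1, 0, 1, 0, 0, 0]  # answered correctly mostly by high scorers
item_b = [0, 0, 1, 1, 0, 1, 1, 1]  # answered correctly mostly by low scorers
item_c = [1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to total score

for name, item in [("a", item_a), ("b", item_b), ("c", item_c)]:
    print(name, discrimination_index(item, totals))
```

A positive value corresponds to a positively discriminating item, a negative value to a negatively discriminating item, and a value near zero to a nondiscriminating item.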
Some professional counselors may find the discussion of psychometric evalua-
tion, such as item discrimination, tedious and may even wonder how these types of
analyses will apply to work in the field. Although few students will likely pursue a ca-
reer in test construction, it is important to be a qualified user of psychological in-
struments in order to function in future work settings. Part of being a qualified user
means understanding how to evaluate the usefulness of an instrument as well as un-
derstanding the usefulness of items within the instrument. The item discrimination
index functions as an indicator of the quality of an item. If one is attempting to in-
terpret the results of a test by comparing an individual's responses to a norm group,
the item discrimination index tells the degree of confidence one can have in making
an interpretation based on a response to a particular item.
Think About It 6.2 Consider such exams as the Scholastic Assessment Test
(SAT) and the Graduate Record Exam (GRE). Why would assessing item
difficulty and item discrimination for these tests be especially important?
Norms
In order to make individual raw scores or individual scale scores meaningful, test au-
thors often administer the instrument to a large comparison sample, or norm group.
The examinee's raw score is usually transformed to a standard score (e.g., z-score, T
score, percentile rank, deviation IQ, or stanine) and then compared to the perform-
ance of other individuals with similar characteristics (e.g., age, grade, gender, ethnic-
ity, etc.). This population of individuals is referred to as the standardization sample,
normative sample, or the norm group. The comparison scores are called derived scores
and are placed into two groups: developmental scores and scores of relative standing
(Salvia & Ysseldyke, 2004).
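The transformation of a raw score into scores of relative standing can be sketched as follows. The function name and the tiny norm group are hypothetical. The scaling constants (T scores with a mean of 50 and standard deviation of 10; deviation IQs with a mean of 100 and standard deviation of 15) are the conventional ones, and the percentile rank is computed here as the percentage of the norm group scoring below the raw score, which is only one of several definitions in use.

```python
from statistics import mean, pstdev


def standard_scores(raw, norm_scores):
    """Convert a raw score to several derived scores using a norm group."""
    m = mean(norm_scores)
    sd = pstdev(norm_scores)  # treats the norm group as the full reference group
    z = (raw - m) / sd
    below = sum(1 for score in norm_scores if score < raw)
    return {
        "z": z,
        "T": 50 + 10 * z,
        "deviation_iq": 100 + 15 * z,
        "percentile_rank": 100 * below / len(norm_scores),
    }


# Hypothetical norm group of eight raw scores (mean 5, standard deviation 2).
norm_group = [2, 4, 4, 4, 5, 5, 7, 9]
result = standard_scores(7, norm_group)
print(result)
```

A real norm group would be far larger than eight scores; the small sample simply keeps the arithmetic easy to follow.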
Many tests use a procedure called stratified sampling, which seeks to sample the
general population by replicating the percentage of participants according to demo-
graphic characteristics. Some important demographic characteristics commonly used
include sex (i.e., male, female); age (in years); grade (for achievement tests); race (e.g.,
White, African American, Asian American, Hispanic American, Native American);
region of (U.S.) residence (e.g., south, west, northeast, north central); socioeconomic
level (e.g., parent educational attainment, family income, parent occupational sta-
tus); and area of residence (e.g., urban, suburban, rural). In America, the U.S.
Census is consulted, and participants are sampled according to their occurrence in
the general population (e.g., 50% male, 50% female).
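A minimal sketch of how sampling quotas might be derived from population percentages follows. The function name and the proportions are hypothetical (they are not census figures), and multiplying the percentages for each characteristic assumes the characteristics are independent of one another, which is a simplification.

```python
from itertools import product


def stratum_quotas(sample_size, proportions_by_variable):
    """Number of participants to recruit for each combination of strata.

    proportions_by_variable maps each demographic variable to a dict of
    category -> population proportion. Variables are assumed independent.
    """
    quotas = {}
    variables = list(proportions_by_variable)
    category_lists = (proportions_by_variable[v].items() for v in variables)
    for combo in product(*category_lists):
        labels = tuple(label for label, _ in combo)
        share = 1.0
        for _, proportion in combo:
            share *= proportion
        quotas[labels] = round(sample_size * share)
    return quotas


# Hypothetical population proportions, for illustration only.
quotas = stratum_quotas(1000, {
    "sex": {"male": 0.5, "female": 0.5},
    "area": {"urban": 0.5, "suburban": 0.25, "rural": 0.25},
})
for stratum, n in quotas.items():
    print(stratum, n)
```

Because each cell is rounded separately, the quotas may not always add up exactly to the requested sample size; in practice a test developer would adjust the cells by hand or work from a cross-tabulated census table.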
An additional consideration is the number of participants to include in a norm
group. According to Salvia and Ysseldyke (2004), a general rule of thumb is 100 par-
ticipants per age category for screening tests, and 200 participants per age category
for diagnostic tests. Sampling is an absolutely critical consideration in test develop-
ment, and particular attention should be paid to multicultural and diversity consid-
erations. If a norm sample underrepresents key groups (e.g., racial, socioeconomic,
sex), it becomes difficult to support the accuracy of interpretations for those indi-
viduals examined using the test.
As an example, the BARIS was normed on a group of Black adolescent males in
the southern United States. Thus scores for an individual test taker can be compared
to average scores of other Black adolescent males in the same geographic region.
However, the development of norm-referenced scores is not as simple a task as is in-
dicated by this example. The nature of this chapter does not allow for extensive dis-
cussion of the development of norm-referenced scoring procedures. For further in-
formation on this topic, readers are referred to the Standards for Educational and
Psychological Testing (AERA, APA, & NCME, 1999). Readers should also refer to
Chapter 5 in this text for more in-depth discussion on this topic.
SUMMARY/CONCLUSION
The development of psychological tests is an intricate process that often takes several
years to complete effectively. In order to select quality tests, professional counselors
should develop a basic understanding of the test construction process. In general,
test construction occurs in distinct phases:
1. Needs analysis. Because the development of quality tests is such a time-consuming
process, test authors often establish a need for a certain test before beginning
the construction process. Needs analysis can be conducted through formal
surveys or through an analysis of current instruments available (Drummond,
2004).
2. Test purpose. Once a need for a test is established, it is then important to de-
velop clear, behavioral objectives for the development of the proposed instru-
ment. One of the objectives should be related to the construct or content do-
main to be measured (AERA et al., 1999). For example, the BARIS was
designed to measure the racial or ethnic identity of Black adolescent males.
3. Item format. Prior to beginning the development of specific items for an instru-
ment, it is important to determine the appropriate format for meeting the
stated test purpose. Item formats include multiple-choice, forced-choice, open-
response, true-false, essay, or Likert scale (Janda, 1998). In order to provide
respondents with a limited range of choices, a forced-choice response format was
employed for the BARIS, with the choices being (a) Strongly Agree, (b) Agree,
(c) Disagree, and (d) Strongly Disagree.
4. Choosing an approach to test construction. Several approaches to test construction
are available (e.g., rational approach, empirical approach, and bootstrap ap-
proach). The bootstrap approach was used to develop the BARIS. The bootstrap
approach is a combination of the rational approach and the empirical approach.
The item pool for the BARIS was derived from racial identity development the-
ories. Empirical methods of analyses were used to maintain or discard items
from the initial pool.
5. Item development. Writing effective test items is a difficult process. Test items
should be reviewed by a panel of experts to ensure that the items cover the do-
main being measured and to determine the degree to which the items match
the purpose of the test. Previously existing theories and item pools should also
be explored to ensure the items included on a test represent the domain of con-
tent being assessed. Items in the BARIS were reviewed by experts in the field of
multicultural counseling.
6. Pilot test. Prior to administering the instrument to a large sample, a pilot test
should be conducted to determine item difficulty, discrimination, and compre-
hension. Test authors often ask pilot test participants to complete feedback
sheets that ask about the participants' (a) perception of the test, (b) particularly
easy or difficult items, (c) confusing terms, (d) clarity of directions, and (e) gen-
eral concerns. Test authors conduct in-depth item analysis studies to be sure
items "behave" as expected. This process was completed in the initial pilot test
for the BARIS.
7. Item review. After the initial pilot test, it is important to review findings about
item difficulty and discrimination in order to determine items that should be
removed from the item pool. Test authors should also examine items for bias
(i.e., cultural, gender, socio-economic, ability, and sexuality). According to the
Standards for Educational and Psychological Testing (AERA et al., 1999, p. 82),
"Test developers should strive to identify and eliminate language, symbols,
words, phrases, and content that are generally regarded as offensive by mem-
bers of racial, ethnic, gender, or other groups, except when judged to be neces-
sary for adequate representation of the domain."
8. Preparing the test for operational use. Once the pilot test has been completed and
the remaining items are reviewed for bias, it is important to prepare the test for
operational use. This means that the author should review the objectives and
purpose of the test to ensure that the resulting instrument still meets the origi-
nal intent of the author; scoring procedures should be independently verified;
and the instrument should be reviewed by various committees.
9. Establishing the psychometric properties of the test. One of the last steps in test de-
velopment is to establish the technical properties. The test author must deter-
mine an appropriate sample size for the statistical analyses to be performed on
the instrument. Sample size can vary greatly depending on the analyses em-
ployed. Once sample size is determined, the test author administers the instru-
ment, scores it, and computes reliability and validity coefficients (Drummond,
2004). This process can occur in several phases and several individual research
endeavors. Finally, the author develops norms for the test.
10. Ensuring the appropriateness of the norm or criterion group. Test authors provide
norms derived from appropriate sampling procedures (e.g., stratified or selective
samples) that account for multicultural and diversity considerations. Test users
must ensure that the test is used to make decisions only about clients for whom
the test was designed and validated for use.
Think About It 6.3 Why would it be important to carry out all of these
steps when developing a test? What would happen if a step were skipped?
Would the test still be an effective measure of the desired construct? Explain.
KEY TERMS
age range
bootstrap approach
content
criterion referenced
dichotomous format
empirical approach
items
item analysis
item difficulty
item format
norm group
norm referenced
objectives
observables
polytomous format
population
purpose
rational approach
reading ability
scaling
stratified sampling
summative scale
theory
CHAPTER
7
Clinical Assessment
by Bradley T. Erford, Carol Salisbury, Kathleen McNinch,
Carl Sheperis, R. Anthony Doggett, and Ota Masanori
Overall, professional counselors in clinical practice engage in clinical and per-
sonality assessment more frequently than any other type of assessment.
Knowing the characteristics and conditions of clients is important regard-
less of counseling specialty. Clinical and personality assessment is defined and ex-
plored in detail in this chapter, and numerous inventories commonly used by pro-
fessional counselors are presented and reviewed. In addition, the basic process of
clinical interviewing is introduced, both for general and for more specific purposes,
such as when conducting a mental status exam. Personality assessment is viewed
from both the psychoanalytic and the "big-five model" perspectives, thus allowing a
basic introduction to projective and objective personality assessment.
WHAT IS CLINICAL ASSESSMENT?
To some, clinical assessment and personality assessment are one and the same. They
are ways of understanding the dispositions, characteristics, strengths, and limita-
tions of the internal world of a client and how that client interacts and functions
within the client's external world. Some even view personality as a global, holistic,
all-encompassing construct that subsumes all the other facets of life and especially
the facets of assessment covered in this book. In other words, in the broadest sense
of the word, intelligence, aptitude, achievement, career, normal and abnormal be-
havior and emotions, personal adjustment, family, and everything else are sub-
sumed under the category of personality. Unfortunately, while well intentioned,
such a perspective or approach broadens the study of personality far beyond a man-
ageable degree. The perspective taken throughout this chapter is far more
circumscribed. Here, clinical assessment is defined as the measurement of clinical
symptoms and pathology in the human condition; in other words, assessment for the
purpose of clinical diagnosis. Personality assessment, on the other hand, is the
measurement of client traits, needs, motivations, attitudes, or other facets that de-
scribe how the client interacts with the external environment, others within that
environment, and within the client's internal world. While some may view some
of these intrapersonal or interpersonal interactions to be normal or abnormal, the
purpose of personality assessment is more appropriately conceived as describing
the personal functioning of an individual globally or within some context.
While some may view this distinction as artificial, the implications are not.
Professional counselors are often required to diagnose and treat clients with mental
and emotional disorders. A client may present with symptoms of depression, anxi-
ety, disruptive behavior, substance use, and so forth. To diagnose and treat the client
in an ethical and professional manner, professional counselors will rely on tests and
techniques that facilitate the diagnostic and treatment process, and determine the
outcomes of treatment — three primary purposes of clinical assessment. While it may
be helpful to understand the personality characteristics of a client, it is not always es-
sential for effective treatment, particularly when using brief treatment approaches.
When diagnosing and treating clients, professional counselors often use assessment
procedures such as clinical interviewing, structured clinical tests — e.g., the Minnesota
Multiphasic Personality Inventory — Second Edition (MMPI-2), the Millon Clinical
Multiaxial Inventory — III (MCMI-III) — and a mental status exam to facilitate effi-
cient and accurate diagnosis and treatment.
When a client seeks counseling for self-growth or a personal or interpersonal
problem not amenable to clinical diagnosis, clinical assessment is probably not war-
ranted. However, personality tests can be helpful in deepening both the professional
counselor's and the client's understanding of the client's personality and coping
mechanisms when under normal and stressful circumstances. Developing such an
understanding of thoughts, feelings, and behaviors provides a basis for clients to un-
derstand why they think, feel, and behave the way that they do. To facilitate this un-
derstanding, professional counselors often use assessment procedures such as
developmental interviewing, structured personality tests (e.g., the Myers-Briggs Type
Indicator [MBTI], the 16PF, the NEO-PI-R), or unstructured, projective tests
and techniques (e.g., House-Tree-Person, Incomplete Sentences, Thematic Apperception
Test). These instruments and the general topic of personality assessment are addressed
in more detail in Chapter 8.
Importantly, many psychological instruments can provide helpful information
to understand a client's clinical issues and personality functioning. So, while these
categories may seem mutually exclusive, tests and test items can be designed to pro-
vide information about both. For simplicity's sake, the authors of this chapter have
chosen to present these tests in the domain in which they are most commonly used
in clinical practice.
CAUTIONS WITHIN CLINICAL ASSESSMENT
In Chapter 1, the general purposes of assessment were outlined. The three purposes
most relevant to clinical assessment are diagnosis, treatment planning, and outcomes
assessment. Cohen and Swerdlik (1999, p. 482) indicated three primary questions
addressed by clinical assessment: (1) Does this person have a mental disorder, and if
so, what is the diagnosis? (2) What is the person's current level of functioning? (3)
What type of treatment shall this patient be offered? Erford (2006, p. 9) added an-
other: How effective were the implemented interventions?
Many professional counselors find it most efficient to use a combination of in-
terviewing and structured test administration to quickly and accurately diagnose
client concerns. That said, if all clients were totally self-aware, open, and forthright
in their responses, clinical assessment would be simple, and the text of this chapter
could move immediately to the sections on interviewing and structured inventories.
Unfortunately, clients present with varying levels of self-awareness, openness, and
forthrightness, and professional counselors must take great care to ensure that the
diagnostic and treatment decisions made about a client are based on accurate infor-
mation. Thus, professional counselors must be well aware of important bias issues
in both assessment and decision making (i.e., judgment).
Bias in clinical interviewing has been studied for years. Darley and Fazio (1980)
coined the term hypothesis confirmation bias to explain the observed phenomenon
in which interviewers develop hypotheses to explain the concerns being presented
by a client and then proceed to ask questions and elicit responses that confirm those
hypotheses. While on the surface this may sound like good, sound practice, Darley
and Fazio found that clinicians frequently confirmed incorrect hypotheses by inter-
preting ambiguous information as supportive of the hypothesis and discounting ev-
idence that did not support the hypothesis. Likewise, the term self-fulfilling
prophecy (Dipboye, 1982) has been used to describe the client's propensity to
change responses and behavior to conform to the expectations of the examiner.
Often, the client will actually change thoughts, feelings, or actions to align with the
perceived expectations of the interviewer. For example, assume a client with low anx-
iety responds that he or she feels anxious from time to time to a mild degree. If the
professional counselor pursues this issue with a line of questioning aimed at under-
standing the degree of anxiety involved, especially in the context of situations the
client may find otherwise troublesome (e.g., interpersonal or workplace relation-
ships), then the client may perceive and "admit" the anxiety to be more problematic
than first suspected. Thus, the client fulfills the perceived prophecy, even though it
may not be true. With these possible threats to the validity of interview results, pro-
fessional counselors in training may wonder why interviewing is so popular among
clinicians. Again, bias provides the answers. Arvey and Campion (1982) suggested
three reasons: (1) Interviews provide a depth of information and perspective that is
difficult to obtain using tests alone, (2) clinicians believe themselves to be unbiased,
ostensibly because they are good, helpful people, and (3) clinicians believe they are
objective and unbiased because they are highly trained and skilled. Note that the
final two reasons involve beliefs on the part of the clinician. No matter how well
210 Chapter 7
intentioned, any belief can be biased. After all, that is why it is called a belief and
not a truism, fact, or law. Every professional counselor must guard against interview
response bias. No one is immune.
Equally important, test results can also be biased and inaccurate. It is not hard
to understand that results will be inaccurate if someone responds dishonestly to
questions. But in actuality, many factors influence student and client responses to
items or questions and their subsequent scores on tests. Sometimes these factors may
be related to the test itself, while at other times to examiner or examinee variables.
Some clients or students may present themselves dishonestly, or lack self-awareness
to respond appropriately. Others may not trust the professional counselor for a vari-
ety of reasons, some of which have more to do with the client than the counselor.
Still others may respond inaccurately because of the way a question is phrased, or
the type of response choices required. Regardless of the cause, the result is problem-
atic. Inaccurate client responses lead to inaccurate scores, inferences, and interpreta-
tions (i.e., errors). Table 7.1 provides brief descriptions of a number of factors influ-
encing client responses and performances commonly encountered by professional
counselors in clinical practice. A more in-depth discussion of these issues can be
found in Erford (2006).
In the context of this discussion of clinical response accuracy, further expansion
of this list becomes necessary. In the early years of psychological testing (i.e.,
1920s-1930s), little concern was given to the accuracy of client responses to personality
or clinical questions. Many assumed that clients would respond honestly, and
while many clients did respond honestly, examiners quickly learned that not every-
one did. While honesty is a good thing, the present-day field of assessment has
evolved in such a way that many clients seek the services of professional counselors
for help with issues of great importance: child custody, criminal actions, disability
documentation, infidelity, and divorce, to name but a few. Likewise, client self-
awareness and the relationship between client and counselor can significantly influ-
ence the accuracy of client responses. Thus, to assume that all clients always respond
accurately is naive and dangerous. A professional counselor's judgment frequently
has personal, financial, and legal implications in such high-stakes decisions.
From the 1940s through the present day, developers of clinical, personality,
and behavioral inventories have expended a great deal of effort to construct validity
scales that can help identify client response styles. Identification of these response
modes can help professional counselors identify clients whose test protocols may be
invalid or should be interpreted with caution. Many tests provide validity scales, and
the names and functions of these scales vary widely. A good example of a present-
day clinical instrument with helpful validity scales is the 567-item, true-false
Minnesota Multiphasic Personality Inventory-Second Edition (MMPI-2). The MMPI-2
offers a number of helpful scales, including Cannot Say (?), VRIN, TRIN, F, L, K,
and S (Butcher et al., 2001).

Table 7.1 Factors that influence student and client test performance and item responses

Factor: Description

Motivation: Motivated clients provide accurate responses; unmotivated clients provide subpar
    performance, inaccurate, and/or dishonest responses. Client motivation is the most important
    performance factor.
Anxiety: High and low levels of anxiety lead to low levels of performance. Moderate levels of
    anxiety maximize performance. This is referred to as the Yerkes-Dodson law.
Coaching: Coaching is any procedure that gives a respondent an advantage. Coaching can involve
    anything from a simple review of the domain of information being assessed to instructions on
    giving specific responses to specific questions that will appear on a test. Suspicions of
    coaching should be followed up on by the examiner.
Test sophistication: Test sophistication refers to procedural advantages enjoyed by some test
    takers, but not others (e.g., experience filling in bubble response forms).
Acquiescence: The tendency to answer yes to yes/no questions and true to true/false questions
    when an examinee is unsure of the correct answer.
Response format: Clients with reading problems, writing problems, poor vision, or disabilities
    that make sitting difficult may become frustrated with a test requiring reading or constructed
    written responses. Allowances should be made for audio-taped administration and oral response
    procedures when possible.
Reactive effects: Clients may alter response styles and patterns in response to the interview or
    evaluation process (i.e., a series of questions about depressive symptoms could lead clients
    to perceive in themselves a greater degree of depression than previously considered).
Response bias: A client's response to a question influences responses to future questions (i.e.,
    students who select "False" three times in a row may be more likely to select "False" on the
    next item, even though they would have otherwise selected "True").
Physical or psychological condition: Clients sometimes present with visual or auditory acuity
    problems or psychological processing deficiencies (e.g., central auditory processing disorder).
    In addition, mental disorders can cause psychological conditions that detrimentally affect test
    performance, such as moderate to severe depression or anxiety, or other disorders that
    exacerbate mood or distractibility.
Social desirability: Some clients, consciously or unconsciously, may respond in a way that
    portrays themselves in a more favorable manner (i.e., faking good) and appear less significantly
    impaired than they really are. Others may portray themselves in a less favorable light (i.e.,
    faking bad) and appear more severely impaired than they really are.
Environmental variables: Some common individual-specific environmental effects include time of
    day, testing room, lighting, seating arrangements/comfort, noise, and interruptions. Each could
    affect a client's or student's motivation and performance, but the effects are so individualized
    that scientific generalizations are normally lacking. Following standardized procedures and
    minimizing environmental influences are primarily examiner responsibilities.
Cultural bias: Impressions, interpretations, and diagnoses can be influenced by the culture of the
    examinee and examiner. The professional counselor strives for multicultural competence to
    minimize biased conclusions.
Examiner-examinee variables: Some clients and professional counselors just seem to hit it off;
    others don't. Race, sex, culture, attractiveness, personality, and other variables may influence
    a client's performance, but scientific study indicates they seldom do.
Previous testing experiences: Positive or negative previous assessment experiences may lead to
    higher experiences or lower self-confidence, thus influencing motivation and performance. Also,
    some clients may remember content from a previous administration of a test and may have a
    "memory" advantage on intelligence and achievement tests.

Source: The Counselor's Guide to Clinical, Personality, and Behavioral Assessment by B. T. Erford
(Ed.) (2006). Boston: Lahaska Press/Houghton Mifflin.

While clients are encouraged to answer every one of the MMPI-2's 567 questions,
many do not. Because raw scores are summed and used to determine a client's
norm-referenced score, a client who does not complete a significant number of questions
may have deflated scores. This is because failing to answer a question is scored
in the nonkeyed (i.e., not clinically relevant) direction, as if to indicate that the client
does not have a problem. The Cannot Say (?) scale is simply a count of the items to
which no response was made. Generally if clients fail to respond to 30 or more items
(about 5%), the protocol may be judged invalid; if 11-29 questions are not answered,
caution is warranted because some subscales may be invalid. Several helpful
scales are termed "content-free," because the content of the scale is not important in
determining score validity. VRIN is the acronym for the Variable Response
Inconsistency scale, which measures a client's pattern of inconsistent responding to
pairs of items nearly identical in content. The VRIN raw score indicates the num-
ber of inconsistent client responses. Inconsistent responding may mean the client is
not paying attention, not taking the task seriously, or doesn't comprehend the item
meanings. TRIN is the acronym for the True Response Inconsistency scale, which
measures a client's pattern of inconsistent responding to pairs of items of opposite
content. The TRIN raw score indicates the degree of client response inconsistency
due to "yea-saying" (acquiescence) or "nay-saying" (nonacquiescence). T scores of
80+ on the VRIN or TRIN scales indicate the protocol is invalid.
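The rules of thumb just described for Cannot Say, VRIN, and TRIN can be written as a simple screening routine. The following is only a minimal sketch of those stated thresholds; the class name ValidityIndicators and the function screen_protocol are hypothetical and are not part of any MMPI-2 scoring software. The test manual remains the authority for actual interpretation.

```python
from dataclasses import dataclass


@dataclass
class ValidityIndicators:
    """Hypothetical container for three MMPI-2 validity indicators."""
    cannot_say: int  # raw count of items left unanswered
    vrin_t: float    # VRIN T score
    trin_t: float    # TRIN T score


def screen_protocol(ind: ValidityIndicators) -> tuple[str, list[str]]:
    """Apply the rules of thumb stated in the text.

    Returns a status ('ok', 'caution', or 'invalid') and the reasons for it.
    """
    status = "ok"
    reasons: list[str] = []

    # Cannot Say: 30 or more omissions suggests an invalid protocol;
    # 11-29 omissions warrants caution.
    if ind.cannot_say >= 30:
        status = "invalid"
        reasons.append("30 or more unanswered items")
    elif ind.cannot_say >= 11:
        status = "caution"
        reasons.append("11-29 unanswered items")

    # VRIN and TRIN: T scores of 80 or above indicate an invalid protocol.
    if ind.vrin_t >= 80:
        status = "invalid"
        reasons.append("VRIN T score of 80 or above")
    if ind.trin_t >= 80:
        status = "invalid"
        reasons.append("TRIN T score of 80 or above")

    return status, reasons


if __name__ == "__main__":
    print(screen_protocol(ValidityIndicators(cannot_say=12, vrin_t=55, trin_t=60)))
```

A routine like this only flags protocols for closer review; it does not replace the professional counselor's reading of the full profile.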
Other validity scales are content-specific. The Infrequency scale (F) is a measure
alerting clinicians to unusual patterns of answers. These 60 items were selected because
they were infrequently endorsed by members of the original MMPI norm sample.
Clients with a high score on the F scale (T = 100+) generally are random responders
(i.e., paying no attention to items and just coloring in bubbles) or fixed
responders (i.e., mostly all true or mostly all false), or are "faking bad" by deliberately
trying to portray themselves in a negative light. Of course, the professional coun-
selor must rule out whether the client may also be accurately portraying severe
pathology. The MMPI-2 also has FB (Back F) and Fp (Infrequency-Psychopathology)
scales. The FB scale is an infrequency-of-response scale for the latter part of the test
and, when compared to the total F score, helps determine whether clients changed
their response approach during the administration (e.g., the client got bored and
began to respond randomly, or to overreport symptoms). The Fp scale is interpreted
in conjunction with VRIN and TRIN scales to determine whether a client may be
responding randomly, "faking bad," or exaggerating pathological symptoms.
The L scale was originally developed to assess the existence of a defensive mind-
set by allowing clients to deny the existence of minor faults and flaws that most oth-
ers readily admitted. While it may indicate deceit in test taking, the L scale is fre-
quently used in conjunction with TRIN to determine the presence of "faking good"
and nonacquiescence (i.e., nay-saying) when responding. The K scale was originally
developed to measure client response defensiveness so as to correct for this response
style on the clinical scales. It was believed that if a clinician knew that a client was
responding in such a way that would invalidate the protocol, corrections could be
made to the clinical scales so as to still derive meaningful results from them.
Interestingly, some researchers (McCrae & Costa, 1989) have shown that uncorrected
scale scores have higher validity than K-corrected scores. On other tests, researchers
have also demonstrated this to be the case (Hsu, 1986; Kozma & Stones,
1987; McCrae & Costa, 1983). The S scale (Superlative Self-Presentation) was empirically
derived by Butcher & Han (1995) by identifying items that were helpful in
discriminating between defensive and normal job applicants and norm sample participants.
Similar to the K scale, the S scale may also be helpful in determining
whether clients are presenting themselves in a socially desirable or nonacquiescent
manner.
Other clinical and personality tests have various validity scales under different
names designed to assess client response patterns for varying purposes. And the use
of validity scales is on the rise, no doubt due to clinician desire to more accurately
identify invalid response protocols and better technology for developing such scales.
Professional counselors are advised to seek specialized training and to read the man-
uals of instruments using these scales in order to fully understand how this technol-
ogy can be harnessed to enhance scale interpretation, and to understand the impli-
cations of elevated scores.
CLINICAL JUDGMENT VERSUS STATISTICAL MODELS
Professional counselors must be wary of bias not only from clients and other informa-
tion sources, but also from within themselves. Most professional counselors have great
faith in their own clinical judgment; after all, professional counselors spend years in
education and clinical preparation to practice their craft. They have successes and set-
backs but are constantly improving and honing their skills to the point of competent
practice. It is easy to assume that such a rigorous program of study and practice under
supervision will remove bias and sharpen the professional counselor's clinical objectiv-
ity. Unfortunately, such is not always the case. Regardless of how well educated, well
trained, and well practiced one becomes, a professional counselor is only as perfect as
the information obtained and interpreted and the decision-making model employed.
Hopefully, the information presented in the preceding chapters has given readers an
appreciation for the imperfection of the information they will encounter, the interpre-
tive strategies they will employ, and the accuracy rates of various decision-making
models. Errors will always be with us. However, there are ways that a professional
counselor can increase the likelihood of more accurate decision making.
Much has been written about the efficacy of decision-making models employing
clinical judgment versus statistical models — and the evidence is that the statistical
models are at least as accurate as, and usually superior to, clinical judgment (Dawes,
1971; Dawes & Corrigan, 1974; Goldberg, 1970; Meehl, 1954, 1957, 1965).
Intuitively, this makes sense, because statistical models are based on probabilities that
can be empirically replicated, studied, and often improved on. Clinical judgment is
individual-specific, so what makes sense to one professional counselor may not make
sense to another and may not be easily replicated by another. As with all things related
to measurement, reliability (i.e., replicability) sets the upper bound for validity. So if
clinicians cannot replicate a decision model efficiently, the validity of results will be
lowered. Such is the advantage of a statistical decision-making model: it is easily
understood and replicable, and therefore may produce more accurate decisions, although
it may not always be presumed to do so. Betting against a statistical model is similar to
betting against the house in a game of chance. Sometimes you will win, but the odds are
always against you; skill and knowledge help sometimes, but not most of the time.
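The claim that reliability sets the upper bound for validity can be made concrete. The passage above gives no formula; in classical test theory the bound is usually stated as the square root of the product of the test's reliability and the criterion's reliability. The sketch below assumes that standard result, and the function name max_validity is hypothetical.

```python
import math


def max_validity(test_reliability: float, criterion_reliability: float = 1.0) -> float:
    """Upper bound on a validity coefficient under classical test theory.

    Assumes the standard result that the correlation between a test and a
    criterion cannot exceed sqrt(test reliability * criterion reliability).
    """
    for r in (test_reliability, criterion_reliability):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability coefficients must lie between 0 and 1")
    return math.sqrt(test_reliability * criterion_reliability)


if __name__ == "__main__":
    # A decision procedure that two clinicians reproduce with a reliability of
    # .49 cannot correlate with a perfectly measured criterion above about .70.
    print(round(max_validity(0.49), 2))
```

Under this assumption, a decision procedure that cannot be replicated consistently is capped in how valid its results can be, no matter how skilled the individual clinician.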
Statistical models in clinical decision making often rely on the use of cutoff
scores that are empirically validated. Professional counselors are wise to consider the
implications of "betting against" the statistical model. Experienced clinicians know
the value of multiple sources of information from multiple respondents. When clin-
ical judgment disagrees with the statistical model, the experienced clinician usually
realizes that it is best to collect more information to arrive at a more reasoned deci-
sion that one can endorse with greater confidence.
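As an illustration of the reasoning above, a cutoff-based statistical decision and a clinician's judgment can be compared, with any disagreement treated as a signal to gather more information rather than as a contest to be won. This is a minimal sketch; the function names, the example cutoff, and the assumption that scores at or above the cutoff indicate a concern are all hypothetical.

```python
def model_flags_concern(score: float, cutoff: float) -> bool:
    """Statistical decision: True when the score meets or exceeds the cutoff.

    The direction of the cutoff is an assumption; some instruments flag
    scores at or below a cutoff instead.
    """
    return score >= cutoff


def reconcile(score: float, cutoff: float, clinician_flags_concern: bool) -> str:
    """Compare the statistical decision with clinical judgment."""
    model = model_flags_concern(score, cutoff)
    if model == clinician_flags_concern:
        return "agree: concern indicated" if model else "agree: no concern indicated"
    # Disagreement: follow the text's advice and collect more information
    # from multiple sources and respondents before deciding.
    return "disagree: collect more information"


if __name__ == "__main__":
    print(reconcile(score=68, cutoff=65, clinician_flags_concern=False))
```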
Think About It 7.1 In your practice as a professional counselor, you will
encounter situations in which a decision using your "statistical model" does
not agree with your "clinical judgment." How will you reconcile this conflict
to arrive at the best decision for your client?
CLINICAL INTERVIEWING
There are several essential components to an effective interview. First, establishing
rapport is crucial. A professional counselor must relate a sense of mutual understand-
ing, confidence, respect, and acceptance in order to facilitate effective rapport
(Sattler, 2002). Establishing rapport is especially important in the initial interview to
help clients feel comfortable enough to openly discuss their reasons for coming to
counseling. Second, an interviewer needs to have effective facilitative skills. Effective
interviews (a) identify client problems clearly; (b) obtain necessary information re-
lated to the problems (e.g., antecedent, consequence); (c) assess client functioning,
intellectual level, and psychosocial development; and (d) examine the effects of an
intervention during and after the intervention. As Kratochwill, Sheridan, Carlson,
and Lasecki (1999) posited, eliciting useful information largely depends on the in-
terviewer's ability to strategically use questions and statements.
Three Types of Interviews:
Unstructured, Semi-Structured, and Structured
Depending on the purpose of the interview, a professional counselor should choose
an appropriate interview level from among the following: (1) structured, (2) semi-
structured, and (3) unstructured. The structured interview has established question
formats and is often used to assess or diagnose disorders. Generally, structured inter-
views are shown to yield more reliable results because they are able to be more accu-
rately replicated by others and are less subject to a clinician's biases. It is unclear
whether structured interviews yield more valid results (McReynolds, 1989). In a
structured interview, every professional counselor asks the same set of questions in
the same order, regardless of the examinee. Some structured interview formats are
purposely broad in scope and function, others narrow. Erford provided an example
of a structured clinical interview of a narrow focus (see Erford, 2006, Appendix C:
Attention-Deficit/Hyperactivity Disorder [AD/HD] Brief Clinical Parent Interview
[ABCPI]). Examples of published broad-spectrum structured interviews are provided
in Table 7.2.

Table 7.2 Published structured interviews
CIDI-Core  Composite International Diagnostic Interview: Authorized Core Version 1.0 (World Health Organization, 1993)
DIS  Diagnostic Interview Schedule (National Institute of Mental Health, 1990)
DICA-R  Diagnostic Interview for Children and Adolescents 8.0 (Reich, 1996)
DISC-IV  Diagnostic Interview Schedule for Children (Shaffer, 1996)
CAPA  Child and Adolescent Psychiatric Assessment Version 4.2, Child Version (Angold, Cox, Pendergast, Rutter, & Simonoff, 1996)
CAS  Child Adolescent Schedule (Hodges, 1997)
K-SADS-IVR  Schedule for Affective Disorders & Schizophrenia for School-Age Children (Ambrosini & Dixon, 1996)
K-SADS-PL  Revised Schedule for Affective Disorders & Schizophrenia for School-Age Children: Present and Lifetime Version (Kaufman, Birmaher, Brent, Rao, & Ryan, 1996)
K-SADS-E5  Schedule for Affective Disorders & Schizophrenia for School-Age Children, Epidemiological Version 5 (Orvaschel, 1995)

The semi-structured interview may also have a specific question format that is
used to assess specific mental health issues or psychological disorders. However, in
contrast to the structured interview, a professional counselor can modify questions
or change the order in which the questions are asked depending on a client's level of
functioning (e.g., verbal or intellectual level) or other situational requirements
(Sattler, 2002). Erford provided a good example of a semi-structured interview (see
Erford, 2006; Appendix D: Semi-Structured Mental Status Examination Interview
Protocol). Some other published semi-structured interviews are provided in Table 7.3.
Finally, the unstructured interview has no standardized question format. An interviewer
chooses questions depending on the client and situation. In order to conduct
an effective unstructured interview for clinical diagnostic purposes, a professional
counselor should have advanced assessment training and be able to elicit the
client's concerns through appropriate questions. The skilled professional counselor
can use the unstructured interview as an effective tool to establish rapport and to
elicit concerns freely during an intake interview. Regardless of the type of interview
employed, it is crucial to establish rapport and to elicit necessary information
through effective verbal communication. Like any other facet of effective counseling,
facilitative skills are an essential component.

Table 7.3 Published semi-structured interviews
SCID-CV  Structured Clinical Interview for Axis I DSM-IV Disorders (First, Spitzer, et al., 1997)
SCID-II  Structured Clinical Interview for Axis II DSM-IV Disorders (First, Gibbon, et al., 1997)
PRISM  Psychiatric Research Interview for Substance and Mental Disorders (Hassin et al., 1996)
SCICA  Structured Clinical Interview for Children and Adolescents (McConaughy & Achenbach, 1994)

The Intake Interview
The purpose of the intake interview is to collect relevant information about a client's
history and background in order to quickly ascertain the effects past events may have
on the client's current situation. Previous history often helps professional counselors
to provide a context for current struggles, determine the longevity of symptoms, and
tailor treatment interventions to the client's specific context. For example, a client
presenting with a five-year history of substantial symptoms of anxiety likely will re-
quire a different diagnostic and treatment approach than someone who has devel-
oped substantial symptoms only during the past month.
The major advantage of a structured intake interview is that it can be completed
by a client prior to the first session. Then the professional counselor can peruse the
client's responses and follow up with any details or questions concerning original
client responses. This saves a great deal of time. Of course, the professional coun-
selor should verify client responses and expand on them as necessary, because clients
sometimes misunderstand the intent of a given question, or are hesitant to provide
full disclosure; that is, some clients understandably reveal more in a person-to-per-
son interview than on a piece of paper. Erford (2006) developed a comprehensive
eight-page structured Client History and Background intake form that professional
counselors will find useful. Erford (p. 8) also specified the eight key areas that make
up a comprehensive intake interview:
1. Demographic information: name, age, sex, marital status, race or ethnicity, religion,
socioeconomic status, occupation, and languages spoken.
2. Referral reasons: symptoms or complaints, including whether the complaint is
likely to end up as a legal issue.
3. Current situation: severity of the referral complaints; resiliency factors, such as
client strengths and important support figures. This area also includes changes in
functioning as a result of the referral concern.
4. Previous assessments and counseling experiences: what led to initiation of previous
services, what interventions were attempted, and any outcomes of such interventions.
It is also important to determine previously offered diagnoses and medications
taken to address mental and emotional issues.
5. Birth and developmental history: circumstances of birth and delivery, timing of
early developmental milestones, or difficulties encountered during development.
6. Family history: composition of family of origin and current family; any educa-
tional, medical, or psychological difficulties family members may display or have
displayed in the past.
7. Medical history: major injuries, surgeries, conditions or illnesses, and medications
currently taken. This area also includes the client's current medical status.
8. Educational and work background: highest education completed, learning difficulties
encountered, special services received, work history, and current work setting
and satisfaction.
Think About It 7.2 Describe the importance of a thorough intake interview.
How could your ability to establish rapport and use facilitative skills influence
the intake interview, the initial session, and future counseling sessions?
Mental Status Exam
A special application of clinical interviewing that professional counselors should become
proficient in is called the mental status exam (MSE). The MSE is to mental
health practitioners what the general physical examination is to medical practition-
ers. The MSE is a quick screening of a client's intellectual, emotional, and neurolog-
ical functioning. In general, the MSE is a brief summary narrative of client general
mental function and is usually conducted during the first interview. MSEs are fre-
quently required by third-party payers (i.e., insurance companies), and the level of
detail required varies substantially. Erford (2006), in a detailed discussion that in-
cluded a sample Semi-Structured MSE Interview Protocol, reported that a comprehen-
sive MSE should assess the following six areas:
1. Appearance, attitude, and behavior: manner of dress, cleanliness, appearance,
demographic information, occupation, physical characteristics, health, size, hear-
ing, vision, eye contact, attitude toward examiner, attitude toward interview,
motor functioning, behavior exhibited.
2. Cognitive capabilities: knowledge of name, location, time, day, date; long- and
short-term memory; serial 7s; spelling a word backwards; math problem solving;
digit span; sentence memory; level of consciousness; concentration; capacity for
abstract reasoning; demonstration of reading, math and writing tasks; cognitive
functioning.
3. Speech and language: description of speech capability; description of language ca-
pability; repetition of phrases; read a short passage; write a short passage.
4. Thought content and process: description of thought processes; description of
thought content; fears or phobias.
5. Emotional status: presenting mood, intensity, duration, fluctuations; description
of affect, intensity, range, variability; modulation and appropriateness of affect;
personality characteristics; emotional, physical, or behavioral problems.
6. Insight and judgment: description of insight and judgment; responses to judg-
ment questions; decision making regarding presenting problem, past and future
events; defense mechanisms.
Erford (2006, pp. 172-173) provided an example mental status exam:
Matthew was appropriately dressed in jeans and a T-shirt. He appeared clean,
well-groomed, and relaxed. He is a 15-year-old, English-speaking, White,
9th-grade male with normal physical features and no sign of handicaps, scars,
or other signs of self-mutilation. He is approximately 5' 8", 150 pounds, and
his hearing and vision are normal. Matthew maintained appropriate eye con-
tact and was cooperative and open throughout the evaluation. His motor func-
tioning was basically normal, although he did frequently "bounce his knee"
and adjust his posture indicating signs of overactivity. He demonstrated poor
fine-motor coordination during writing tasks and finger-touching activities.
He did not display aggressive, irritable, anxious, or otherwise abnormal behav-
ior throughout the evaluation.
Cognitively, Matthew was oriented x 5 and was able to answer basic infor-
mation questions, including the current and former president, capital of
Maryland, serial 7s, and simple math problems. His short-term memory and de-
layed recall was appropriate for three objects, as was his dichotic and verbal re-
tention. His consciousness was normal. Dysgraphia was evident and should be
ruled out through diagnostic evaluation. He was somewhat distractible in the
one-to-one situation, but his cognitive functioning was otherwise normal.
Matthew's speech and language capabilities were normal in all regards. His
thought processes were clear, appropriate, and logical, and his thought content
was normal, devoid of phobic, obsessive, or psychotic process. Matthew's
mood was observed to be friendly, pleasant, and calm, with normal intensity and
little fluctuation. His affect was appropriate as he was able to modulate an ap-
propriate affective range and intensity, even when discussing emotional content.
He admitted being oppositional and appeared ambiverted. Matthew did not re-
port significant emotional, physical, or behavioral problems.
Finally, Matthew's insight and judgment appeared normal, appropriate, and
realistic. He was able to clearly describe his decision-making processes and an-
swer questions requiring judgment. Matthew acknowledged the problems re-
ported by parents and teachers, willingly consented to this evaluation, and was
willing to "do whatever it takes" to address the issues.
The mental status exam can be administered either through an unstructured,
semi-structured, or structured interview format and, of course, relies heavily on ob-
servation of attitudes, behaviors, and appearance. Use of an unstructured format re-
quires a great deal of experience with the content and format of the mental status
exam and basically involves asking pertinent questions from the categories specified
above. As with any unstructured interview, the questions will vary from client to
client and occur in no particular order, maximizing the clinician's flexibility and
adaptability to the conditions and client responses.
An example of a comprehensive semi-structured presentation of a mental status
examination has been mentioned earlier and can be found in Erford (2006). An ex-
ample of a quicker, far less comprehensive mental status exam in popular use is the
Mini-Mental State Examination (MMSE). The MMSE is a brief, structured inter-
view used to assess only the cognitive mental state (Folstein, Folstein, McHugh, &
Fanjiang, 2001). The MMSE has 11 categories and takes 5 to 10 minutes to administer.
An examiner asks questions or gives instructions, and an examinee responds
one by one. For example, an examinee needs to (a) answer questions regarding time
and place; (b) repeat, memorize, or recall some words; (c) briefly calculate simple
math problems; (d) manipulate a piece of paper according to directions; and (e) copy
a design. Summing each score (0 or 1) yields a total score, whose maximum is 30.
Though the authors of the MMSE recommend using a total score of 26 as a cutoff
score, a frequently used cutoff score is 23. A total score of 23 or below indicates the
likelihood of cognitive impairment and the necessity of further evaluation (Folstein
et al., 2001). The MMSE has been shown to produce reliable and valid scores when
screening for cognitive impairment. An example of a structured mental status exam
in common use is the Standardized Mini-Mental State Examination (SMMSE) (Molloy,
Alemayehu, & Roberts, 1991) (see Figure 7.1). Essentially, Molloy et al. took the
MMSE and structured its administration to increase the administrative efficiency
and enhance the interrater and internal consistency reliability of scores.
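The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustration only, assuming the 30 scored points have already been recorded as 0 or 1. The function name `mmse_total` and the `cutoff` parameter are hypothetical, and the default of 23 is simply the frequently used cutoff mentioned in the text, not a clinical recommendation.

```python
def mmse_total(item_scores, cutoff=23):
    """Sum 0/1 point scores and flag totals at or below the cutoff.

    Illustrative sketch: 30 scored points, each 0 or 1, as described in the text.
    """
    if len(item_scores) != 30:
        raise ValueError("expected 30 scored points")
    if any(score not in (0, 1) for score in item_scores):
        raise ValueError("each point must be scored 0 or 1")
    total = sum(item_scores)
    # A total at or below the cutoff suggests that further evaluation is needed.
    return total, total <= cutoff


total, needs_follow_up = mmse_total([1] * 22 + [0] * 8)
print(total, needs_follow_up)  # 22 True
```

Passing `cutoff=26` would apply the same at-or-below rule to the value recommended by the MMSE authors; treating that cutoff as "at or below" is an assumption of this sketch.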
Strengths and Limitations of Interviewing
A clinical or behavioral interview allows the professional counselor great latitude in
how to collect important information from clients and other stakeholders (e.g., par-
ents, teachers, spouses). A lot of important information can be collected quickly and
efficiently. However, it is good practice to validate this information and client per-
ceptions against other information sources. Aside from the important demographic
and historical information derived from an interview, the important point of conducting
the interview is to generate and validate hypotheses, arrive at an understanding
or diagnosis of the client's presenting concerns, and develop a plan of treatment
or intervention to help ameliorate the client's concerns. The interview allows for in-
depth analysis of issues, flexibility in how the information is garnered, and instanta-
neous clarification of ambiguous information. The interview also provides the pro-
fessional counselor with valuable insight into what has been tried previously to
ameliorate the client's condition, how motivated the client is to enact proposed treat-
ment strategies, and resources that the client can draw upon to effect necessary
changes (Erford, 2006).
But interviewing is not without limitations. Interview responses frequently pos-
sess lower levels of reliability and validity than more standardized inventories, al-
though structured interviews frequently rival their counterpart inventories.
Unstructured interviews are particularly problematic in this regard because of very
low interrater reliability. Professional counselors using unstructured clinical inter-
views frequently derive very different information from the interview and arrive at
very different conclusions. More specifically, clinician bias often determines which
questions are asked, what client responses are clarified and explored in depth, and
what diagnosis or conclusion is arrived at.
The clinical or behavioral interview can be an important aspect of assessing
client problems and needs. Professional counselors must use caution when interpret-
ing interview data, just as when interpreting the results of objective tests or projec-
tive measures. The key to competent assessment and diagnosis is using multiple
measures from multiple respondents, resulting in convergence of information. When
unsure, it is always advisable to collect more information. A client deserves no less.
Figure 7.1 Standardized Mini-Mental State Examination (SMMSE)
I am going to ask you some questions and give you some problems to solve. Please try to answer as best as you can.
Max Score
1. (Allow 10 seconds for each reply)
a) What year is this? (accept exact answer only) 1
b) What season is this? (during last week of the old season or first week of a new season, accept 1
either season)
c) What month of the year is this? (on the first day of new month, or last day of the previous month, 1
accept either)
d) What is today's date? (accept previous or next date, e.g., on the 7th accept the 6th or 8th) 1
e) What day of the week is this? (accept exact answer only) 1
2. (Allow 10 seconds for each reply)
a) What country are we in? (accept exact answer only) 1
b) What province/state/county are we in? (accept exact answer only) 1
c) What city/ town are we in? (accept exact answer only) 1
d) (In clinic) What is the name of this hospital/building? (accept exact name of hospital or institution only) 1
(In home) What is the street address of this house? (accept street name and house number or
equivalent in rural areas)
e) (In clinic) What floor of the building are we on? (accept exact answer only) 1
(In home) What room are we in?
3. I am going to name 3 objects. After I have said all three objects, I want you to repeat them. 3
Remember what they are because I am going to ask you to name them again in a few minutes.
(say them slowly at approximately 1 second intervals)
Ball Car Man
For repeated use:
Bell Jar Fan
Bill Tar Can
Bull War Pan
Please repeat the 3 items for me. (score 1 point for each correct reply on the first attempt) Allow 20 seconds
for reply; if subject did not repeat all 3, repeat until they are learned or up to a maximum of 5 times
4. Spell the word WORLD. (You may help the subject to spell world correctly.) Say: now spell it 5
backwards please. Allow 30 seconds to spell backwards. (If the subject cannot spell world even
with assistance, score 0.)
5. Now what were the 3 objects that I asked you to remember? 3
Ball Car Man
Score 1 point for each correct response regardless of order, allow 10 seconds.
6. Show wristwatch. Ask: what is this called? Score 1 point for correct response. Accept "wristwatch" or 1
"watch". Do not accept "clock", "time", etc. (allow 10 seconds).
7. Show pencil. Ask: what is this called? Score 1 point for correct response, accept pencil only; 1
score 0 for pen.
8. I'd like you to repeat a phrase after me: "no if's, and's, or but's." (Allow 10 seconds for response. 1
Score 1 point for a correct repetition. Must be exact, e.g., no if's or but's: score 0)
9. Read the words on this page and then do what it says: Hand subject the laminated sheet with CLOSE 1
YOUR EYES on it.
CLOSE YOUR EYES.
If subject just reads and does not then close eyes — you may repeat: read the words on this page and then
do what it says to a maximum of 3 times. Allow 10 seconds, score 1 point only if subject closes eyes.
Subject does not have to read aloud.
10. Ask if the subject is right or left handed. Alternate right/left hand in statement, e.g., if the subject is 3
right handed, say Take this paper in your left hand . . . Take a piece of paper — hold it up in front of
subject and say the following:
"Take this paper in your right/left hand, fold the paper in half once with both hands, and put the
paper down on the floor."
Takes paper in correct hand
Folds it in half
Puts it on the floor
Allow 30 seconds. Score 1 point for each instruction correctly executed.
11. Hand subject a pencil and paper. Write any complete sentence on that piece of paper. 1
Allow 30 seconds. Score 1 point. The sentence should make sense. Ignore spelling errors.
12. Place design, pencil, eraser and paper in front of the subject. Say: copy this design please. Allow 1
multiple tries until patient is finished and hands it back. Score 1 point for correctly copied diagram.
The subject must have drawn a 4-sided figure between two 5-sided figures. Maximum time — 1 minute.
Total Test Score 30
Source: From D. W. Molloy, E. Alemayehu, and R. Roberts, "Reliability of a Standardized Mini-Mental State Examination compared with the
traditional mini-mental examination." American Journal of Psychiatry, January 1991; 148, 102-105. Copyright © 1991 American Psychiatric
Association.
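The item maxima in Figure 7.1 can be checked with a short Python sketch. The maximum values are transcribed from the figure; the item labels are shorthand invented here, and `smmse_total` is a hypothetical helper, not part of the published instrument.

```python
# Maximum points per SMMSE item, transcribed from Figure 7.1.
# The labels are shorthand for this sketch, not the instrument's wording.
SMMSE_MAX = {
    "1 time orientation": 5,
    "2 place orientation": 5,
    "3 repeat three objects": 3,
    "4 spell WORLD backwards": 5,
    "5 recall three objects": 3,
    "6 name wristwatch": 1,
    "7 name pencil": 1,
    "8 repeat phrase": 1,
    "9 read and obey CLOSE YOUR EYES": 1,
    "10 three-step paper command": 3,
    "11 write a sentence": 1,
    "12 copy design": 1,
}


def smmse_total(scores):
    """Validate each item score against its maximum and return the total."""
    total = 0
    for item, maximum in SMMSE_MAX.items():
        score = scores.get(item, 0)  # assumption: a missing item counts as 0
        if not 0 <= score <= maximum:
            raise ValueError(f"{item}: score {score} is outside 0..{maximum}")
        total += score
    return total


# The item maxima add up to the Total Test Score shown in the figure.
assert sum(SMMSE_MAX.values()) == 30
```

Keeping the maxima in one table makes it easy to confirm that the 12 items add up to the total of 30 and to catch an item score entered above its allowed maximum.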
COUNSELING, DIAGNOSIS, AND THE DSM-IV-TR
The roots and tradition of counseling lie in vocational guidance and human devel-
opment (Herr, 1998). However, recent societal and mental health practices have
given rise to a mental health role for professional counselors regardless of work set-
ting. Mental health counselors, substance abuse counselors, marriage and family
counselors, geriatric counselors, and community counselors provide mental health
counseling in clinics, agencies, and private practice in numerous states around the
country — and in numerous countries around the world. Even professional school
counselors and career counselors, two professions that have maintained the closest
ties to counseling's developmental roots and that seldom view clinical diagnosis as a
part of their job functions, provide treatment to clients or students who have been
(or could be) diagnosed with mental or emotional disorders.
Mental and emotional disorders are becoming more prevalent in society, partic-
ularly among children and adolescents, and professional counselors must be knowl-
edgeable about diagnosis and clinical assessment in order to gain respect and parity
in the mental health community. A review of the extant literature finds numerous ex-
amples of increased need for clinical diagnostic and treatment services, a need that
contemporary professional counselors are helping to meet. In any given year, serious
mental illness can be diagnosed in about 5-7% of an adult population (New
Freedom Commission on Mental Health, 2003). Diagnosable mental and emotional
disorders significant enough to warrant treatment can be found in 15-22% of
school-aged students (SAMHSA, 1998), but only about one in five of these impaired
students actually gets help. Clients with serious mental health concerns seeking help
at university counseling centers are increasing (Pledge, Lapan, Heppner, Kivlighan &
Roehlke, 1998). Substance abuse, poverty, and community and domestic violence
are on the rise (Dryfoos, 1994; Lockhart & Keys, 1998). Various estimates of de-
pression among adolescents include 3 to 6 million students (American Psychiatric
Association, 1994) or nearly 18% (Essau, Condradt, & Peterman, 2000). On a re-
lated note, 10,000 to 20,000 adolescents attempt suicide, while more than 2,000
adolescents commit suicide annually (Brown, 1996). This makes suicide the second
leading cause of death among adolescents. Diagnosis of childhood disorders requires
a great deal of improvement, as certain common disorders (e.g., AD/HD) appear to
be overdiagnosed in childhood (McClure, Kubiszyn, & Kaslow, 2002), quite a feat
given that community prevalence estimates indicate that perhaps 50% of children
and adolescents referred to mental health clinics can be diagnosed with behavior dis-
orders, including Conduct Disorder and AD/HD (Erk, 1995).
While the above statistics paint a picture of a tremendous societal need for clin-
ical services, they also underscore the necessity of high-level training in diagnosis and
treatment of mental and emotional disorders. Nearly all clinical decisions, whether
diagnostic or treatment related, are predicated on informal or formal assessment procedures.
Thus the more one consciously integrates assessment procedures and outcomes
research into one's practice, the more objective and informed one's practice
becomes. The mental health role of the professional counselor is here to stay; diag-
nosis and use of the DSM is becoming a necessary part of training for all clinicians
(Seligman, 1998), just as the International Classification of Diseases — Tenth Revision
(ICD-10) is used in the health professions.
The usefulness of diagnostic systems is widely debated (see Murphy and
Davidshofer, 2001). The fact of the matter is that insurance companies and employ-
ers are requiring competence in diagnosis as a condition for payment or employ-
ment, and state licensing agencies are increasingly requiring coursework and train-
ing in clinical diagnosis to obtain licensure (Hohensil, 1993; 1996). In the mental
health arena, the diagnostic resource most commonly used by psychiatrists, psychol-
ogists, social workers, and professional counselors is the Diagnostic and Statistical
Manual of Mental Disorders — Fourth Edition — Text Revision (DSM-IV-TR) (APA,
2000). In fact, a recent survey found that 91% of mental health counselors used the
DSM (Mead, Hohensil, & Singh, 1997).
The DSM-IV-TR provides specific criteria through which reliable diagnoses can
be made. It also provides a nomenclature, or common language, through which men-
tal health professionals can communicate with each other to describe (not label) a
client's condition. Such diagnostic language has the purpose of succinctly communi-
cating categorical mental conditions so that common symptoms may be indicated and
commonly agreed-upon treatments may ensue. Such a categorical reference is necessary
to help organize the diagnostic and treatment outcome literature. For example,
to move a field forward, it is essential for all clinicians, educators, and researchers to
know exactly what is meant by the term Major Depressive Disorder so that all resources
aimed at understanding the identification, treatment alternatives, and treatment
outcomes of this disorder can be focused most efficiently. The DSM-IV-TR provides
this common language. Even if some professional counselors (e.g., professional
school counselors and career counselors) do not make diagnoses in their work settings,
understanding what, for example, Major Depressive Disorder entails is essential for
proper assessment, referral, and facilitation or coordination of treatment. For exam-
ple, would a professional school counselor interviewing the mother of a 7-year-old
who complains of her son's problems with disobedience, defiance, and negativity be
serving the best interest of the student or family if he or she were unfamiliar with the
term Oppositional Defiant Disorder (ODD)? An awareness of the diagnostic criteria
for ODD would streamline the assessment process and allow for efficient referral or
treatment. A working knowledge of the DSM-IV-TR makes any professional counselor
more efficient and valuable. While there is no substitute for a careful perusal of
the DSM-IV-TR, the remainder of this chapter briefly reviews the multiaxial assessment
system of the DSM-IV-TR, major diagnostic categories, and several instruments
that are particularly helpful in the clinical assessment process.
Using the DSM-IV-TR: Multiaxial Diagnosis
The DSM-IV-TR (APA, 2000) is the latest in a series of diagnostic resource guides.
The DSM-IV-TR is a text revision of the DSM-IV (APA, 1994), with editorial
changes primarily to the information supplied in the text, rather than to the diagnos-
tic criteria sets for the specified disorders. The DSM-IV-TR describes nearly 300 di-
agnostic categories that enable mental health professionals to diagnose, treat, re-
search, and efficiently discuss mental and emotional disorders.
The diagnostic process calls for a multiaxial classification system to describe the
condition of the client. Five axes, or different facets, are included:
■ Axis I — Clinical disorders and other conditions that may be a focus of clinical
attention
■ Axis II — Personality disorders and mental retardation
■ Axis III — General medical conditions
■ Axis IV — Psychosocial and environmental problems
■ Axis V — Global assessment of functioning
The systematic multiaxial approach provides a shorthand notation of a compre-
hensive process, conveying a tremendous amount of information about the current
mental status of a client, including mental disorders, concurrent medical issues, and
adaptive functioning. APA (2000, p. xxxi) defines a mental disorder as a
clinically significant behavior or psychological syndrome or pattern that occurs
in an individual and that is associated with present distress (e.g., a painful symp-
tom) or disability (i.e., impairment in one or more areas of functioning) or with
a significantly increased risk of suffering death, pain, disability, or an important
loss of freedom.
Axes I and II include the mental disorders that make up the classification sys-
tem. Axis II includes personality disorders and mental retardation, while Axis I is
used to document the existence of all other mental disorders. The behavioral effects
of physical and medical disorders are listed on Axis III. The listing of occupational,
familial, financial, legal, and other social and emotional effects is noted on Axis IV.
And the professional counselor's assessment of how well the client is, or has been,
adapting to the stresses of everyday life is recorded on Axis V.
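One way to picture the five-axis shorthand is as a simple record with one field per axis. The Python sketch below is only an illustration: the class name `MultiaxialDiagnosis`, its field names, the "None" placeholder for an empty axis, and the sample entries are all hypothetical and are not drawn from the DSM-IV-TR or from an actual client.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultiaxialDiagnosis:
    """Hypothetical container holding one entry list per DSM-IV-TR axis."""
    axis_i: List[str] = field(default_factory=list)    # clinical disorders, other conditions
    axis_ii: List[str] = field(default_factory=list)   # personality disorders, mental retardation
    axis_iii: List[str] = field(default_factory=list)  # general medical conditions
    axis_iv: List[str] = field(default_factory=list)   # psychosocial and environmental problems
    axis_v_gaf: Optional[int] = None                   # global assessment of functioning

    def summary(self) -> str:
        """Return the five axes as a short, readable block of text."""
        def join(items: List[str]) -> str:
            # "None" is a placeholder chosen for this sketch.
            return "; ".join(items) if items else "None"

        gaf = "Not recorded" if self.axis_v_gaf is None else str(self.axis_v_gaf)
        return "\n".join([
            f"Axis I: {join(self.axis_i)}",
            f"Axis II: {join(self.axis_ii)}",
            f"Axis III: {join(self.axis_iii)}",
            f"Axis IV: {join(self.axis_iv)}",
            f"Axis V (GAF): {gaf}",
        ])


# Invented sample entries, for illustration only.
example = MultiaxialDiagnosis(
    axis_i=["Posttraumatic Stress Disorder, Chronic"],
    axis_iv=["Occupational problems"],
    axis_v_gaf=60,
)
print(example.summary())
```

Listing Axis I entries in order also leaves room for the convention of placing the primary diagnosis first.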
The DSM-IV-TR provides comprehensive information about mental disorders
by describing essential diagnostic features, associated features and disorders, specific
age and gender features, prevalence, course of the disorder, familial pattern, and dif-
ferential diagnosis. Most importantly, the diagnostic code and criteria for each dis-
order are provided. These criteria enhance the reliability and validity of the diagnos-
tic system by providing specific descriptions of symptoms and conditions relevant to
diagnosis. The criteria are meant to be so specific that, regardless of the clinician as-
sessing the client, a similar diagnostic outcome should emerge. As examples, Table
7.4 contains the diagnostic criteria for Posttraumatic Stress Disorder (PTSD) (APA,
2000, pp. 467-468) and Table 7.5 for Attention-Deficit Hyperactivity Disorder —
Combined Type (AD/HD) (APA, 2000, p. 92; symptom criteria only).
Note how the specificity of the criteria allows for clinicians to reliably determine
whether the disorder applies to a given client. This allows numerous clinicians as-
sessing the same client to arrive at a consistent determination as to whether a client
meets the specified diagnostic criteria. Accurate diagnosis occurs to a large extent be-
cause professional counselors ask specific questions about client symptoms as neces-
sary. It is better to ask a specific question or seek information of a specific nature and
receive a negative reply than to not ask and therefore not know whether a client pres-
ents with a given disorder. Clinical diagnosis is a process in which it is generally good
advice and good practice to leave no stone unturned.
It is essential that professional counselors adhere closely to the diagnostic crite-
ria provided in the DSM-IV-TR, as short- and long-term damage to clients can re-
sult from misdiagnosis. In the short term, misdiagnosis can cause a client to receive
an inappropriate treatment and accrue unnecessary expense and wasted time. In the
long term, an incorrect diagnosis can follow a client, as insurance companies and
healthcare professionals may make future decisions about treatment based on faulty
past information. These entities also may not always keep such private information
confidential.
The remainder of this chapter provides an orientation to diagnosis and classifi-
cation using the multiaxial framework. Professional counselors wanting additional
training and practice with clinical diagnosis are encouraged to take graduate coursework
in which the DSM-IV-TR diagnostic system is prominently featured and supervised
training is provided. In addition, other text resources are available, including
the DSM-IV Casebook (Spitzer, Gibbon, Skodol, Williams, & First, 1994) and the
DSM-IV Guide (Frances, First, & Pincus, 1995).
Table 7.4 Diagnostic criteria for Posttraumatic Stress Disorder (PTSD)
A. The person has been exposed to a traumatic event in which both of the following were
present:
(1) the person experienced, witnessed, or was confronted with an event or events that
involved actual or threatened death or serious injury, or a threat to the physical integrity
of self or others
(2) the person's response involved intense fear, helplessness, or horror. Note: In children,
this may be expressed instead by disorganized or agitated behavior
B. The traumatic event is persistently reexperienced in one (or more) of the following ways:
(1) recurrent and intrusive distressing recollections of the event, including images,
thoughts, or perceptions. Note: In young children, repetitive play may occur in which
themes or aspects of the trauma are expressed
(2) recurrent distressing dreams of the event. Note: In children, there may be frightening
dreams without recognizable content
(3) acting or feeling as if the traumatic event were recurring (includes a sense of reliving the
experience, illusions, hallucinations, and dissociative flashback episodes, including those
that occur on awakening or when intoxicated). Note: In young children, trauma specific
reenactment may occur
(4) intense psychological distress at exposure to internal or external cues that symbolize or
resemble an aspect of the traumatic event
(5) physiological reactivity on exposure to internal or external cues that symbolize or
resemble an aspect of the traumatic event
C. Persistent avoidance of stimuli associated with the trauma and numbing of general respon-
siveness (not present before the trauma), as indicated by three (or more) of the following:
(1) efforts to avoid thoughts, feelings, or conversations associated with the trauma
(2) efforts to avoid activities, places, or people that arouse recollections of the trauma
(3) inability to recall an important aspect of the trauma
(4) markedly diminished interest or participation in significant activities
(5) feeling of detachment or estrangement from others
(6) restricted range of affect (unable to have loving feelings)
(7) sense of foreshortened future (e.g., does not expect to have career, marriage, children, or
a normal lifespan)
D. Persistent symptoms of increased arousal (not present before the trauma), as indicated by
two (or more) of the following:
(1) difficulty falling or staying asleep
(2) irritability or outbursts of anger
(3) difficulty concentrating
(4) hypervigilance
(5) exaggerated startle response
E. Duration of the disturbance (symptoms in Criteria B, C, and D) is more than 1 month.
F. The disturbance causes clinically significant distress or impairment in social, occupational,
or other important areas of functioning.
Specify if:
Acute: if duration of symptoms is less than 3 months
Chronic: if duration of symptoms is 3 months or more
Specify if:
With Delayed Onset: if onset of symptoms is at least 6 months after the stressor
Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders, (4th ed.,
text rev.), American Psychiatric Association. Copyright 2000, Washington, DC: Author.
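The threshold logic in Table 7.4 (both parts of A, one or more from B, three or more from C, two or more from D, a duration of more than 1 month, and clinically significant impairment) can be expressed as a short Python sketch. It assumes a clinician has already judged which symptoms are present and supplies only the tallies; the function and parameter names are hypothetical, and the sketch is no substitute for clinical judgment.

```python
def ptsd_criteria_met(a1, a2, b_count, c_count, d_count,
                      duration_months, impairment):
    """Apply the count thresholds of Table 7.4 to clinician-judged tallies."""
    return bool(
        a1 and a2                    # Criterion A: both parts present
        and b_count >= 1             # Criterion B: one (or more) of 5
        and c_count >= 3             # Criterion C: three (or more) of 7
        and d_count >= 2             # Criterion D: two (or more) of 5
        and duration_months > 1      # Criterion E: more than 1 month
        and impairment               # Criterion F: distress or impairment
    )


def ptsd_specifiers(duration_months, onset_delay_months):
    """Return the course specifiers listed at the end of Table 7.4."""
    specifiers = ["Acute" if duration_months < 3 else "Chronic"]
    if onset_delay_months >= 6:
        specifiers.append("With Delayed Onset")
    return specifiers


print(ptsd_criteria_met(True, True, 2, 3, 2, 4, True))  # True
print(ptsd_specifiers(4, 7))  # ['Chronic', 'With Delayed Onset']
```

Writing the rule this way makes the point of the criteria visible: two clinicians who agree on which symptoms are present should reach the same determination.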
Table 7.5
Diagnostic criteria for Attention-Deficit Hyperactivity Disorder-
Combined Type (inattentive and hyperactive impulsive symptoms only)
A. Either (1) or (2):
(1) six (or more) of the following symptoms of inattention have persisted for at least 6
months to a degree that is maladaptive and inconsistent with developmental level:
Inattention
(a) often fails to give close attention to details or makes careless mistakes in
schoolwork, work, or other activities
(b) often has difficulties sustaining attention in tasks and play activities
(c) often does not seem to listen when spoken to directly
(d) often does not follow through on instructions and fails to finish schoolwork,
chores, or duties in the workplace (not due to oppositional behavior or failure to
understand instructions)
(e) often has difficulty organizing tasks and activities
(f) often avoids, dislikes, or is reluctant to engage in tasks that require sustained mental
effort (such as schoolwork or homework)
(g) often loses things necessary for tasks or activities (e.g., toys, school assignments,
pencils, books, or tools)
(h) is often easily distracted by extraneous stimuli
(i) is often forgetful in daily activities
(2) six (or more) of the following symptoms of hyperactivity-impulsivity have persisted for
at least 6 months to a degree that is maladaptive and inconsistent with developmental
level:
Hyperactivity
(a) often fidgets with hands or feet or squirms in seat
(b) often leaves seat in classroom or in other situations in which remaining seated is
expected
(c) often runs about or climbs excessively in situations in which it is inappropriate (in
adolescents or adults, may be limited to subjective feelings of restlessness)
(d) often has difficulty playing or engaging in leisure time activities quietly
(e) is often "on the go" or acts as if "driven by a motor"
(f) often talks excessively
Impulsivity
(g) often blurts out answers before questions have been completed
(h) often has difficulty awaiting turn
(i) often interrupts or intrudes on others (e.g., butts into conversations or games)
Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders, (4th ed.,
text rev.), American Psychiatric Association. Copyright 2000, Washington, DC: Author.
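The symptom-count rule in Table 7.5 (six or more of the nine inattention symptoms, or six or more of the nine hyperactivity-impulsivity symptoms) can be sketched in Python as follows. The function name and the returned keys are hypothetical, the counts are assumed to come from a clinician's judgment, and the table covers symptom criteria only, so a positive result here is not a diagnosis.

```python
def adhd_symptom_thresholds(inattention_count, hyperactivity_impulsivity_count):
    """Check the six-of-nine thresholds from Table 7.5 (symptom criteria only)."""
    for name, count in (("inattention", inattention_count),
                        ("hyperactivity-impulsivity", hyperactivity_impulsivity_count)):
        if not 0 <= count <= 9:
            raise ValueError(f"{name} count must be between 0 and 9")
    inattention_met = inattention_count >= 6
    hyperactivity_met = hyperactivity_impulsivity_count >= 6
    return {
        "inattention_met": inattention_met,
        "hyperactivity_impulsivity_met": hyperactivity_met,
        # Criterion A as printed in the table: either (1) or (2).
        "criterion_a_met": inattention_met or hyperactivity_met,
        # Both lists met: taken here to match the Combined Type named in the table title.
        "both_lists_met": inattention_met and hyperactivity_met,
    }


print(adhd_symptom_thresholds(7, 6)["both_lists_met"])  # True
```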
Axis I Disorders-Clinical Disorders and Other Conditions
That May Be a Focus of Clinical Attention
Axis I disorders include all of the disorders from the DSM-IV-TR except for mental
retardation and personality disorders (see Table 7.6). It is essential to understand
from the outset that a minority of clients actually enter the clinical arena with only
Table 7.6 DSM-IV-TR Axis I clinical disorders and other conditions
that may be a focus of clinical attention
Disorders usually first diagnosed in infancy, childhood, or adolescence
1. Delirium, dementia, and amnestic and other cognitive disorders
2. Mental disorders due to a general medical condition
3. Substance-related disorders
4. Schizophrenia and other psychotic disorders
5. Mood disorders
6. Anxiety disorders
7. Somatoform disorders
8. Factitious disorders
9. Dissociative disorders
10. Sexual and gender identity disorders
11. Eating disorders
12. Sleep disorders
13. Impulse-control disorders not elsewhere classified
14. Adjustment disorders
15. Other conditions that may be a focus of clinical attention
a single well-defined problem. It is common for a client to obtain multiple diagnoses
on Axis I and/or Axis II, referred to as comorbidity. Clark, Watson, and Reynolds
(1995) found that 60-80% of clients present with comorbidity, while only about
20-40% present with a singular diagnosis. This reality makes diagnosis of the typi-
cal client somewhat complicated. Therefore, professional counselors must start by
looking at the big picture of all characteristics and symptoms, then refine the ques-
tioning to arrive at more specific categorical decisions. This diagnostic decision-mak-
ing process is explained in more detail at the end of this chapter. Sometimes a client
may not meet all criteria for a given disorder, so each Axis I disorder allows for the
designation "Not Otherwise Specified" (NOS) to be used; however, this designation
should be used with caution because it may lead to misdiagnosis and inappropriate
treatment if misused.
Report all applicable disorders on Axis I, specifying the primary diagnosis by
listing it first and designating that it was the difficulty that prompted the office visit
(in an outpatient setting, state "reason for visit") or inpatient stay (state "principal
diagnosis"). Finally, severity specifiers may follow the disorder to denote the nature
of the disorder. Course specifiers and descriptors include Mild, Moderate, Severe, In
Partial Remission, In Full Remission, and Prior History (APA, 2000). Each of these
is explained in detail. For example, Severe is described as "many symptoms in excess
of those required to make the diagnosis, or several symptoms that are particularly se-
vere, are present, or the symptoms result in marked impairment in social or occupa-
tional functioning" (p. 2).
Numerous other conditions are included in the DSM-IV-TR that present with
clinical relevance deserving of attention but are not considered mental disorders.
Many of these more developmental conditions are referred to as "V-Codes" and all
Table 7.7 Other conditions that may be the focus of clinical attention
Psychological factors affecting medical conditions
    Mental disorders
    Psychological symptoms
    Personality traits or coping style
    Maladaptive health behaviors
    Stress-related physiological response
Medication-induced movement disorders
    Neuroleptic-induced
        Parkinsonism
        Malignant syndrome
        Acute dystonia
        Acute akathisia
        Tardive dyskinesia
    Medication-induced postural tremor
Other medication-induced disorder
    Adverse effects of medication NOS
Relational problems
    Relational problem related to a mental disorder or general medical condition
    Parent-child relational problem
    Partner relational problem
    Sibling relational problem
Problems related to abuse or neglect
    Physical abuse of child
    Sexual abuse of child
    Neglect of child
    Physical abuse of adult
    Sexual abuse of adult
Additional conditions that may be a focus of clinical attention
    Noncompliance with treatment
    Malingering
    Adult antisocial behavior
    Child or adolescent antisocial behavior
    Borderline intellectual functioning
    Age-related cognitive decline
    Bereavement
    Academic problem
    Occupational problem
    Identity problem, religious or spiritual problem
    Acculturation problem
    Phase-of-life problem
are coded on Axis I (except Borderline Intellectual Functioning). Fortunately, most
of the conditions have titles that are self-explanatory, so rather than expanding on
each, we present all of these conditions in Table 7.7.
Axis II Disorders-Personality Disorders and Mental Retardation
Axis II disorders are inflexible and enduring conditions that cause significant impair-
ment in social, occupational, academic, or other adaptive functioning. While most
clients will seek or be referred for treatment because of more acute problems or men-
tal disorders on Axis I, Axis II disorders may also be present, though not necessarily
responsible for prompting the referral. Personality disorders also often exacerbate
Axis I conditions. Importantly, clients presenting with Axis II disorders are fre-
quently less capable of accurate symptom self-report. This, coupled with generally
less precise diagnostic criteria, makes diagnosis of personality disorders a challenging
endeavor (Fong, 1995). Axis II disorders include mental retardation and personality
disorders.
Personality disorders have been categorized according to the following clusters:
■ Cluster A: Paranoid Personality Disorder, Schizoid Personality Disorder,
Schizotypal Personality Disorder
■ Cluster B: Antisocial Personality Disorder, Borderline Personality Disorder,
Histrionic Personality Disorder, Narcissistic Personality Disorder
■ Cluster C: Avoidant Personality Disorder, Dependent Personality Disorder,
Obsessive-Compulsive Personality Disorder
Such a clustering scheme does not preclude an individual from having co-
occurring personality disorders across two or more clusters. In addition, the DSM-
IV-TR allows diagnosis of Personality Disorder — NOS for individuals who display
characteristics of one or more personality disorders but do not fulfill all specific criteria
in a given classification.
Axis III-Current Medical Conditions
Axis III is utilized for the report of current general medical conditions of potential
relevance to a client's current mental disorders or conditions and treatment (APA,
2000). If a medical condition causes the disorder, it should not be listed on Axis III,
as it should already be included on Axis I (e.g., Personality Change Due to a General
Medical Condition). However, if the mental disorder is a direct physiological
result of a general medical condition, then Mental Disorder Due to a General Medical
Condition should be listed on Axis I, with the general medical condition noted on
both Axis I and Axis III. In other words, the purpose of Axis III is to allow descrip-
tion of medical conditions that are not the direct cause of a mental disorder, but
which must be considered when planning a client's treatment. For example, if a
client presents with depressive symptoms that are believed to result from the client's
hypothyroidism, the Axis I diagnosis should be Mood Disorder Due to
Hypothyroidism, With Depressive Features, and Hypothyroidism should again be
included on Axis III. The general medical conditions used on Axis III are those not
included in the chapter on Mental Disorders in the International Classification of
Diseases (ICD-9-CM) and are important to include in a multiaxial diagnosis because
these conditions may affect a managed care organization's decision to continue
treatment. If no Axis III diagnosis is evident, clinicians should provide the designation
"None." If the Axis III diagnosis will be made pending further evaluation, clinicians
should provide the designation "Deferred."
Table 7.8 Categories of psychosocial and environmental problems
Problems with primary support
Problems related to the social environment
Educational problems
Occupational problems
Housing problems
Economic problems
Problems with access to healthcare services
Problems related to interaction with the legal system or crime
Other psychosocial and environmental problems
Axis IV-Psychosocial and Environmental Problems
Axis IV is used to report environmental and psychosocial problems that may be in-
fluencing diagnosis, treatment planning, and eventual prognosis of a client's mental
disorder(s). Examples include the death or loss of a family member, close friend, or
job; estrangement, separation, or divorce; academic problems; poverty, homelessness,
or inadequate healthcare. For convenience, Table 7.8 lists the common categorical
designations included in the DSM-IV-TR (APA, 2000). While these problems are
typically listed on Axis IV, if these problems constitute the reason the client is seek-
ing treatment, it is appropriate to list them on Axis I while specifying "Other
Conditions That May Be the Focus of Clinical Attention."
Axis V-Global Assessment of Functioning (GAF)
Axis V allows the clinician to provide an assessment of the client's overall level of functioning,
using what APA (2000) refers to as the Global Assessment of Functioning
(GAF). This assessment reflects one's professional judgment and is useful in treatment
planning and outcome assessment. The GAF indicates a client's current level of functioning
unless otherwise noted; at times, the clinician may want to indicate the client's
highest level of overall functioning during the past three months or even the previous
year. The GAF should not reflect the client's physical or environmental
problems or limitations, only the client's functioning in the social, occupational, or
psychological areas. Reported as "GAF = ###" on Axis V, the GAF scale ranges from
0 to 100, subdivided into ten 10-point ranges. The higher the GAF, the
higher the client's level of functioning. Table 7.9 contains the GAF scale descriptors
(APA, 2000). Each is explained in greater detail in the DSM-IV-TR. For example, a GAF between 41 and
50 indicates "Serious symptoms (e.g., suicidal ideation, severe obsessional rituals, frequent
shoplifting) or any serious impairment in social, occupational, or school functioning
(e.g., no friends, unable to keep a job)," while a GAF between 51 and 60
indicates "Moderate symptoms (e.g., flat affect and circumstantial speech, occasional
panic attacks) or moderate difficulty in social, occupational, or school functioning
(e.g., few friends, conflicts with peers or co-workers)" (p. 34).
Table 7.9 Global Assessment of Functioning (GAF) designations
91-100 Superior functioning
81-90 Absent or minimal symptoms
71-80 Transient and expectable reactions
61-70 Mild symptoms
51-60 Moderate symptoms
41-50 Serious symptoms
31-40 Some impairments in reality testing or communication
21-30 Delusions, hallucinations, or serious impairment in judgment
11-20 Some danger to self or others
1-10 Persistent danger to self or others
0 Inadequate information
Source: Reprinted with permission from the Diagnostic and Statistical Manual of Mental Disorders (4th ed.,
text rev.), p. 34. American Psychiatric Association. Copyright 2000, Washington, DC: Author.
For instances in which a clinician might wish to separately assess individual
components of functioning, rather than an overall level, APA (2000) provides a
Social and Occupational Functioning Assessment Scale (SOFAS), a Global
Assessment of Relational Functioning (GARF), and a Defensive Functioning Scale
(DFS).
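The 10-point GAF ranges in Table 7.9 lend themselves to a simple lookup. The following Python sketch is only an illustration: the function names gaf_descriptor and format_axis_v are hypothetical, and the short labels are the Table 7.9 designations rather than the full DSM-IV-TR anchor descriptions. It does not replace the professional judgment on which a GAF rating rests.

```python
# Score ranges and short designations as listed in Table 7.9 (APA, 2000).
GAF_RANGES = [
    (91, 100, "Superior functioning"),
    (81, 90, "Absent or minimal symptoms"),
    (71, 80, "Transient and expectable reactions"),
    (61, 70, "Mild symptoms"),
    (51, 60, "Moderate symptoms"),
    (41, 50, "Serious symptoms"),
    (31, 40, "Some impairments in reality testing or communication"),
    (21, 30, "Delusions, hallucinations, or serious impairment in judgment"),
    (11, 20, "Some danger to self or others"),
    (1, 10, "Persistent danger to self or others"),
    (0, 0, "Inadequate information"),
]


def gaf_descriptor(score: int) -> str:
    """Return the Table 7.9 designation for a GAF score from 0 to 100."""
    for low, high, label in GAF_RANGES:
        if low <= score <= high:
            return label
    raise ValueError(f"GAF must be between 0 and 100, got {score}")


def format_axis_v(score: int, timeframe: str = "current") -> str:
    """Format an Axis V entry in the 'GAF = ## (timeframe)' style."""
    gaf_descriptor(score)  # raises ValueError if the score is out of range
    return f"GAF = {score} ({timeframe})"


print(format_axis_v(55), "-", gaf_descriptor(55))
```

The timeframe argument mirrors the option of reporting the highest level of functioning over a past period (e.g., "highest level past year") instead of the current level.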
Diagnostic Decision Making Using the DSM-IV-TR
The five axes reviewed above can be combined to construct a systematic and com-
prehensive DSM-IV-TR multiaxial assessment system (APA, 2000) that describes a
client's mental disorder(s), medical condition(s), environmental and psychosocial
factors, and overall level of functioning. The multiaxial system is designed to pro-
vide organized, substantive communication about complex diagnostic situations.
Professional counselors are encouraged to provide the complete five-axial diagnosis
for every client in order to effectively communicate the diagnosis to other profes-
sionals and plan an effective treatment regimen (Fong, 1995).
Multiaxial diagnosis is a complicated process, and mastery requires substantial
education, training, and practice under supervision. While master clinicians can
sometimes reach reliable and accurate diagnostic decisions based on clinical experi-
ence, many clinicians find it helpful to use a structured decision-making process.
Figure 7.2 presents a structured process clinicians may find helpful in guiding diag-
nostic decision making. This flow chart guides the clinician through a process in
which very general questions can lead to deeper examination using the decision trees
provided in the DSM-IV-TR. For example, consider the case of an adult undergoing
a stressful divorce and employed by a company undergoing downsizing who presents
with symptoms of depression. These symptoms have been occurring for about four
weeks and have led to intense feelings of hopelessness, weight loss, and insomnia.
On the flow chart, this case would be tracked through the Axis I Disorders category
and pursued with the DSM-IV-TR decision tree for differential diagnosis of Mood
Disorders, eventually resulting in a probable diagnosis of Major Depressive Disorder,
Single Episode (assuming this was the first time depressive symptoms were displayed
to this degree). If the client has no enduring personality disorders or complicating
medical conditions, this client's symptoms may result in the following multiaxial
diagnosis:
■ Axis I 296.22 — Major Depressive Disorder, Single Episode, Moderate
Without Psychotic Features
■ Axis II None
■ Axis III None
■ Axis IV Disruption of family by divorce, threat of job loss
■ Axis V GAF = 55 (current)
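For readers who keep electronic case notes, the five-axis record above can be sketched as a small data structure. This Python sketch is illustrative only: the class name MultiaxialDiagnosis and its field names are hypothetical, and the rule of printing "None" for an empty axis follows the reporting convention described in this chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MultiaxialDiagnosis:
    """One five-axis record; an axis with no entries is reported as 'None'."""
    axis_i: List[str] = field(default_factory=list)    # clinical disorders
    axis_ii: List[str] = field(default_factory=list)   # personality disorders, mental retardation
    axis_iii: List[str] = field(default_factory=list)  # general medical conditions
    axis_iv: List[str] = field(default_factory=list)   # psychosocial and environmental problems
    gaf: int = 0                                       # Axis V; 0 means inadequate information
    gaf_timeframe: str = "current"

    def report(self) -> str:
        def show(entries: List[str]) -> str:
            return "; ".join(entries) if entries else "None"

        rows = [
            ("Axis I", show(self.axis_i)),
            ("Axis II", show(self.axis_ii)),
            ("Axis III", show(self.axis_iii)),
            ("Axis IV", show(self.axis_iv)),
            ("Axis V", f"GAF = {self.gaf} ({self.gaf_timeframe})"),
        ]
        return "\n".join(f"{label:<9}{value}" for label, value in rows)


# The divorce and downsizing case from the text.
case = MultiaxialDiagnosis(
    axis_i=["296.22 Major Depressive Disorder, Single Episode, Moderate"],
    axis_iv=["Disruption of family by divorce", "threat of job loss"],
    gaf=55,
)
print(case.report())
```

An Axis III entry awaiting further evaluation could be recorded as the single entry "Deferred".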
Note that in the example above, the initial question involved whether a client's
symptoms constituted a possible mental disorder. If the answer had been no, the
process would have stopped right there because the DSM-IV-TR is helpful only in di-
agnosing mental disorders, and no diagnosis would have been warranted. Also, note
that the depressive symptoms were relatively recent and acute, not enduring, persist-
ent, and inflexible. Thus it was judged that a Personality Disorder (or Mental
Retardation) was not evident and that the condition was likely a mental disorder lo-
cated on Axis I. If a Personality Disorder was indicated, exploration of these disor-
ders would commence, followed by a return to consideration of Axis I disorders to
address the more acute symptoms. When pursuing an Axis I diagnosis, the clinician
needs to address each query in the remainder of the flow chart (and subsequent de-
cision trees, if necessary) to ensure a comprehensive diagnosis. As mentioned above,
most clients present with more than one condition, so the experienced clinician ap-
proaches each client's diagnosis with an eye toward "leaving no stone unturned."
While this complicates the diagnostic process, a comprehensive diagnosis generally
improves the prospects for treatment, because multimodal treatment strategies can
be undertaken to address all areas of concern. Note that the DSM-IV-TR has numerous
other mental disorders that are not accounted for by the flow chart until the final
"catchall" box on the decision tree. The burden for comprehensive diagnostic work
always rests on the competence and experience of the clinician. For further discussions
and applications of multiaxial diagnosis, the reader is referred to the DSM-IV
Casebook (Spitzer et al., 1994).
Finally, cultural considerations must always be monitored throughout the diag-
nostic and treatment processes and are ultimately the responsibility of the clinician.
The DSM-IV-TR provides discussions of relevant cultural considerations for most of
the disorders, and Appendix I of the DSM-IV-TR contains a glossary of culture-bound
syndromes, including descriptions and relevance to psychopathology.
[Figure 7.2 Flow chart for diagnostic decision making using the DSM-IV-TR]
Think About It 7.3 Think of a client or associate who is experiencing a
mental or emotional problem. What is the problem, its severity, and its envi-
ronmental influences and consequences? How can you explain the difficul-
ties from a developmental perspective? What approach(es) could you use to
help? Next, using the DSM-IV-TR, attempt to understand the individual's
issue using the multiaxial system. What treatment approach(es) could be
used? Finally, what similarities and differences did you note between the de-
velopmental and clinical approaches employed?
USING CLINICAL INVENTORIES AND TESTS IN COUNSELING
Information Sources for Clinical and Personality Assessment
Piedmont (2006) suggested that information on clients be gathered through four
different and complementary sources: life outcomes, observer ratings, self-report ratings,
and test data (LOST). Each information source has strengths and limitations,
but accessing information from each source frequently provides a synergistic effect
that offers a balanced and confirmatory approach to a comprehensive evaluation.
Life outcomes data include the factual information about a client that can often be
collected during an intake interview: "Has the client ever been married?" "How
many children?" "Has the client ever received counseling services in the past?" "If so,
for what and with what result?" Each question reveals certain factual information the
professional counselor needs to understand the client's life history, properly diagnose
or understand current complaints, and develop an effective treatment plan.
Generally, life outcome data are factual, unambiguous, and objective, although con-
firmation of client report is always advisable. These data can be obtained from school
or medical records, legal or civic records, directly from the client during a written or
oral intake interview, or through direct assessment during a structured or semi-struc-
tured clinical interview. Comprehensive attempts at structuring the collection of per-
sonal histories include the Personal History Checklist (Schinka, 1989) and Mental
Status Checklist (Schinka, 1988).
Observer ratings involve the report of observations of clients by significant and
informed people in their lives. Parents and teachers are often in a good position to
rate and evaluate the behaviors of children. Likewise, spouses and some friends or
peers may be able to provide helpful insight and observations on an adult client.
Importantly, a rating scale is an attempt to objectify someone's subjective perceptions.
As such, caution over the veracity and honesty of the ratings must be taken into ac-
count. The key is to capture the perceptions of several different sources of information
so that a clinician can perform cross-validation and determine the robustness or con-
vergence of various informant perceptions. For example, if a client is referred for de-
pression, reports she is depressed, and rates herself as depressed, she may very well be
depressed. But if her parents and teachers do not rate her as being depressed, it is likely
that something more complex is occurring. If, on the other hand, her parents and
teachers confirm her depression, the case becomes clearer. Such is the value of other-report
observer ratings. While observer ratings can be just as biased as self-report ratings,
the bias is of a different type and therefore usually adds more clarity than confusion.
Indeed, McCrae and Costa (1987) and Piedmont (1994) reported convergence of
perspectives of observers to be robust and helpful in confirming personality traits.
Many clinical and personality tests have observer report versions, and more of these
include validity scales to help clinicians determine the veracity of results.
Self-report ratings are most commonly used in clinical and personality assess-
ment because professional counselors nearly always have direct access to the client,
and client perspectives are essential to effective treatment planning, even in cases
when they are less than cooperative. Self-report instruments are frequently referred
to as objective tests, even though, like observer ratings, professional counselors are
best advised to view them as attempts to objectify the subjective perceptions of
clients. Some clients present themselves in a biased manner, and clinicians must be
wary of the impact such bias may have on test results. Many self-report scales include
validity scales to help clinicians determine likely inaccuracies of self-perception and
outright dishonesty in a client's self-presentation. In spite of the potential limitation
of bias, self-report rating scales have two major strengths. They allow: (1) compari-
son of a client's self-ratings to a norm sample (i.e., are norm referenced), and (2) di-
rect assessment of client thoughts, feelings, and behaviors, which are all facets of a
client's mental state and personality functioning (Piedmont, 2006). Many of the
tests reviewed throughout the remainder of this chapter are self-report inventories.
Test data involve the use of instruments to directly assess client functioning.
Importantly, such instrumentation measures information that clients either do not
know they are producing, or are unaware of how the information will be interpreted.
Physiological measures fall into this category (e.g., galvanic skin response, electro-
cardiogram). In clinical and personality assessment, projective tests are examples of
collecting test data. In general, projective tests present a client with ambiguous stim-
uli, such as inkblots, incomplete sentences, or pictures about which a client tells a
story. The client, unaware of the purpose of the activity or the meaning of responses,
projects thoughts and feelings onto the stimuli. Clinicians then interpret these re-
sponses to understand the client's underlying needs, drives, motivations, thoughts,
and emotions. Test data have the advantage of being difficult to "fake," thus reduc-
ing the opportunity to bias the results. While projective test data are certainly used
by some clinicians for diagnostic purposes, the psychometric properties of most pro-
jective tests do not support their use for this purpose. On the other hand, rich de-
scription and understanding of client personality can often be derived from projec-
tive techniques by skilled professional counselors. Thus an expanded discussion of
projective assessment and commonly used projective tests will be provided at the end
of this chapter within the context of personality assessment.
How Clinical and Personality Test Content Is Developed
Clinical and personality inventories are generally multidimensional tests composed of
several to numerous scales. Each of these scales is supposed to provide a helpful addi-
tion to the overall test, usually measuring some unique or important facet of the over-
all construct being measured. Four primary methods are used to construct clinical and
personality inventories: content validation, theory, empirical-criterion keying, and
factor analysis. Content validation relies on the logical process of deductive reasoning
to determine the items that are assigned to a given scale. Each item under considera-
tion may be included on the scale if the test developer determines (through logical
analysis) that it contributes to the measurement of the concept under study (e.g.,
Major Depression, Schizophrenia, Generalized Anxiety Disorder). Scales such as the
Woodworth Personal Data Sheet and the Edwards Personal Preference Schedule were constructed
using the content validation method.
Theories are sometimes used to develop test items and scales. The theory guides
item development and categorical assignments to potential subscales. An example of
a popular test designed using an underlying theory is the Myers-Briggs Type Indicator
(MBTI), which is based on Jung's theory. To be fair, many other inventories also use
content validation of a theory at an early phase of test development but subsequently
use one of the next two procedures to complete the instrument design (see the dis-
cussion of bootstrapping included in Chapter 6).
Empirical-criterion keying is a procedure in which selected items are adminis-
tered to both nonclinical samples (individuals without the diagnosis) and clinical
samples (individuals with the diagnosis). While this process can sometimes use complex
analyses, simply put, the items that identify the clinical group and not the nonclinical
group are selected to comprise that particular clinical scale. The MMPI-2,
MMPI-A, and California Psychological Inventory are among the better-known tests
using the empirical-criterion keying method. For example, the MMPI-2 Depression
clinical scale (D) comprises 57 items, many of which are obviously related to
depression (i.e., have face validity) and some that leave examiners wondering how
the item could possibly be related to depression. What is the "rational" or "logical"
connection? The connection is that the individuals with depression comprising the
clinical sample endorsed the item significantly more frequently than the nonclinical
sample of individuals without depression. Thus the "logic" is that there is something
about the item that makes it relate to responses of individuals with depression, even
though the link may not be obvious or rationally determined.
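The selection logic of empirical-criterion keying can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used to build any published scale: the item names and response counts are invented, the helper names are hypothetical, and a one-tailed two-proportion z test with a cutoff of 1.645 stands in for whatever "complex analyses" a test developer might actually use.

```python
import math
from typing import Dict, List


def two_proportion_z(clinical: List[bool], nonclinical: List[bool]) -> float:
    """z statistic for the difference in endorsement rates between two groups."""
    n1, n2 = len(clinical), len(nonclinical)
    p1 = sum(clinical) / n1
    p2 = sum(nonclinical) / n2
    pooled = (sum(clinical) + sum(nonclinical)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0


def key_items(clinical: Dict[str, List[bool]],
              nonclinical: Dict[str, List[bool]],
              z_cutoff: float = 1.645) -> List[str]:
    """Keep items the clinical group endorses significantly more often."""
    return [item for item in clinical
            if two_proportion_z(clinical[item], nonclinical[item]) > z_cutoff]


def answers(n_true: int, n_total: int) -> List[bool]:
    """Build a hypothetical response list with n_true 'True' endorsements."""
    return [True] * n_true + [False] * (n_total - n_true)


# Invented endorsement counts for three items, 50 respondents per group.
clinical_sample = {
    "item_a": answers(40, 50),
    "item_b": answers(25, 50),
    "item_c": answers(30, 50),
}
nonclinical_sample = {
    "item_a": answers(10, 50),
    "item_b": answers(24, 50),
    "item_c": answers(15, 50),
}

print(key_items(clinical_sample, nonclinical_sample))
```

Note that the selection depends only on endorsement rates, never on item wording, which is why a keyed item need not look related to the construct. A fuller version would also key items that the clinical group endorses significantly less often.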
Factor analysis has risen in prominence as a procedure for scale construction
over the past half century due to the advent of high-speed computers. As described
in Chapter 6, factor analysis is an item-sorting technique based on item intercorre-
lations, and the subsequent correlation between each item and derived dimensions
or components, called factors. The factors are subsequently named and may or may
not be "pure measures" of any given clinical diagnosis or personality trait. Each fac-
tor is a statistical entity that has been empirically derived and which can be studied
and refined through further research and test development. The 16PF and NEO-PI-
R are examples of empirically derived tests constructed through the use of factor
analysis. Factor analysis has contributed to an explosion of clinical and personality
inventories. Of course, the primary criticism of the use of factor analysis is that it de-
rives statistical models of item relationships, rather than theoretical models of item
relationships. That is, many test developers put too much faith in factor analysis and
actually use it to design the test, rather than constructing the test using a theoretical
model and using factor analysis to explore the dimensions underlying the test and
confirm the original design.
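Because factor analysis begins with item intercorrelations, it can help to see what such a matrix looks like. The Python sketch below computes only that first step, not a factor analysis itself, and uses invented ratings from six respondents on four hypothetical items. Items 1 and 2 correlate strongly with each other, as do items 3 and 4, which is the kind of clustering a factor analysis would then summarize as two factors.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long lists of ratings."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def intercorrelations(items: List[List[float]]) -> List[List[float]]:
    """Correlate every item with every other item."""
    return [[pearson_r(a, b) for b in items] for a in items]


# Hypothetical 1-5 ratings; each inner list is one item across six respondents.
items = [
    [1, 2, 3, 4, 5, 3],
    [1, 2, 3, 5, 5, 2],
    [5, 1, 4, 2, 3, 3],
    [4, 1, 5, 2, 3, 3],
]
matrix = intercorrelations(items)
for row in matrix:
    print("  ".join(f"{r:5.2f}" for r in row))
```

Naming the resulting clusters is left to the test developer, which is the point of the criticism above: the statistics group the items, but only theory can say what a group means.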
As mentioned earlier, some clinical and personality inventories use more than one
of these four design methodologies. Regardless of the test development procedure,
numerous studies must be undertaken to explore the reliability and validity of test
scores across various samples and for various purposes before the test is ready for
widespread use in clinical decision making.
SOME COMMONLY USED CLINICAL
ASSESSMENT INVENTORIES
Professional counselors in clinical practice may rely heavily on objective clinical in-
ventories when exploring a client's presenting problem, diagnosing client symptoms,
developing a treatment plan, and determining the effectiveness of therapeutic interventions.
Numerous clinical inventories have been developed, and this section presents
a basic review of more than 15 of those most commonly used by professional
counselors in clinical practice. As with any of the tests reviewed throughout this
book, more in-depth information on administration, scoring, interpretation, and
technical characteristics can be found in the test manual, Mental Measurements
Yearbook reviews, and the extant literature.
Minnesota Multiphasic Personality Inventory-Second Edition
(MMPI-2)
The Minnesota Multiphasic Personality Inventory-Second Edition (MMPI-2; Butcher
et al., 1989) is a 567-item, true-false, self-report inventory designed to assess some
of the major patterns of personality in adults ages 18-90 years. Items measure 6 validity
indicators, 10 clinical scales (see Table 7.10), and numerous supplementary,
clinical component, content scales, and clinical subscales (see Table 7.11). Some
advocate for the use of Clinical scale patterns to provide quick insight into client diagnosis
and personality, rather than relying on interpretation of individual scales.
Patterns are represented by reporting the Clinical Scale numeric designation for the
two or three highest scale scores. For example, if the client's highest score is on scale
2 (Depression), and the client's second highest score is on scale 7 (Psychasthenia),
the pattern would be "27." Numerous books written about the MMPI and MMPI-
2 provide interpretive suggestions applicable to pattern analysis.
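The pattern notation just described can be sketched in Python. The sketch makes several assumptions that should be read as such: the T scores are invented, the function name code_type is hypothetical, all ten Clinical scales are ranked, Social introversion is numbered 0 following the usual MMPI convention, and ties are broken simply by the order in which scales are listed.

```python
from typing import Dict

# Numeric designations of the Clinical scales.
SCALE_NUMBERS = {
    "Hs": 1, "D": 2, "Hy": 3, "Pd": 4, "Mf": 5,
    "Pa": 6, "Pt": 7, "Sc": 8, "Ma": 9, "Si": 0,
}


def code_type(t_scores: Dict[str, float], points: int = 2) -> str:
    """Return the two- or three-point pattern for the highest Clinical scales."""
    if points not in (2, 3):
        raise ValueError("points must be 2 or 3")
    # sorted() is stable, so tied scales keep their listed order.
    ranked = sorted(t_scores, key=lambda scale: t_scores[scale], reverse=True)
    return "".join(str(SCALE_NUMBERS[scale]) for scale in ranked[:points])


# Invented T scores for one client.
client = {
    "Hs": 58, "D": 78, "Hy": 60, "Pd": 55, "Mf": 50,
    "Pa": 62, "Pt": 72, "Sc": 65, "Ma": 48, "Si": 59,
}
print(code_type(client))
print(code_type(client, points=3))
```

For this client the highest scores are on scales 2 and 7, so the two-point pattern is "27", matching the example in the text. A pattern produced this way is only a pointer into the interpretive literature and says nothing about whether the elevations are clinically meaningful.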
The restandardization sample (n = 2,600) consisted of paid volunteer adults
(1,138 men and 1,462 women) recruited from seven states, a federal Indian reserva-
tion, and four military bases via random mailings and advertisements. Biographical
data and information about recent stressful life events were also collected (Nichols,
1992). Hispanic and Asian American subgroups were underrepresented in the normative
sample, whereas Native Americans were overrepresented (Butcher et al., 2001).
The MMPI-2 takes about 60 to 90 minutes to complete and can be scored by
hand in 30 to 60 minutes, or in about 5 minutes by computer. Sample items in-
clude "Spirits sometimes speak to me," "I am as happy as others seem to be," and
"I dread the thought of a hurricane." Convenient score profiles are available to plot
and transform raw scores into T scores. Test-retest coefficients based on 82 males
and 111 females with a median interval of seven days ranged from 0.54 (females
on the Sc scale) to 0.93 (males on the Si scale) on the Clinical scales, 0.77 (males
on the BIZ scale) to 0.91 (males and females on the SOD scale) on the Content
scales, and 0.63 (males on the MAC-R scale) to 0.91 (males and females on the A
scale) on Supplementary scales. Internal consistency estimates ranged from 0.56
to 0.87 (except for the Pa scale, which yielded coefficients of 0.34 for males and 0.39
for females) on the Clinical scales and 0.68 (females on the TPA scale) to 0.86
(males and females on the CYN and DEP scales, respectively) on the Content scales
(Butcher et al., 2001).
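One common way to translate reliability coefficients like these into practical terms is the standard error of measurement, SEM = SD × sqrt(1 − r). The Python sketch below assumes T scores with a standard deviation of 10 and centers a rough 95% band on the observed score; both are simplifying assumptions for illustration, the observed score of 65 is invented, and the function names are hypothetical.

```python
import math
from typing import Tuple


def standard_error_of_measurement(reliability: float, sd: float = 10.0) -> float:
    """SEM = SD * sqrt(1 - r) for a score scale with standard deviation sd."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(observed: float, reliability: float,
                    sd: float = 10.0, z: float = 1.96) -> Tuple[float, float]:
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    sem = standard_error_of_measurement(reliability, sd)
    return observed - z * sem, observed + z * sem


# Lowest and highest test-retest coefficients reported for the Clinical scales.
for r in (0.54, 0.93):
    low, high = confidence_band(65, r)
    sem = standard_error_of_measurement(r)
    print(f"r = {r:.2f}: SEM = {sem:.1f}, band = {low:.0f} to {high:.0f}")
```

With r = 0.54 the SEM is close to 7 T-score points, so the band around a single observed score spans more than two standard deviations; with r = 0.93 the SEM falls below 3 points.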
In general, caution is warranted when using the MMPI-2 for diagnostic pur-
poses. Low scale reliabilities (<0.90) make the MMPI-2 more helpful as a test for un-
derstanding individual pathology and exploring intrapersonal hypotheses than for
making diagnoses. The MMPI-2 is a Level C instrument and requires proficiency in
reading English at the 8th-grade level. The clinician should note the inclusion of sev-
eral helpful validity scales. The L scale identifies individuals presenting themselves
in a favorable light, the K scale is a measure of defensiveness, and the F scale is de-
signed to detect clients who randomly respond, cannot understand the items, or are
attempting to fake bad (Erford, 2006). The VRIN and TRIN (validity scales) help
determine if a subject responded in an inconsistent or contradictory way. Although
the MMPI-A (Adolescent version) is designed for adolescents ages 14-18 years, the
MMPI-2 is more appropriate for 18-year-olds living independently from their par-
ents (Butcher et al., 2001). Clinicians should also note that Hispanics, Asian
Americans, and older women were underrepresented in the restandardization of the
Table 7.10 MMPI-2 Clinical scale descriptions
Clinical scale designations    Description
1 Hs Hypochondriasis    Excessive health concerns, somatic complaints, narcissism, self-centeredness
2 D Depression    Depression, brooding, discouragement, pessimism, hopelessness
3 Hy Hysteria    Sensory or physical complaints of no organic cause, immaturity, physical complaints, denial of aggression, need for affection
4 Pd Psychopathic deviation    Antisocial/Asocial behavior, impulsivity, immaturity, lack of concern over social and moral standards of conduct
5 Mf Masculinity/Femininity    Masculine and feminine interests
6 Pa Paranoia    Paranoia, suspicion, hostility, psychotic behavior, cynicism, excessive moral virtue
7 Pt Psychasthenia    Anxiety, obsessions, compulsions, exaggerated fears, difficulty concentrating, physical complaints
8 Sc Schizophrenia    Withdrawal, social/emotional alienation, thought disturbance, bizarre sensory experiences, lack of ego mastery
9 Ma Hypomania    High energy, elated mood, low frustration tolerance, denial of social anxiety
0 Si Social introversion    Introversion, shyness, neurotic maladjustment, self-depreciation
Source: Manual for Administration, Scoring, and Interpretation of the Minnesota Multiphasic Personality Inventory-Third Edition by Butcher et al.
(2001). Minneapolis: University of Minnesota Press.
Table 7.11 Scales and subscales derived from MMPI-2 items
Validity scales
Cannot Say (?) (reported as a raw score only, not plotted)
VRIN - Variable response inconsistency
TRIN - True response inconsistency
F - Infrequency
FB - Back F
Fp - Infrequency-Psychopathology
L - Lie
K - Correction
S - Superlative self-presentation
Superlative self-presentation subscales
S1 - Beliefs in human goodness
S2 - Serenity
S3 - Contentment with life
S4 - Patience/Denial of irritability
S5 - Denial of moral flaws
Clinical scales
1 Hs - Hypochondriasis
2 D - Depression
3 Hy - Hysteria
4 Pd - Psychopathic deviate
5 Mf - Masculinity-Femininity
6 Pa - Paranoia
7 Pt - Psychasthenia
8 Sc - Schizophrenia
9 Ma - Hypomania
0 Si - Social introversion
RC (Restructured clinical) scales
RCd - dem - Demoralization
RC1 - som - Somatic complaints
RC2 - lpe - Low positive emotions
RC3 - cyn - Cynicism
RC4 - asb - Antisocial behavior
RC6 - per - Ideas of persecution
RC7 - dne - Dysfunctional negative emotions
RC8 - abx - Aberrant experiences
RC9 - hpm - Hypomanic activation
Clinical subscales
Harris-Lingoes subscales
D1 - Subjective depression
D2 - Psychomotor retardation
D3 - Physical malfunctioning
D4 - Mental dullness
D5 - Brooding
Hy1 - Denial of social anxiety
Hy2 - Need for affection
Hy3 - Lassitude-Malaise
Hy4 - Somatic complaints
Hy5 - Inhibition of aggression
Pd1 - Familial discord
Pd2 - Authority problems
Pd3 - Social imperturbability
Pd4 - Social alienation
Pd5 - Self-alienation
Pa1 - Persecutory ideas
Pa2 - Poignancy
Pa3 - Naivete
Sc1 - Social alienation
Sc2 - Emotional alienation
Sc3 - Lack of ego mastery, cognitive
Sc4 - Lack of ego mastery, conative
Sc5 - Lack of ego mastery, defective inhibition
Sc6 - Bizarre sensory experiences
Ma1 - Amorality
Ma2 - Psychomotor acceleration
Ma3 - Imperturbability
Ma4 - Ego inflation
Social introversion subscales
Si1 - Shyness/Self-consciousness
Si2 - Social avoidance
Si3 - Alienation, self and others
Content scales
ANX - Anxiety
FRS - Fears
OBS - Obsessiveness
DEP - Depression
HEA - Health concerns
BIZ - Bizarre mentation
ANG - Anger
CYN - Cynicism
ASP - Antisocial practices
TPA - Type A
LSE - Low self-esteem
SOD - Social discomfort
FAM - Family problems
WRK - Work interference
TRT - Negative treatment indicators
Content component scales
Fears subscales
FRS1 - Generalized fearfulness
FRS2 - Multiple fears
Depression subscales
DEP1 - Lack of drive
DEP2 - Dysphoria
DEP3 - Self-depreciation
DEP4 - Suicidal ideation
Health concerns subscales
HEA1 - Gastrointestinal symptoms
HEA2 - Neurological symptoms
HEA3 - General health concerns
Bizarre mentation subscales
BIZ1 - Psychotic symptomatology
BIZ2 - Schizotypal characteristics
Anger subscales
ANG1 - Explosive behavior
ANG2 - Irritability
Cynicism subscales
CYN1 - Misanthropic beliefs
CYN2 - Interpersonal suspiciousness
Antisocial practices subscales
ASP1 - Antisocial attitudes
ASP2 - Antisocial behavior
Type A subscales
TPA1 - Impatience
TPA2 - Competitive drive
Low self-esteem subscales
LSE1 - Self-doubt
LSE2 - Submissiveness
Social discomfort
SOD1 - Introversion
SOD2 - Shyness
Family problems
FAM1 - Family discord
FAM2 - Familial alienation
Negative treatment indicators
TRT1 - Low motivation
TRT2 - Inability to disclose
Supplementary scales
Personality psychopathology five scales (PSY-5)
AGGR - Aggressiveness
PSYC - Psychoticism
DISC - Disconstraint
NEGE - Negative emotionality/Neuroticism
INTR - Introversion/Low positive emotionality
Broad personality characteristics
A - Anxiety
R - Repression
Es - Ego strength
Do - Dominance
Re - Social responsibility
Generalized emotional distress
Mt - College maladjustment
PK - Post-Traumatic Stress Disorder-Keane
MDS - Marital distress
Behavioral dyscontrol
Ho - Hostility
O-H - Overcontrolled hostility
MAC-R - MacAndrew-revised
AAS - Addiction admission
APS - Addiction potential
Gender Role
GM - Gender role, masculine
GF - Gender role, feminine
Special Indices
Welsh Code
F-K Dissimulation Index
Percent True and Percent False
Average Profile Elevation
Megargee Offender Classification System
P-A-I-N Classification
MMPI-2. Likewise, clients who fit within the lowest educational and occupational
levels might not be appropriate candidates for the MMPI-2 because of their underrepresentation
within the normative restandardization sample (Nichols, 1992). The
MMPI-2 is available on audiocassette and computer-adapted software and in
Spanish, French, and Hmong.
Minnesota Multiphasic Personality Inventory-Adolescent
(MMPI-A)
The Minnesota Multiphasic Personality Inventory-Adolescent (MMPI-A; Butcher et
al., 1992) is a 478-item, true-false, self-report inventory designed for use with adolescents
ages 14-18 years to assess some of the major patterns of personality and
emotional disorders. The derived scales are very similar to the MMPI-2 scales listed
in Table 7.10. Items measure 6 Validity Scales, 10 Clinical Scales, 15 Content Scales,
6 Supplementary Scales, and about 30 Harris-Lingoes scales. Table 7.12 provides a
sample computerized interpretive report from the Pearson software package. As with
any test, it is essential that any statements from computerized sources be validated
with other clinical information. The normative sample (n = 1,620) was very diverse,
although it may have oversampled a more educated population. It consisted of male
(n = 805) and female (n = 815) adolescents ages 14-18 years living in eight U.S.
states; one state's sample was from an American Indian reservation. There was also a
large adolescent clinical population (n = 703). Most of these subjects were paid to
complete the test (Butcher et al., 1992). This inventory requires a 6th-grade English
reading level.
Raw scores are converted to Uniform T percentile-comparable scores for
interpretation through use of convenient profile forms. Different scoring keys are used
according to gender. The MMPI-A may take up to three hours to complete and can
be scored by hand or computer. It is a Level C instrument. Sample items include
"I'm afraid to go home," "Others do not really love me," and "I feel uneasy
outdoors." Test-retest reliability results range from 0.65 to 0.84 for the Clinical scales
(Butcher et al., 1992). Strong internal consistency coefficients were reported for 4 of
the 15 basic and clinical scales (r = 0.80+); 7 of 15 were between r = 0.60 and 0.80.
Two response set indicators (VRIN and TRIN) are validity scales that show whether a
respondent is answering in an inconsistent or contradictory manner (Butcher et
al., 1992). The MMPI-A is one of the few adolescent clinical inventories to
comprehensively incorporate a number of validity scales to evaluate client response sets
(Archer & Krishnamurthy, 2002). Unfortunately, fewer MMPI-A items than items on
the adult version of the test discriminate between normal and clinical samples
(Archer & Handel, 2001).
Bright 12- and 13-year-olds can also be tested, as well as 18-year-olds who have
completed high school (Lanyon, 1995). Because the MMPI-A is a Level C instrument,
examiners are required to undergo training and supervision prior to administration,
scoring, and interpretation of this test (Butcher et al., 1992). The MMPI-A has a number
of unique features appropriate for its intended use with adolescents, yet several of the
scale labels seem outdated and/or offensive (i.e., Masculine-Feminine, Hypomania,
Hysteria, and Psychopathic Deviate) (Claiborn, 1995). "Clinicians should recognize
that not all adolescents have the necessary skills to complete the MMPI-A" if their
reading comprehension skills are inadequate or if their cultural background and life
experiences are out of the range of the test (Butcher et al., 1992, p. 27). (Special
learning problems and English as a second language may preclude the prerequisite
reading comprehension, including idioms or other cultural meanings.) It may be
prudent to break the testing up into smaller sessions because some adolescents may
Table 7.12 MMPI-A The Minnesota Report: Adolescent System Interpretive Report
by Butcher & Williams for Rachel, female, age 15, outpatient mental
health center
Validity Considerations
She had a tendency to inconsistently respond True without adequate attention to item meaning.
Although her TRIN score is not elevated enough to invalidate her MMPI-A, caution is suggested
in interpreting and using the resulting profiles (see Figure 7.3).
Symptomatic Behavior
This adolescent is immature, impulsive, and hedonistic, and she frequently rebels against
authority. She may be hostile, aggressive, and frustrated. She seems unable to learn from
punishing experiences and repeatedly gets into the same type of trouble. Many young people
with this clinical profile develop severe acting-out problems and have legal, family, or school
difficulties. This individual's nonconforming and impulsive lifestyle probably includes alcohol or
drug problems.
Many externalizing behavior problems are likely. Her friends are frequently in trouble.
They may cheat others and lie to avoid problems. They show little remorse for their misbehavior.
If their difficulties pile up, they may run away.
The highest clinical scale (see Figure 7.4) in her MMPI-A clinical profile, Pd, occurs with
very high frequency in adolescent alcohol/drug or psychiatric treatment units. Over 24% of girls
in treatment settings have this well-defined peak score (i.e., with the Pd scale at least 5 points
higher than the next scale). The Pd scale is among the least frequently occurring peak elevations
in the normative girls' sample (about 3%).
Her MMPI-A Content scales profile (see Figure 7.5) reveals important areas to consider in
her evaluation. She endorsed a number of very negative opinions about herself. She reported
feeling unattractive, lacking self-confidence, feeling useless, having little ability and several faults,
and not being able to do anything well. She may be easily dominated by others.
She reported numerous problems in school, both academic and behavioral. She has limited
expectations of success in school and is not very interested or invested in succeeding. She
reported several symptoms of anxiety, including tension, worries, and difficulties sleeping.
Symptoms of depression were reported.
Interpersonal Relations
She may appear charming and tends to make a good first impression, but she is selfish,
hedonistic, and untrustworthy in interpersonal relations. She seems interested only in her own
pleasure and is insensitive to the needs of others. She seems unable to experience guilt over
causing others trouble.
Because she is unable to form stable, warm relationships, her current relationships are likely
to be quite strained. In addition, she is likely to be openly hostile and resentful at times.
Some interpersonal issues are suggested by her MMPI-A Content scales profile. Family
problems are quite significant in this person's life. She reports numerous problems with her
parents and other family members. She describes her family in terms of discord, jealousy, fault
finding, anger, serious disagreements, lack of love and understanding, and very limited
communication. She looks forward to the day when she can leave home for good, and she does
not feel that she can count on her family in times of trouble. Her parents and she often disagree
about her friends. She indicates that her parents treat her like a child and frequently punish her
without cause. Her family problems probably have a negative effect on her behavior in school.
She feels uncomfortable emotional distance from others. She may believe that other people do
continued
[Profile plot omitted; T-score axis runs from 30 to 110]
Scale          VRIN  TRIN    F1    F2     F     L     K
Raw Score:        6    13    12    10    22     4    13
T Score:         58    73    79    62    70    59    53
Response %:     100   100   100   100   100   100   100
Cannot Say (Raw):
Percent True: 54
Percent False: 46
Figure 7.3 MMPI-A validity pattern
[Profile plot omitted; T-score axis runs from 30 to 110]
Scale            Hs     D    Hy    Pd    Mf    Pa    Pt    Sc    Ma
Raw Score:        6    25    26    38    25    20    29    40    28
T Score:         44    57    55    84    59    68    59    68    64

Scale            Si MAC-R   ACK   PRO   IMM     A     R
Raw Score:       27    23     1    24    27    28    13
T Score:         50    58    39    67    74    64    49

Response %: 100 for each of the 16 scales
Figure 7.4 MMPI-A Basic and Supplementary Scales profile
[Profile plot omitted; T-score axis runs from 30 to 110]
Scale        ANX OBS DEP HEA ALN BIZ ANG CYN CON LSE LAS SOD FAM SCH TRT
Raw Score:    15  12  18   5  12   4   9  14  13  14  10   4  23  11  15
T Score:      65  64  68  44  69  50  49  50  63  77  66  43  73  67  64
Response %:  100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
Figure 7.5 MMPI-A Content Scales profile
Table 7.12 continued
not like, understand, or care about her. She reports having no one, including parents or friends,
to rely on.
Behavioral Stability
The relative elevation of the highest scale (Pd) in her clinical profile shows very high profile
definition. Her peak scores are likely to remain very prominent in her profile pattern if she is
retested at a later date. Her clinical profile tends to be associated with long-standing behavior
problems.
Diagnostic Considerations
A diagnosis of one of the disruptive behavior disorders is highly likely given her elevations on Pd
and A-con.
Given her elevation on the School Problems scale, her diagnostic evaluation could include
assessment of possible academic skills deficits and behavior problems. Academic
underachievement, a general lack of interest in any school activities, and low expectations of
success are likely to play a role in her problems. Her endorsement of a significant number of
depressive symptoms should be considered when arriving at a diagnosis.
She appears to be having difficulties that may involve the use of alcohol or other drugs.
Adolescents with high scores on the PRO scale are usually involved with a peer group that uses
alcohol or other drugs. This individual's involvement in an alcohol- or drug-using lifestyle
should be further evaluated. Her use of alcohol or other drugs may be contributing to problems
at home or in school. However, she has not acknowledged through her item responses that she
has problems with alcohol or other drugs.
Treatment Considerations
Her serious conduct disturbance should figure prominently in any treatment planning. Her
Clinical scales profile suggests that she is a poor candidate for traditional, insight-oriented
psychotherapy. A behavioral strategy is suggested. Clearly stated contingencies that are
consistently followed are important for shaping more appropriate behaviors. Punishment
techniques seem to have more limited success than positive rewards for appropriate behaviors.
Treatment in a more controlled setting may need to be considered if there is no improvement in
her behavior.
Her very high potential for developing alcohol or drug problems requires attention in
therapy if important life changes are to be made. However, her relatively low awareness of or
reluctance to acknowledge problems in this area might impede treatment efforts.
She should be evaluated for the presence of suicidal thoughts and any possible suicidal
behaviors. If she is at risk, appropriate precautions should be taken.
Her family situation, which is full of conflict, should be considered in her treatment
planning. Family therapy may be helpful if her parents or guardians are willing and able to work
on conflict resolution. However, if family therapy is not feasible, it may be profitable during the
course of her treatment to explore her considerable anger at and disappointment in her family.
Alternate sources of emotional support from adults (e.g., foster parent, teacher, other relative,
friend's parent, or neighbor) could be explored and facilitated in the absence of caring parents.
There are some symptom areas suggested by the Content scales profile that the therapist
may wish to consider in initial treatment sessions. Her endorsement of internalizing symptoms
of anxiety and depression could be explored further.
She endorsed some items that indicate possible difficulties in establishing a therapeutic
relationship. She may be reluctant to self-disclose, she may be distrustful of helping professionals
and others, and she may believe that her problems cannot be solved. She may be unwilling to
assume responsibility for behavior change or to plan for her future.
This adolescent's emotional distance and discomfort in interpersonal situations must be
considered in developing a treatment plan. She may have difficulty self-disclosing, especially in
groups. She may not appreciate receiving feedback from others about her behavior or problems.
Note: This MMPI-A interpretation can serve as a useful source of hypotheses about adolescent
clients. This report is based on objectively derived scale indexes and scale interpretations that
have been developed with diverse groups of clients from adolescent treatment settings. The
personality descriptions, inferences, and recommendations contained herein need to be verified
by other sources of clinical information because individual clients may not fully match the
prototype. The information in this report should most appropriately be used by a trained,
qualified test interpreter. The information contained in this report should be considered
confidential.
Source: MMPI-A, Minnesota Multiphasic Personality Inventory — Adolescent and The Minnesota Report
trademarks of the Regents of the University of Minnesota. Distributed exclusively by NCS Pearson, Inc.,
Minneapolis, MN. Copyright 1992 license from the Regents of the University of Minnesota. All rights
reserved. Reprinted by permission of the University of Minnesota.
be too easily distracted or unable to complete the test in one sitting (Butcher et al.,
1992). The MMPI-A is a good tool that can help to measure psychopathology in
adolescents (Archer & Krishnamurthy, 2002; Claiborn, 1995) and is very useful in
planning, directing, and evaluating treatment (Lanyon, 1995).
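The profile-definition rule quoted in Table 7.12 (a peak is well defined when the highest scale is at least 5 T-score points above the next highest scale) can be illustrated with Rachel's T scores from Figure 7.4. The Python sketch below is for illustration only: the function name, the data layout, and the choice to include Si alongside the nine scales in the first row of the figure are assumptions of this example, not part of the Pearson scoring software.

```python
# T scores for Rachel's basic scales, transcribed from Figure 7.4.
BASIC_T = {
    "Hs": 44, "D": 57, "Hy": 55, "Pd": 84, "Mf": 59,
    "Pa": 68, "Pt": 59, "Sc": 68, "Ma": 64, "Si": 50,
}

def peak_definition(t_scores, min_gap=5):
    """Return (peak scale, gap to the next highest scale, well-defined flag).

    min_gap=5 follows the 5-point rule quoted in the interpretive report.
    """
    ranked = sorted(t_scores.items(), key=lambda kv: kv[1], reverse=True)
    (peak, top), (_, second) = ranked[0], ranked[1]
    gap = top - second
    return peak, gap, gap >= min_gap

peak, gap, well_defined = peak_definition(BASIC_T)
print(peak, gap, well_defined)  # Pd 16 True
```

Here Pd (T = 84) exceeds the next highest scales, Pa and Sc (T = 68), by 16 points, which is consistent with the report's description of a well-defined Pd peak. Such a calculation only restates the profile; it does not replace interpretation by a trained examiner.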
Millon Clinical Multiaxial Inventory-III (MCMI-III)
The Millon Clinical Multiaxial Inventory-III (MCMI-III) (Millon, Davis, &
Millon, 1997) is a 175-item, true-false, self-report inventory designed to provide
diagnostic and treatment information to clinicians in the areas of personality disorders
and clinical syndromes. Scale items measure 11 types of Clinical Personality Patterns
(Schizoid, Avoidant, Depressive, Dependent, Histrionic, Narcissistic, Antisocial,
Aggressive [Sadistic], Compulsive, Passive-Aggressive [Negativistic], Self-Defeating);
3 Severe Personality Pathologies (Schizotypal, Borderline, Paranoid); 7 Clinical
Syndromes (Anxiety, Somatoform, Bipolar: Manic, Dysthymia, Alcohol
Dependence, Drug Dependence, Post-Traumatic Stress Disorder); 3 Severe Clinical
Syndromes (Thought Disorder, Major Depression, Delusional Disorder); and 4
Modifying Indices (Disclosure, Desirability, Debasement, Validity). These scales are
grouped to reflect distinctions between acute clinical disorders pertinent to the
DSM-IV Axis I and the enduring personality characteristics found on DSM-IV Axis
II (Millon et al., 1997). The total normative population (n = 998) consisted of male
and female volunteer adults ages 18-88 years from 26 states and Canada
(development sample n = 600 and cross-validation sample n = 398).
Except for Scale V (Validity) raw scores, raw scores are converted to Base Rate
(BR) scores for interpretation. Different BR transformation tables are used for males
and females and provide cutoff points on the continuums for the 24 clinical scales
(BR 0 = raw score 0, BR 60 = median raw score, BR 115 = highest raw score). A BR
score of 75 or higher is an indication of psychopathology (Millon et al., 1997;
Erford, 2006). The MCMI-III usually requires about 20 to 30 minutes to complete
and can be scored by hand and interpreted in about 20 to 40 minutes. It can also be
sent to the publisher by mail, or scored by onsite computer software in about 5
minutes (Erford, 2006). Sample items include "I've become very anxious lately," "I often
feel tired," and "I often make people angry." Internal consistency reliabilities range
from 0.66 for the Compulsive scale to 0.90 for the Major Depression scale. Twenty
of the 24 scales have reliabilities of 0.80 or higher. Test-retest reliability results range
from 0.82 to 0.96 for a 5- to 14-day interval (Millon et al., 1997). The median
stability coefficient is 0.91, which indicates high stability for use of the test over short
periods. Criterion-related validity correlations are moderate in magnitude (Erford,
2006).
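The interpretive rule above (a BR score of 75 or higher indicates psychopathology) can be applied mechanically to a profile. The Python sketch below uses the clinical-scale BR scores transcribed from Figure 7.6. The names `BR_CUTOFF`, `BR_SCORES`, and `elevated_scales` are assumptions of this example. The sketch does not reproduce the gender-specific raw-to-BR transformation tables, and it leaves out the Modifying Indices because they are not clinical scales.

```python
BR_CUTOFF = 75  # threshold stated in the text (Millon et al., 1997)

# BR scores for the 24 clinical scales, transcribed from Figure 7.6.
BR_SCORES = {
    "Schizoid": 64, "Avoidant": 86, "Depressive": 87, "Dependent": 88,
    "Histrionic": 16, "Narcissistic": 46, "Antisocial": 66, "Sadistic": 56,
    "Compulsive": 16, "Negativistic": 58, "Masochistic": 71,
    "Schizotypal": 64, "Borderline": 95, "Paranoid": 70,
    "Anxiety Disorder": 95, "Somatoform Disorder": 76,
    "Bipolar: Manic Disorder": 63, "Dysthymic Disorder": 76,
    "Alcohol Dependence": 61, "Drug Dependence": 82,
    "Post-traumatic Stress": 76,
    "Thought Disorder": 66, "Major Depression": 99,
    "Delusional Disorder": 66,
}

def elevated_scales(br_scores, cutoff=BR_CUTOFF):
    """Return (scale, BR) pairs at or above the cutoff, highest BR first."""
    flagged = [(name, br) for name, br in br_scores.items() if br >= cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

for name, br in elevated_scales(BR_SCORES):
    print(f"{name}: BR {br}")
```

For this profile the sketch flags ten scales, led by Major Depression (BR 99). As the text stresses, such elevations are hypotheses that must be checked against other clinical information.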
The MCMI-III is designed for adults 18 years and older who are seeking, or are
in, mental health treatment. Since the MCMI-III is a Level C instrument, examiners
are required to have "a graduate degree in psychology or a related field, or
appropriate licensure, a course in testing theory, coursework in personality theory, or
abnormal psychology, and appropriate experience under supervision" (Erford, 2006, p.
41). The MCMI-III's theoretical conceptualization and prototypes are familiar to
many clinicians because they are often covered in graduate coursework and clinical
literature. "Because it also offers scales measuring clinical syndromes (Axis I of the
DSM-IV), the diagnostician does not have to resort to a different instrument in order
to assess those areas of functioning" (Choca, 2001, p. 766). Clinicians can also make
adjustments to the cutoff scores that place a client along a continuum of pathology
based on estimates of the prevalence rate within a particular setting or local area
(Widiger, 2001). Weaknesses of the MCMI-III include the complex hand scoring
process, overrepresentation of Whites and people who differ in levels of educational
experience, and underrepresentation of most minority groups (Erford, 2006). Use
with various cultures (e.g., Korean) must be undertaken with caution (Erford, 2006;
Gunsalus & Kelly, 2001). Table 7.13 provides a computerized interpretive report for
the protocol of a 44-year-old, divorced, White female outpatient. As always,
information from a computerized report must be validated by other clinical information.
Table 7.13 MCMI-III sample computerized interpretive report of a female, age 44,
White, divorced outpatient never hospitalized (Millon)
Capsule summary
MCMI-III reports are normed on patients who were in the early phases of assessment or
psychotherapy for emotional discomfort or social difficulties. Respondents who do not fit this
normative population or who have inappropriately taken the MCMI-III for nonclinical purposes
may have distorted reports. The MCMI-III report cannot be considered definitive. It should be
evaluated in conjunction with additional clinical data. The report should be evaluated by a
mental health clinician trained in the use of psychological tests. The report should not be shown
to patients or their relatives.
Interpretive considerations
The client is a 44-year-old divorced White female. She is currently being seen as an outpatient,
and she did not identify specific problems and difficulties of an Axis I nature in the demographic
portion of this test.
This patient's response style may indicate a tendency to magnify illness, an inclination to
complain, or feelings of extreme vulnerability associated with a current episode of acute turmoil.
The patient's scale scores may be somewhat exaggerated, and the interpretations should be read
with this in mind.
Profile severity
On the basis of the test data, it may be assumed that the patient is experiencing a severe mental
disorder; further professional observation and inpatient care may be appropriate. The text of the
following interpretive report may need to be modulated upward given this probable level of
severity.
Possible diagnoses
She appears to fit the following Axis II classifications best: Negativistic (Passive-Aggressive)
Personality Disorder, and Borderline Personality Disorder, with Dependent Personality Traits,
and Depressive Personality Traits.
Axis I clinical syndromes are suggested by the client's MCMI-III profile in the areas of
Major Depression (recurrent, severe, without psychotic features), Generalized Anxiety Disorder,
and Psychoactive Substance Abuse NOS (see Figure 7.6).
Therapeutic considerations
Inconsistent and pessimistic, this patient may expect to be mishandled, if not harmed, even by
well-intentioned therapists. Sensitive to messages of disapproval and lack of interest, she may
complain excessively and be irritable and erratic in her relations with therapists. Straightforward
and consistent communication may moderate her dependent/negativistic attitude. Focused, brief
treatment approaches are likely to overcome her initial oppositional outlook.
continued
[Bar chart of BR scores omitted; the BR axis is marked at 60, 75, 85, and 115]
Category                       Scale   Raw    BR   Diagnostic Scales
Modifying Indices                      163    93   Disclosure
                                         4    20   Desirability
                                        28    91   Debasement
Clinical Personality Patterns  1        13    64   Schizoid
                               2A       20    86   Avoidant
                               2B       20    87   Depressive
                               3        22    88   Dependent
                               4         7    16   Histrionic
                               5        12    46   Narcissistic
                               6A       14    66   Antisocial
                               6B       14    56   Sadistic
                               7         8    16   Compulsive
                               8A       24    58   Negativistic
                               8B       13    71   Masochistic
Severe Personality Pathology   S        16    64   Schizotypal
                               C        23    95   Borderline
                               P        15    70   Paranoid
Clinical Syndromes                      17    95   Anxiety Disorder
                                        13    76   Somatoform Disorder
                                        11    63   Bipolar: Manic Disorder
                                        17    76   Dysthymic Disorder
                                         8    61   Alcohol Dependence
                                        14    82   Drug Dependence
                                        18    76   Post-traumatic Stress
Severe Clinical Syndromes      SS       17    66   Thought Disorder
                               CC       21    99   Major Depression
                               PP        7    66   Delusional Disorder
Figure 7.6 MCMI-III profile for female, age 44
Source: Copyright © 1994 DICANDRIEN, Inc. All rights reserved. Reprinted by permission of Pearson Assessments, NCS Pearson, Inc.
MCMI-III™ and Millon™ are trademarks of DICANDRIEN, Inc.
Table 7.13 continued
Response tendencies
This patient's response style may indicate a broad tendency to magnify the level of experienced
illness or a characterological inclination to complain or to be self-pitying. On the other hand,
the response style may convey feelings of extreme vulnerability that are associated with a current
episode of acute turmoil. Whatever the impetus for the response style, the patient's scale scores,
particularly those on Axis I, may be somewhat exaggerated, and the interpretation of this profile
should be made with this consideration in mind.
The BR scores reported for this individual have been modified to account for the high self-
revealing inclinations indicated by the high raw score on Scale X (Disclosure) and the psychic
tension and dejection indicated by the elevations on Scale A (Anxiety) and Scale D (Dysthymia).
Axis II: Personality patterns
The following paragraphs refer to those enduring and pervasive personality traits that underlie
this woman's emotional, cognitive, and interpersonal difficulties. Rather than focus on the
largely transitory symptoms that make up Axis I clinical syndromes, this section concentrates on
her more habitual and maladaptive methods of relating, behaving, thinking, and feeling.
There is reason to believe that at least a moderate level of pathology characterizes the overall
personality organization of this woman. Defective psychic structures suggest a failure to develop
adequate internal cohesion and a less than satisfactory hierarchy of coping strategies. This
woman's foundation for effective intrapsychic regulation and socially acceptable interpersonal
conduct appears deficient or incompetent. She is subjected to the flux of her own enigmatic
attitudes and contradictory behavior, and her sense of psychic coherence is often precarious. She
has probably had a checkered history of disappointments in her personal and family
relationships. Deficits in her social attainments may also be notable as well as a tendency to
precipitate self-defeating vicious circles. Earlier aspirations may have resulted in frustrating
setbacks, and efforts to achieve a consistent niche in life may have failed. Although she is usually
able to function on a satisfactory basis, she may experience periods of marked emotional,
cognitive, or behavioral dysfunction.
The MCMI-III profile of this woman suggests her marked dependency needs, deep and
variable moods, and impulsive, angry outbursts. She may anxiously seek reassurance from others
and is especially vulnerable to fear of separation from those who provide support, despite her
frequent attempts to undo their efforts to be helpful. Dependency fears may compel her to be
alternately overly compliant, profoundly gloomy, and irrationally argumentative and negativistic.
Almost seeking to court undeserved blame and criticism, she may appear to find circumstances
to anchor her feeling that she deserves to suffer.
She strives at times to be submissive and cooperative, but her behavior has become
increasingly unpredictable, irritable, and pessimistic. She often seeks to induce guilt in others for
failing her, as she sees it. Repeatedly struggling to express attitudes contrary to her feelings, she
may exhibit conflicting emotions simultaneously toward others and herself; most notable are
love, rage, and guilt. Also notable may be her confusion over her self-image, her highly variable
energy levels, easy fatigability, and her irregular sleep-wake cycle.
She is particularly sensitive to external pressure and demands, and she may vacillate
between being socially agreeable, sullen, self-pitying, irritably aggressive, and contrite. She may
make irrational and bitter complaints about the lack of care expressed by others and about being
treated unfairly. This behavior keeps others on edge, never knowing if she will react to them in a
cooperative or a sulky manner. Although she may make efforts to be obliging and submissive to
others, she has learned to anticipate disillusioning relationships, and she often creates the
continued
Table 7.13 continued
expected disappointment by constantly questioning and doubting the genuine interest and
support shown by others. Self-destructive acts and suicidal gestures may be employed to gain
attention. These irritable testing maneuvers may exasperate and alienate those on whom she
depends. When threatened by separation and disapproval, she may express guilt, remorse, and
self-condemnation in the hope of regaining support, reassurance, and sympathy.
Axis I: Clinical syndromes
The features and dynamics of the following Axis I clinical syndromes appear worthy of
description and analysis. They may arise in response to external precipitants but are likely to
reflect and accentuate several of the more enduring and pervasive aspects of this woman's basic
personality makeup.
Testy and demanding, this woman evinces an agitated, major depression that can be noted
by her daily moodiness and vacillation. She is likely to display a rapidly shifting mix of
disparaging comments about herself, anxiously expressed suicidal thoughts, and outbursts of
bitter resentment interwoven with a demanding irritability toward others. Feeling trapped by
constraints imposed by her circumstances and upset by emotions and thoughts she can neither
understand nor control, she has turned her reservoir of anger inward, periodically voicing severe
self-recrimination and self-loathing. These signs of contrition may serve to induce guilt in
others, an effective manipulation in which she can give a measure of retribution without further
jeopardizing what she sees as her currently precarious, if not hopeless, situation.
Failing to keep deep and powerful sources of inner conflict from overwhelming her
controls, this characteristically difficult and conflicted woman may be experiencing the clinical
signs of an anxiety disorder. She is unable to rid herself of preoccupations with her tension,
fearful presentiments, recurring headaches, fatigue, and insomnia, and she is upset by their
uncharacteristic presence in her life. Feeling at the mercy of unknown and upsetting forces that
seem to well up within her, she is at a loss as to how to counteract them, but she may exploit
them to manipulate others or to complain at great length.
Abuse of either legal or street drugs or both is indicated in the MCMI-III protocol of this
woman, who is often erratic, irritable, and negativistic. Her use of drugs may be both a
statement of resentful independence from the constraints of conventional life and a means of
disjoining her conflicts and liberating her uncharitable impulses toward others. An act of
assertive defiance that has undertones of self-destruction, her drug abuse may be employed with
a careless indifference to its consequences.
Related to but beyond her characteristic level of emotional responsivity, this woman
appears to have been confronted with an event or events in which she was exposed to a severe
threat to her life, a traumatic experience that precipitated intense fear or horror on her part.
Currently, the residuals of this event resemble or symbolize an aspect of the traumatic event.
Where possible, she seeks to avoid such cues and recollections. Where they cannot be anticipated
and actively avoided, as in dreams or nightmares, she may become terrified, exhibiting a number
of symptoms of intense anxiety. Other signs of distress might include difficulty falling asleep,
outbursts of anger, panic attacks, hypervigilance, exaggerated startle response, or a subjective
sense of numbing and detachment.
This moody and conflicted woman's bodily preoccupations and concerns are likely to be
produced by both physical and psychological factors, resulting in a syndrome of features
suggestive of a somatoform disorder. Enmeshed in an erratic pattern of resentment and brittle
emotions, her anxious concerns about her somatic state aggravate her characteristic sullenness,
leading her to demand attention and special treatment. Not only does she exploit her ailments to
control the lives of others, but she is also likely to complain of her discomfort in ways that
induce others to feel guilty.
Possible DSM-IV multiaxial diagnoses
The following diagnostic assignments should be considered judgments of personality and clinical
prototypes that correspond conceptually to formal diagnostic categories. The diagnostic criteria
and items used in the MCMI-III differ somewhat from those in the DSM-IV, but there are
sufficient parallels in the MCMI-III items to recommend consideration of the following
assignments. It should be noted that several DSM-IV Axis I syndromes are not assessed in the
MCMI-III. Definitive diagnoses must draw on biographical, observational, and interview data in
addition to self-report inventories such as the MCMI-III.
Axis I: Clinical syndrome
The major complaints and behaviors of the patient parallel the following Axis I diagnoses, listed
in order of their clinical significance and salience.
296.33 Major Depression (recurrent, severe, without psychotic features)
300.02 Generalized Anxiety Disorder
305.90 Psychoactive Substance Abuse NOS
Axis II: Personality disorders
Deeply ingrained and pervasive patterns of maladaptive functioning underlie Axis I clinical
syndromal pictures. The following personality prototypes correspond to the most probable
DSM-IV diagnoses (Disorders, Traits, Features) that characterize this patient.
Personality configuration composed of the following:
301.84 Negativistic (Passive-Aggressive) Personality Disorder
301.83 Borderline Personality Disorder with Dependent Personality Traits and Depressive
Personality Traits
Course: The major personality features described previously reflect long-term or chronic traits
that are likely to have persisted for several years prior to the present assessment. The clinical
syndromes described previously tend to be relatively transient, waxing and waning in their
prominence and intensity depending on the presence of environmental stress.
Axis IV: Psychosocial and environmental problems
In completing the MCMI-III this individual identified the following problems that may be
complicating or exacerbating her present emotional state. They are listed in order of importance
as indicated by the client. This information should be viewed as a guide for further investigation
by the clinician.
None identified
Treatment guide
If additional clinical data are supportive of the MCMI-III's hypotheses, it is likely that this
patient's difficulties can be managed with either brief or extended therapeutic methods. The
following guide to treatment planning is oriented toward issues and techniques of a short-term
character, focusing on matters that might call for immediate attention, followed by time-limited
procedures designed to reduce the likelihood of repeated relapses.
As a first step, it would appear advisable to implement methods to ameliorate this patient's
current state of clinical anxiety, depressive hopelessness, or pathological personality functioning
by the rapid implementation of supportive psychotherapeutic measures. With appropriate
consultation, targeted psychopharmacologic medications may also be useful at this initial stage.
Worthy of note is the possibility of a troublesome alcohol and/or substance-abuse disorder.
If verified, appropriate short-term behavioral management or group therapy programs should be
rapidly implemented.
Table 7.13 continued
Once this patient's more pressing or acute difficulties are adequately stabilized, attention
should be directed toward goals that would aid in preventing a recurrence of problems, focusing
on circumscribed issues and employing delimited methods such as those discussed in the
following paragraphs.
A primary short-term goal of treatment with this patient is to aid her in reducing her
intense ambivalence and growing resentment of others. With an empathic and brief focus, it
should be possible to sustain a productive, therapeutic relationship. With a therapist who can
convey genuine caring and firmness, she may be able to overcome her tendency to employ
maneuvers to test the sincerity and motives of the therapist. Although she will be slow to reveal
her resentment because she dislikes being viewed as an angry person, it can be brought into the
open, if advisable, and dealt with in a kind and understanding way. She is not inclined to face
her ambivalence, but her mixed feelings and attitudes must be a major focus of treatment. To
prevent her from trying to terminate treatment before improvement occurs or to forestall
relapses, the therapist should employ brief and circumscribed techniques to counter the patient's
expectation that supportive figures will ultimately prove disillusioning.
Circumscribed interpersonal approaches (e.g., Benjamin, Kiesler) may be used to deal with
the seesaw struggle enacted by the patient in her relationship with her therapist. She may
alternately exhibit ingratiating submissiveness and a taunting and demanding attitude. Similarly,
she may solicit the therapist's affections, but when these are expressed, she may reject them,
voicing doubt about the genuineness of the therapist's feelings. The therapist may use cognitive
procedures to point out these contradictory attitudes. It is important to keep these
inconsistencies in focus or the patient may appreciate the therapist's perceptiveness verbally but
not alter her attitudes. Involved in an unconscious repetition-compulsion in which she recreates
disillusioning experiences that parallel those of the past, the patient must not only come to
recognize the expectations cognitively but may be taught to deal with their enactment
interpersonally.
Despite her ambivalence and pessimistic outlook, there is good reason to operate on the
premise that the patient can overcome past disappointments. To capture the love and attention
only modestly gained in childhood cannot be achieved, although habits that preclude partial
satisfaction can be altered in the here and now. Toward that end, the therapist must help her
disentangle needs that are in opposition to one another. For example, she both wants and does
not want the love of those upon whom she depends. Despite this ambivalence, she enters new
relationships, such as in therapy, as if an idyllic state could be achieved. She goes through the act
of seeking a consistent and true source of love, one that will not betray her as she believes her
parents and others did in the past. Despite this optimism, she remains unsure of the trust she
can place in others. Mindful of past betrayals and disappointments, she begins to test her new
relationships to see if they are loyal and faithful. In a parallel manner, she may attempt to irritate
and frustrate the therapist to check whether he or she will prove to be as fickle and insubstantial
as others have in the past. It is here that the therapist's warm support and firmness can play a
significant short-term role in reframing the patient's erroneous expectations and in exhibiting
consistency in relationship behavior.
Although the rooted character of these attitudes and behavior will complicate the ease with
which these therapeutic procedures will progress, short-term and circumscribed cognitive and
interpersonal therapy techniques may be quite successful. A thorough reconstruction of
personality may not be necessary to alter the patient's problematic pattern. In this regard, family
treatment methods that focus on the network of relationships that often sustain her problems
may prove to be a useful technique. Group methods may also be fruitfully employed to help the
patient acquire self-control and consistency in close relationships.
It is advisable that the therapist not set goals too high because the patient may not be able
to tolerate demands or expectations well. Brief therapeutic efforts should be directed to build the
patient's trust, to focus on positive traits, and to enhance her confidence and self-esteem.
Source: MCMI-III and Millon are trademarks of DICANDRIEN, Inc. MCMI-III interpretive report
copyright 1994 by DICANDRIEN, Inc. All rights reserved. Reprinted by permission of Pearson
Assessments, NCS Pearson, Inc.
Millon Adolescent Clinical Inventory (MACI)
The Millon Adolescent Clinical Inventory (MACI) (Millon, Millon, & Davis, 1993)
is a 160-item inventory that requires a 6th-grade reading level. The MACI is de-
signed to assess an adolescent's personality, along with self-reported concerns and
clinical syndromes using 27 content scales and 4 response bias scales: Personality
Patterns, Expressed Concerns, Clinical Syndromes, and Modifying Indices. For fur-
ther breakdown of the scales, see Table 7.14. These scales coordinate with descriptive
characteristics in recent DSM classifications (Millon et al., 1993). The test was
normed using 13- to 19-year-olds. The development sample (n = 579) was 54% male
and 46% female. The two cross-validation samples (n = 139, n = 194) were 53% and
65% male and 47% and 35% female, respectively (Millon et al., 1993).
Over 1,000 adolescents and their clinicians from 28 states and Canada were involved
in the development of the MACI.
The MACI usually requires about 20 to 40 minutes to complete and can be
scored by hand in about 20 minutes, sent to the publisher by mail, or scored by com-
puter onsite in about 5 minutes (Erford, 2006). Sample items include "I have an at-
tractive body," "I go on eating binges frequently," and "I enjoy fighting." Internal
consistency reliabilities for the development sample ranged from 0.73 for Scales
D (Sexual Discomfort) and Y (Desirability) to 0.91 for Scale B (Self-Devaluation).
Except for Scale W (Reliability) scores, raw scores are converted to Base Rate Scores
(BRS) for interpretation. Different BR transformation tables are used depending on
the age and gender of the adolescent and are adjusted to a value that falls between 1
and 115 (Millon et al., 1993). Internal consistencies for the two cross-validation
samples combined ranged from 0.69 for Scale D (Sexual Discomfort) to 0.90 for
Scale B (Self-Devaluation). Internal consistency coefficients for the development
sample Personality Patterns scales ranged from 0.74 for Scale 3 (Submissive) to 0.90
for Scale 8B (Self-Demeaning). Test-retest reliability results ranged from 0.57 for
Scale E (Peer Insecurity) to 0.92 for Scale 9 (Borderline Tendency) for a 3- to 7-day
interval. The median stability coefficient is reported as 0.82 (Millon et al., 1993).
Criterion-related validity correlations are moderate in magnitude (Erford, 2006).
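
Internal consistency coefficients like those reported above are commonly computed as Cronbach's coefficient alpha. The following is a minimal sketch, assuming alpha is the statistic intended (the text does not name the formula). The function name and the toy item responses are hypothetical and are not MACI data.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents
    item_variances = [
        pvariance([respondent[i] for respondent in item_scores]) for i in range(k)
    ]
    # Variance of each respondent's summed (total) score
    total_variance = pvariance([sum(respondent) for respondent in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 3 respondents x 2 items that always agree
consistent = [[0, 0], [1, 1], [2, 2]]
# Hypothetical responses: 4 respondents x 2 items that are unrelated
unrelated = [[0, 0], [0, 1], [1, 0], [1, 1]]
```

Items that rise and fall together push alpha toward 1, while unrelated items push it toward 0, which is why values in the 0.70s to 0.90s are read as acceptable to high internal consistency.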
The MACI is designed for use with emotionally disturbed adolescents ages
13-19 years as an aid to help identify, predict, and understand some of the psychological
difficulties this group experiences. Because this is a Level C instrument, examiners
are required to have "a graduate degree in psychology or a related field, or appropriate
licensure, a course in testing theory, coursework in personality theory, or
abnormal psychology, and appropriate experience under supervision" (Erford, 2006,
p. 41). Strengths of the MACI include ease of scoring and interpretation, personality
variables mapped to DSM personality disorders, appropriateness of concerns frequently
expressed by emotionally disturbed adolescents, and identification of important
clinical syndromes (Retzlaff, 1995). Clinicians using the computer interpretive
report are likely to find the response cover sheet, printout, histographic display, narrative,
and list of correlated Axis I and II entities useful (Stuart, 1995). Weaknesses
of the MACI include the underrepresentation of participants ages 18-19 years in the
normative samples (Stuart, 1995). The manual clearly states that use of the MACI
for any population outside the 13-19 age designation would be inappropriate
(Millon et al., 1993). There is a lack of item and scale specificity because 160 items
attempt to score 30 scales (Retzlaff, 1995). Also, overrepresentation of Whites (78.8%)
(Stuart, 1995) and males in the normative sample may make it less appropriate for use with
some populations (Millon et al., 1993). Lastly, it may not be particularly useful as a
screening level test for the general adolescent population because the norming sample
did not include adolescents not identified as patients in treatment programs
(Stuart, 1995). Overall, the best use of the MACI is for hypothesis generation and
validation, outcomes assessment, and screening for pathology, not for diagnosis.

Table 7.14 Response bias scales and content scales

Personality patterns
  Scale 1: Introversive
  Scale 2A: Inhibited
  Scale 2B: Doleful
  Scale 3: Submissive
  Scale 4: Dramatizing
  Scale 5: Egotistic
  Scale 6A: Unruly
  Scale 6B: Forceful
  Scale 7: Conforming
  Scale 8A: Oppositional
  Scale 8B: Self-Demeaning
  Scale 9: Borderline tendency

Expressed concerns
  Scale A: Identity diffusion
  Scale B: Self-devaluation
  Scale C: Body disapproval
  Scale D: Sexual discomfort
  Scale E: Peer insecurity
  Scale F: Social insensitivity
  Scale G: Family discord
  Scale H: Childhood abuse

Clinical syndromes
  Scale AA: Eating dysfunctions
  Scale BB: Substance-abuse proneness
  Scale CC: Delinquent predisposition
  Scale DD: Impulsive propensity
  Scale EE: Anxious feelings
  Scale FF: Depressive affect
  Scale GG: Suicidal tendency

Modifying indices
  Scale X: Disclosure
  Scale Y: Desirability
  Scale Z: Debasement

Other
  Scale W: Reliability
Achenbach System of Empirically Based Assessment (ASEBA)
The Achenbach System of Empirically Based Assessment (ASEBA) (Achenbach &
Rescorla, 2001) is a series of multi-informant inventories for rating the behavior of
children ages 1½-5 years and another for children ages 6-18 years. Each is designed
to assess competencies, adaptive functioning, and other problems through the use of
four forms: the Child Behavior Checklist (CBCL/1½-5 and CBCL/6-18; i.e., parent
report forms), the Youth Self-Report (YSR; for children ages 11-18 years), and the
Teacher's Report Form (TRF). Items measure six DSM-oriented scales that include Affective
Problems, Anxiety Problems, Attention Deficit/Hyperactivity Problems, Conduct
Problems, Oppositional Defiant Problems, and Somatic Problems (Achenbach &
Rescorla, 2001). Informants are prompted to rate items 0 (Not True), 1 (Somewhat
or Sometimes True), or 2 (Very True or Often True) and are invited to describe several
selections in detail. Item prompts include "Physically attacks people,"
"Inattentive," and "Wets the bed." The ASEBA can be completed by hand, on computer,
or online via the ASEBA Web-Link (www.aseba.org), which permits access to
informants in remote areas. This test takes about 15-20 minutes to complete and
can be scored by hand or computer.

[Figures 7.7 and 7.8: sample ASEBA profile forms.]
Test-retest reliability coefficients for intervals of 8-16 days were mostly in the
0.80s and 0.90s for subscales of the CBCL/6-18 and ranged from 0.91 to 0.95 for
Total Competence, Total Adaptive Functioning, and Total Problems (Achenbach &
Rescorla, 2001). "Percentiles and normalized T scores are based on national proba-
bility samples of children who had not received mental health, substance abuse, or
special education services for major behavioral, emotional, or developmental prob-
lems in the preceding 12 months" (Achenbach & Rescorla, 2001, p. 80) (see Figures
7.7 & 7.8 for sample profile forms). The ASEBA national normative sample (n =
9,052) included children from 40 states and the District of Columbia. Clinicians
may find that routine use of the ASEBA forms for intake, screening, and evaluations
gleaned from parent, teacher, and self-reports provides a broad picture of the client
and can serve as a starting point, or springboard, for discussing pertinent issues in the
clinical interview (Achenbach & Rescorla, 2001). Watson (2006) reported that it is
a psychometrically sound instrument but has some weaknesses, especially concern-
ing the scales for younger children. In addition, the directions and manuals are im-
proved over the original versions.
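
The test-retest coefficients reported above are correlations between scores from two administrations of the same form to the same informants. A minimal sketch of that calculation follows, assuming a Pearson product-moment correlation. The function name and the scores are hypothetical illustrations, not ASEBA data.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    # Sum of cross-products and sums of squares around each mean
    cross = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    ss_first = sum((a - mean_first) ** 2 for a in first)
    ss_second = sum((b - mean_second) ** 2 for b in second)
    return cross / sqrt(ss_first * ss_second)


# Hypothetical Total Problems raw scores for four children, rated twice
time_1 = [10, 12, 15, 20]
time_2 = [11, 13, 16, 21]  # every score rose by one point, so rank order is preserved
```

Because the coefficient reflects how well relative standing is preserved, a uniform shift in every score between administrations still yields a correlation of 1.0.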
The ASEBA system is one of the best behavioral assessment systems currently
available (Salvia & Ysseldyke, 2004) and can be a helpful adjunct to functional be-
havioral analysis (FBA) (Gresham, Watson, & Skinner, 2001). While the CBCL,
TRF, and YSR are the most frequently used components of ASEBA, additional com-
ponents include the Direct Observation Form (DOF); a Young Adult Self-Report
(YASR) for adults ages 18-30 years; a Young Adult Behavioral Checklist (parent report);
and a Semi-structured Clinical Interview for Children and Adolescents (SCICA)
for use with children ages 6-12 years. The ASEBA is a Level B instrument.
Personality Inventory for Children-Second Edition (PIC-2)
The PIC-2 (Lachar & Gruber, 2001) is a multidimensional clinical measure of be-
havioral, emotional, and cognitive status for children ages 3-16 years. It is a screen-
ing instrument that is usually completed by the parent. The PIC-2 has 275 items in
its standard format and contains 12 psychological scales with various subscales. The
PIC-2 also contains an abbreviated behavioral summary of 96 items. The psycholog-
ical scales include Cognitive Impairment, Impulsivity and Distractibility,
Delinquency, Family Dysfunction, Reality Distortion, Somatic Concern,
Psychological Discomfort, Social Withdrawal, Social Skills Deficits, as well as three
Response Validity scales. Parents are asked to respond to the items with True or
False answers. The standardization sample generally conformed to U.S. population
demographics with the exception of an overrepresentation of Whites and underrep-
resentation of Hispanics. There was also an overrepresentation of biological parents
and an underrepresentation of single parents (Erford & McKechnie, 2006).
No overall composite score is derived, but there are three separate composite
scale scores: Externalization-Composite, Internalization-Composite, and Social
Adjustment Composite. Raw scores can be converted to T scores when the Student
Behavior Survey, a profile form, is completed. Test-retest reliability coefficients
ranged from r = 0.82 to 0.92 and internal consistency coefficients ranged from r =
0.81 to 0.92 for the interpreted scales. Criterion validity studies were conducted but
did not use other commonly used instruments (Erford & McKechnie, 2006).
However, because this new version of the PIC-2 is a major revision of the original,
clinicians should be cautious in making diagnostic decisions using the PIC-2 until
further research and diagnostic validity studies have been conducted. The PIC-2's
primary benefit continues to be the assessment of parental perceptions of childhood
behavioral and clinical difficulties.
Devereux Scales of Mental Disorders (DSMD)
The DSMD (Naglieri, LeBuffe, & Pfeiffer, 1996) is used to assess behaviors related
to psychopathology. It can be administered to individuals as well as to groups of
children ages 5-18 years in about 15 minutes. There are two forms of the DSMD,
the child form and the adolescent form, and each can be rated by parents, teachers,
and other appropriate professionals. There are 110 items on this inventory, which
measures nine constructs, including Conduct, Attention-Delinquency, Anxiety,
Depression, Autism, Acute Problems, Internalizing Composite, Externalizing
Composite, and the Critical Pathology Composite. Responses are based on a 5-point
scale ranging from Never to Very Frequently. Raw scores can be converted into T
scores and percentile ranks. Standardization samples generally conformed to U.S.
population demographics for both children and adolescents (Cooper, 2001).
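
The conversion of raw scores to T scores and percentile ranks mentioned above can be illustrated with the standard linear definition of a T score (mean 50, standard deviation 10). This is only a sketch under stated assumptions: published instruments use norm tables from their manuals, and the normative mean and standard deviation below are hypothetical. The percentile step additionally assumes normally distributed scores.

```python
from statistics import NormalDist


def t_score(raw, norm_mean, norm_sd):
    """Linear T score: 50 plus 10 times the z score of the raw score."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


def percentile_rank(t):
    """Percentile rank of a T score, assuming a normal distribution of T scores."""
    return 100 * NormalDist(mu=50, sigma=10).cdf(t)


# Hypothetical norms: raw-score mean of 20 and standard deviation of 5
example_t = t_score(30, norm_mean=20, norm_sd=5)  # two SDs above the mean
example_percentile = percentile_rank(example_t)
```

A raw score two standard deviations above the normative mean corresponds to T = 70, which falls at roughly the 98th percentile under the normality assumption.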
Alpha coefficients were reported at about r = 0.90 or higher, and test-retest re-
liability coefficients were in the 0.80s and 0.90s. Interrater reliability coefficients be-
tween parents and teachers were in the 0.40s and 0.50s. This is not surprising given
that teachers and parents observe the child's behavior in two distinct ecological con-
texts (i.e., school and home). Validity studies yielded adequate results on all levels,
with items showing a strong congruence to DSM-IV criteria for the specific behavior
disorders in question (Peterson, 2001). There is some dispute about the composition
of the participant samples used in the reliability and validity studies and about
whether the type of participants might have inflated the coefficients. Even so, there
are substantial normative data for the DSMD, and it has emerged as a good assessment
for certain antisocial behaviors in children and adolescents.
Children's Depression Inventory (CDI)
The CDI (Kovacs, 1992) is a self-report inventory used to assess children's depression.
Parent and teacher versions are also available. It can be administered individually
or to small groups of children ages 8-17 years in about 10 to 15
minutes. This assessment contains 27 items that cover all nine symptoms for a major
depressive syndrome in children as presented in the DSM-III-R. Children's responses
are based on a 3-point scale, from 0 to 2, with 2 being the most severe (Kavan,
1992). Limited normative data are available for the CDI because it was not nationally
standardized. The standardization sample was inadequately small and geographically
restricted (Knoff, 1992). Scoring is simple and convenient, using the
QuickScore™ forms.
Reliability and validity data are also questionable. Although coefficient alphas
from two different samples reported in the manual were consistent at r = 0.86 and
0.87, respectively, many empirical studies yielded inconsistent results. Item-total
score coefficients ranged from r = 0.08 to 0.62. A one-month test-retest reliability
coefficient was r = 0.43, while a nine-week test-retest reliability coefficient was r =
0.84. Regarding validity, the CDI had adequate correlations with the Revised
Children's Manifest Anxiety Scale but yielded low correlations with Coopersmith Self-
Esteem Inventory (Kavan, 1992). The CDI has demonstrated good discrimination be-
tween clinical and nonclinical groups (Carey, Gresham, Ruggerio, Faulstich, &
Engart, 1987; Hodges, 1990). It is obvious that more empirical data need to be col-
lected with regard to the CDI and it should not be used as a diagnostic tool
(Craighead, Curry, & Ilardi, 1995; Fristad, Emery, & Beck, 1997; Knoff, 1992).
Admittedly, the construct of depression is more difficult to accurately assess in chil-
dren than adults because depressive symptoms are more transient in younger clients.
In spite of this, the CDI is easy to administer and score and may be helpful during
initial clinical assessment (Kavan, 1992). It is, perhaps, the most commonly used
screening tool for childhood depression (Craighead et al., 1995; Fristad et al., 1997).
Reynolds Adolescent Depression Scale-Second Edition (RADS-2)
The Reynolds Adolescent Depression Scale-Second Edition (RADS-2) (Reynolds,
2002) is a 30-item self-report inventory for adolescents ages 11-20 years and is designed
to assess symptoms associated with depression. Items measure four subscales:
Dysphoric Mood (DM, 8 items); Anhedonia/Negative Affect (AN, 7 items);
Negative Self-Evaluation (NS, 8 items); and Somatic Complaints (SC, 7 items).
Sample items include "I feel lonely," "I feel like running away," and "I feel like noth-
ing I do helps anymore." The items are scored on a 4-point Likert scale (Almost
Never, Hardly Ever, Sometimes, or Most of the Time) (Blair, 2005). The RADS-2 is
a Level B test and takes about 10 minutes to administer, score, and interpret. The
normative restandardization sample (n = 3,300) for the RADS-2 was composed of an
equal number of adolescent males and females living in the United States and
Canada. Compared to the 2000 U.S. Census, this sample was considered ethnically
diverse and heterogeneous in socioeconomic composition (Reynolds, 2002).
Raw scores are summed to derive a Depression Total score. The Depression
Total and four subscales can be converted to a T score or percentile rank according
to gender, age group, and gender by age group norms. More than 20 years of research
supports the psychometric qualities of the RADS-2, and the new version is found to
continue the tradition of a sound instrument (Blair, 2005). Internal consistency of
the Depression Total score was r = 0.92 (Reynolds, 2002). Test-retest reliability (two
weeks) was r = 0.86 for the Depression Total score (Reynolds, 2002). Criterion-re-
lated validity studies resulted in moderate to high correlations with other measures
of depression and indicated the RADS-2 is best used as a screening level test for de-
pression (Erford, 2006). Overall, "the RADS-2 is cost- and time-efficient, easy to use,
and a reliable and valid screening instrument for adolescents with symptoms of de-
pression" (Erford, 2006, p. 58).
The RADS-2 is one of the only depression screening tests validated for use with
adolescents (Brooks & Kutcher, 2001), and its recommended clinical cutoff of T =
61+ has been shown to identify clinically severe symptoms of depression on the
Hamilton Depression Rating Scale (HDRS) (Reynolds & Mazza, 1998). The RADS-2
is a screening test and should not be used to supplant use of a clinical interview
(Davis, 1990) and is not a substitute for an interview of suicidal ideation (Reynolds,
2002). Volpe and DuPaul (2001) also indicated the RADS-2 shows some usefulness
in monitoring the effects of treatment and as one component in a comprehensive di-
agnostic approach for depression.
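
The RADS-2 scoring steps described above (subscale raw scores summed to a Depression Total, then a T score compared with the recommended cutoff of 61) can be sketched as follows. The subscale item counts and the cutoff come from the text. The function names and the example raw scores are hypothetical, and the raw-to-T conversion itself is assumed to come from the norm tables in the manual, so it is not reproduced here.

```python
# Number of items per subscale, as listed in the text
SUBSCALE_ITEMS = {"DM": 8, "AN": 7, "NS": 8, "SC": 7}
CLINICAL_CUTOFF_T = 61  # recommended clinical cutoff (T = 61+)


def depression_total(subscale_raw):
    """Sum the four subscale raw scores into the Depression Total raw score."""
    missing = set(SUBSCALE_ITEMS) - set(subscale_raw)
    if missing:
        raise ValueError(f"missing subscale scores: {sorted(missing)}")
    return sum(subscale_raw[name] for name in SUBSCALE_ITEMS)


def exceeds_cutoff(t):
    """True when a T score (looked up in the manual's norms) meets the cutoff."""
    return t >= CLINICAL_CUTOFF_T


# Hypothetical subscale raw scores for one adolescent
example_raw = {"DM": 16, "AN": 14, "NS": 16, "SC": 14}
example_total = depression_total(example_raw)
```

A score at or above the cutoff would flag the adolescent for follow-up with a clinical interview; consistent with the cautions above, it is a screening result and not a diagnosis.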
Symptom Checklist-90-Revised (SCL-90-R)
The SCL-90-R (Derogatis, 1992) portrays patterns of psychological symptoms in
patients and nonpatients. The SCL-90-R can be administered to groups or indi-
viduals ages 13 years to adult in about 15 to 20 minutes. Symptoms are measured
on 12 constructs: Somatization, Obsessive-Compulsive, Interpersonal Sensitivity,
Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation, Psychoticism,
Global Severity Index, Positive Symptom Distress Index, and Positive Symptom
Total. There are a total of 90 items on this inventory. Clients are asked to rate their
level of discomfort with a particular problem from 0 (Not at all) to 4 (Extremely). Norms
were constructed on several standardization samples, including psychiatric out-
patients, psychiatric inpatients, adult nonpatients, and adolescent nonpatients
(Pauker, 1985).
Pauker (1985) and Payne (1985) asserted that the original SCL-90 manual re-
ported satisfactory results for internal consistency (r = 0.77-0.90) and test-retest re-
liability coefficients (r = 0.78-0.90, one week apart). The few validity studies con-
ducted portrayed comparable levels to other self-report inventories; however, more
research is needed in this area. Other criticisms included a lack of clarity in the man-
ual and the possible limitations inherent in requiring an 8th-grade reading level
when using an inventory with adolescents ages 13 years and older. Strengths of the
SCL-90-R are the quick administration and scoring procedures as well as its straight-
forward scoring criteria.
Beck Depression Inventory-Second Edition (BDI-II)
The Beck Depression Inventory-Second Edition (BDI-II) (Beck et al., 1996) is a 21-item
self-report inventory used to assess the severity of depression of individuals ages
13 years or older. Each item is formatted on a 4-point scale (i.e., ranging from 0 to
3 in terms of severity) and indicates a particular depressive symptom occurring during
the past two weeks. The BDI-II has gone through several revisions since its original
publication. The last major revision changed the instrument from the BDI-IA
to the BDI-II in 1996 to correspond with the criteria for depressive disorders in the
Diagnostic and Statistical Manual of Mental Disorders — Fourth Edition (DSM-IV)
(American Psychiatric Association, 1994). On revision of the BDI-II, four items (i.e.,
Weight Loss, Body Image Change, Somatic Preoccupation, Work Difficulty) were
replaced with four new items (i.e., Agitation, Worthlessness, Concentration
Difficulty, Loss of Energy). In addition, two items (i.e., Changes in Sleeping Pattern
and Changes in Appetite) were revised by creating seven optional scales representing
differences between increases and decreases of severity. Paper-and-pencil record
forms, scannable record forms, and Spanish record forms are available. Current cost
information and online ordering information are available on the website of Harcourt Assessment,
Inc. (2004b). The BDI-II takes 5 to 10 minutes to complete. Although the BDI-II
is self-administered, a trained examiner can read the questions aloud if needed.
Administration and interpretation qualification is Level C (i.e., requires doctoral-
level training in psychology, education, counseling, or related fields, or licensure or
certification as a professional counselor or other psychological professional). Hand
scoring and computer scoring are available. Summing all the responded scales yields
a total score (maximum is 63). A total score of 14 or above indicates the possibility
of depression. Although the responses for items 2a and 2b (i.e., Changes in Sleeping
Pattern and Changes in Appetite) are not considered in calculating a total score, they
should be considered in the diagnosis of depression.
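
The total-score arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming all 21 item ratings (each 0 to 3) are summed, which is what the stated maximum of 63 implies. The function names are hypothetical, and the threshold of 14 is used only as the screening indicator given in the text.

```python
N_ITEMS = 21
MAX_RATING = 3
SCREENING_THRESHOLD = 14  # total at or above this suggests possible depression


def bdi2_total(ratings):
    """Sum 21 item ratings after checking that each one is an integer from 0 to 3."""
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
    for rating in ratings:
        if rating not in range(MAX_RATING + 1):
            raise ValueError(f"rating out of range 0-{MAX_RATING}: {rating!r}")
    return sum(ratings)


def possible_depression(total):
    """True when the total score reaches the screening threshold."""
    return total >= SCREENING_THRESHOLD
```

Validating the ratings before summing guards against the most common hand-scoring errors, such as a skipped item or a rating entered off the 0 to 3 scale.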
The normative sample for the BDI-II consisted of 500 outpatient clients from
four different psychiatric clinics in urban and suburban areas in the United States,
and 120 students from one college in Canada (Farmer, 2001). Scores on the BDI-II
have been shown to be reliable (e.g., internal consistency, test-retest reliability) and valid
(e.g., content validity, construct validity, factorial validity) (Beck et al., 1996).
Beck Anxiety Inventory (BAI)
The Beck Anxiety Inventory (BAI) (Beck et al., 1988; Beck & Steer, 1993) is a 21-
item self-report instrument used to assess the severity of anxiety of individuals ages
17 years or older. Each item on the BAI is formatted on a 4-point scale (i.e., ranging
from 0 = Not at All to 3 = Severely; "I could barely stand it") and indicates symptoms related
to anxiety during the past week. Paper-and-pencil record forms, scannable
record forms, and Spanish record forms are available. Current cost information and
online ordering information are available on the website of Harcourt Assessment,
Inc. (2004b).
Like the other Beck instruments discussed, the BAI is self-administered, but a
trained examiner can administer it verbally. The BAI takes 5 to 10 minutes to com-
plete. The administration and interpretation qualifications for this instrument are
also Level C. Hand scoring and computer scoring are available. Summing all re-
sponses yields a total score with a maximum of 63.
The first normative sample for the BAI consisted of 810 outpatient clients with
affective and anxiety disorders. Subsequent studies were conducted to determine the
reliability and validity of scores (for detailed development procedures, see Beck et al.,
1988). Beck et al. demonstrated high internal consistency and sufficient test-retest
reliability for scores on the BAI. The test authors also demonstrated convergent va-
lidity and discriminant validity. For example, the BAI was moderately correlated
with the Hamilton Anxiety Rating Scale — Revised (HARS-R) and the Cognition
Checklist Anxiety subscale (CCL-A). Beck et al. (1988) also demonstrated factorial
validity as the BAI consisted of the two factors: (1) somatic symptoms and (2) sub-
jective anxiety and panic symptoms. However, Osman, Barrios, Aukes, Osman, &
Markway (1993) discovered four factors of the BAI: (1) Subjective, (2)
Neurophysiological, (3) Autonomic, and (4) Panic.
Overall, establishing an ability to discriminate between anxiety and depression
(i.e., discriminant validity) is one of the most useful aspects of the BAI (Beck
et al., 1988). Thus professional counselors may find this tool useful for clarifying the
presenting problem and formulating effective treatment plans.
Beck Scale for Suicide Ideation (BSSI)
The Beck Scale for Suicide Ideation (BSSI) (Beck, Kovacs, & Weissman, 1979) is a
21-item self-report inventory used to assess the severity of suicide ideation of individuals
ages 17 years or older. Suicide ideators are defined as "individuals who currently
have plans and wishes to commit suicide but have not made any recent overt
suicide attempt" (Beck, Kovacs, & Weissman, 1979, p. 344). Beck et al. (1979) first
developed a 19-item Scale for Suicide Ideation (SSI) to assess suicide intention. An
examiner completes the SSI by asking each item in a semi-structured interview for-
mat and recording the client's responses. The SSI was revised into the BSSI in 1991
through the creation of a self-report format. Paper-and-pencil record forms,
scannable record forms, and Spanish record forms are now available. Current cost
information and online ordering information are available on the website of
Harcourt Assessment, Inc. (2004b).
The BSSI consists of three parts: (1) Items 1 through 5 (i.e., attitudes toward
living and dying); (2) Items 6 through 19 (i.e., suicide ideation and anticipated
reaction of the ideation); and (3) Items 20 and 21 (i.e., the number of past suicide
attempts and the seriousness of intention in the last suicide attempt) (Stewart,
1998). Each item is formatted on a 3-point scale ranging from 0 to 2 in terms of
severity. The BSSI takes 5 to 10 minutes to complete. The BSSI is self-administered,
but a trained examiner can read the items aloud if necessary. Administration and in-
terpretation qualifications are Level C. Hand scoring and computer scoring are avail-
able. A total score is calculated, with a maximum of 42. However, because the test's
authors do not provide a cutoff score, an examiner should cautiously analyze a total
score and client responses to each item (called "critical item analysis") to examine
suicide risk (Stewart, 1998).
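The scoring arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the publisher's scoring procedure: the function name score_bssi, the list-of-integers input, and the choice to list every nonzero item for follow-up review are assumptions made for the example. Consistent with the text, no cutoff score is applied.

```python
def score_bssi(responses):
    """Sum 21 item ratings (each 0, 1, or 2) and list the nonzero items.

    Hypothetical helper for illustration only; it is not the publisher's
    scoring key and it applies no cutoff score.
    """
    if len(responses) != 21:
        raise ValueError("expected 21 item ratings")
    if any(r not in (0, 1, 2) for r in responses):
        raise ValueError("each rating must be 0, 1, or 2")
    total = sum(responses)  # possible range: 0 to 42
    # 1-based item numbers with any nonzero rating, for item-by-item review.
    items_to_review = [i for i, r in enumerate(responses, start=1) if r > 0]
    return total, items_to_review
```

The total alone is not interpreted here; the list of nonzero items is returned so that an examiner can review each response individually, in the spirit of the critical item analysis mentioned above.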
The normative sample for the BSSI consisted of 178 adults (126 inpatient and
52 outpatient clients) who were receiving psychiatric services and were identified as
suicide ideators. Although scores have been reliable only for the first 19 items, the
BSSI has high internal consistency and moderate test-retest reliability (Stewart,
1998). Also, the BSSI has good construct validity. For example, the BSSI was
significantly correlated with the SSI (Stewart, 1998). Although the normative sample
lacked adolescents, Steer, Kumar, and Beck (1993) demonstrated in their study using
adolescent inpatients that the BSSI was positively correlated with a history of a past
suicide attempt, the Beck Depression Inventory (BDI) (Beck et al., 1996), the Beck
Hopelessness Scale (BHS) (Beck & Steer, 1993), and the Beck Anxiety Inventory (BAI)
(Beck, Epstein, Brown, & Steer, 1988).
Professional counselors should consider using the BSSI to assess the suicide risk
of individuals who obtain a high score on the BHS, given that hopelessness may be
a significant suicide indicator for adolescents and adults, rather than depression and
anxiety (Beck et al., 1979; Steer et al., 1993).
Substance Abuse Subtle Screening Inventory-3 (SASSI-3)
The Substance Abuse Subtle Screening Inventory — 3 (SASSI-3) (Miller & Lazowski,
1999) is a self-report inventory used to assess the probability of substance dependence
(e.g., alcohol or other drugs of abuse) of individuals ages 18 years or older. An
adolescent version of the SASSI is also available. Paper-and-pencil record forms,
computer versions, audiotape versions for individuals with reading problems, and
the Spanish SASSI are available. Information on current cost and other SASSI prod-
ucts and online ordering information are available on the website of the SASSI
Institute (2004).
The SASSI-3 consists of two parts, each of which is printed on a separate side of
one test form. One part contains 67 items consisting of true-false questions regard-
ing substance dependence. The other part contains 26 items (12 for alcohol use and
14 for drug use) formatted on a Likert scale ranging from 0 (Never) to 4
(Repeatedly). For each of the Likert items, the client is asked to respond considering
one of the following four time periods: entire life, past 6 months, 6 months before a
critical event, or 6 months after a critical event. According to Miller (1997), the au-
thor of the SASSI-3, there were three main changes from the SASSI-2 that increased
accuracy: (1) A new scale, Symptoms (SYM), was created, which provides informa-
tion regarding the client's substance use and the environmental impact of substance
use on the client; (2) two items were eliminated because of reported discomfort by
some users; and (3) the four time periods mentioned above were added to the Likert
scale format. The SASSI-3 consists of 10 subscales and takes approximately 15 min-
utes to administer (for details of subscales, see Juhnke et al., 2006; Pittenger, 2003).
The subscales include Face Valid Alcohol, Face Valid Other Drug, Symptoms,
Obvious Attributes, Subtle Attributes, Defensiveness, Supplemental Addiction
Measure, Family versus Control Subjects, Correctional, and Random Answering
Pattern. Administration and interpretation are Level B (master's level in psychology,
counseling, or related fields, with certification or professional training in psycholog-
ical assessment). An examiner scores the SASSI-3 using a scoring key and obtains a
profile by plotting a raw score for each subscale; raw scores are converted into per-
centile ranks and T scores (M = 50; SD = 10). Interpretation of the results is done
according to decision rules provided in the test manual.
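The T score metric mentioned above (M = 50; SD = 10) can be illustrated with the standard linear transformation T = 50 + 10(X - M)/SD, where M and SD are the norm group's raw-score mean and standard deviation. The sketch below is a generic illustration under stated assumptions: the function name t_score and the norm values in the usage note are hypothetical, and actual SASSI-3 conversions should be taken from the profile sheet and manual rather than from this formula.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a linear T score (mean 50, SD 10).

    norm_mean and norm_sd describe the norm group's raw-score
    distribution; the values used with this helper are hypothetical.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd  # standard (z) score
    return 50 + 10 * z
```

For example, with a hypothetical norm mean of 8 and standard deviation of 4, a raw score of 12 lies one standard deviation above the mean and corresponds to a T score of 60.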
Some researchers investigated reliability and validity of SASSI-3 scores.
Lazowski, Miller, Boye, and Miller (1998) found high test-retest reliability, internal
consistency, and criterion-related validity. However, there are some mixed results
when using the SASSI-3 with special populations (e.g., clients who have a traumatic
brain injury). For example, Arenth, Bogner, Corrigan, and Schmidt (2001) reported
lower accuracy, sensitivity, and specificity in their study investigating the utility of
the SASSI-3 to diagnose chemical dependence for individuals with brain injury.
However, Arenth et al. concluded that the SASSI-3 was promising for individuals
with brain injury, given that substance abuse strongly affects brain injury. Finally, the
customer support from the SASSI Institute is excellent, often providing free profile
consultations using an 800 number.
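Accuracy, sensitivity, and specificity, the indices cited in the brain injury study above, are computed from the four cells of a screening-by-diagnosis table. The sketch below shows the standard definitions; the function name screening_indices and the counts in the usage note are hypothetical and are not data from any SASSI-3 study.

```python
def screening_indices(true_pos, false_pos, true_neg, false_neg):
    """Return sensitivity, specificity, and overall accuracy.

    Counts compare a screening result with an independent diagnosis:
    true_pos  = screened positive, diagnosis present
    false_pos = screened positive, diagnosis absent
    true_neg  = screened negative, diagnosis absent
    false_neg = screened negative, diagnosis present
    """
    with_dx = true_pos + false_neg
    without_dx = true_neg + false_pos
    if with_dx == 0 or without_dx == 0:
        raise ValueError("need at least one case with and one without the diagnosis")
    return {
        "sensitivity": true_pos / with_dx,      # proportion of true cases detected
        "specificity": true_neg / without_dx,   # proportion of non-cases cleared
        "accuracy": (true_pos + true_neg) / (with_dx + without_dx),
    }
```

With hypothetical counts of 40 true positives, 5 false positives, 45 true negatives, and 10 false negatives, sensitivity is 40/50, specificity is 45/50, and accuracy is 85/100.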
Eating Disorder Inventory-3 (EDI-3)
The Eating Disorder Inventory-3 (EDI-3) (Garner, 2004) is an effective self-report
inventory for assessing the attitudes, behaviors, and psychological traits related to
Anorexia Nervosa and Bulimia Nervosa for individuals ages 12 years or older. The
EDI-3 was revised from the original EDI published in 1984 and the EDI-2 (pub-
lished in 1991). Anorexia Nervosa contains symptoms such as refusal to maintain a
minimally normal body weight and fear of gaining weight, whereas Bulimia Nervosa
contains symptoms such as binge eating, self-induced vomiting, misuse of medica-
tions (e.g., diuretics, laxatives), and excessive exercise (APA, 2000). Paper-and-pen-
cil record forms and computer versions are available. Current cost information and
online ordering information are available on the website of Psychological Assessment
Resources, Inc. (2004b).
The EDI-3 contains 91 items, broken down into 12 scales (3 eating-disorder-
specific scales and 9 general psychological scales that are highly relevant to eating dis-
orders), each of which is formatted on a 4-point scale that helps to improve the reli-
ability of some of the scales and provides a wider range of scores. In addition, the
results yield six composite scores (Eating Disorder Risk, Ineffectiveness,
Interpersonal Problems, Affective Problems, Overcontrol, and General Psychological
Maladjustment) that are helpful when creating treatment plans, interventions, and
treatment monitoring. The EDI-3 takes approximately 20 minutes to complete.
Administration and interpretation qualification is Level A (4-year-college or univer-
sity level in psychology, counseling, or related fields with certification or professional
training in psychological assessment). Each subscale score is obtained by summing all
the scores for the subscale. Plotting each subscale score on a profile and comparing
the profile to norms yields the potential severity of an eating disorder. Norms are
available for (a) patients with Anorexia Nervosa — Restricting Type; (b) patients with
Anorexia Nervosa-Binge-Eating/Purging Type; (c) patients with Bulimia Nervosa
only; and (d) Eating Disorders Not Otherwise Specified (Psychological Assessment
Resources, Inc., 2004b).
Scores on the EDI-3 have been found to be reliable and valid. According to
the publisher (Psychological Assessment Resources, 2004b), moderate to high composite
reliabilities were reported for all the subscales except one (0.80s to 0.90s),
and test-retest reliability coefficients in the 0.90s were reported for most of the sub-
scales. Psychological Assessment Resources, Inc., reports that a relationship exists
between the EDI-3 and a wide variety of external instruments. With this new re-
vision, a Referral Form, which is a shortened form of the entire inventory, is in-
cluded. It is especially useful when trying to identify students who may be at risk
for eating disorders.
SUMMARY/CONCLUSION
Clinical assessment and proper diagnosis of mental disorders rely heavily on the
professional counselor's knowledge of the DSM-IV-TR multiaxial diagnostic system
and on the implementation of effective and efficient interviewing and clinical
testing procedures. This chapter has provided a wealth of introductory material to
orient the professional counselor to each of these essential dimensions.
Professional counselors generally make clinical decisions using either a statistical
model (based predominately on test scores) or a clinical judgment model (based
predominately on counselor experience). A great deal of helpful information can be
obtained from a clinical interview. Structured interviews ask a standard set of
questions and allow little variation from the standardized protocol. Such procedures often
result in similar conclusions by different counselors. Unstructured interviews have
no preset list of questions and allow maximum flexibility for counselor questioning
and follow-up. But this flexibility means that different professional counselors using
unstructured interviews frequently reach different conclusions. As a compromise,
semi-structured interviews use a standardized set of questions but allow the
professional counselor flexibility to pursue important information that falls outside of the
more structured format. Specialized types of interviews discussed in the chapter
include the intake interview and mental status exam.
Information about a client usually stems from four sources, which can
be recalled using the acronym LOST: life outcome data, observer ratings,
self-report ratings, and test data. The chapter also explored general procedures for
development of clinical and personality tests. Some tests are based on theories of
personality or clinical pathology, while others use empirical procedures such as factor
analysis or empirical-criterion keying. This chapter has provided an overview of
numerous clinical tests to familiarize the reader with instruments commonly used
by professional counselors.
KEY TERMS
clinical assessment
clinical judgment
DSM-IV-TR
empirical-criterion keying
Global Assessment of Functioning (GAF)
hyperactivity-impulsivity
hypothesis confirmation bias
inattention
intake interview
life outcomes
mental disorder
multiaxial classification system
observer rating
self-fulfilling prophecy
self-report ratings
semi-structured interview
statistical decision-making model
statistical models
structured interview
test data
True Response Inconsistency (TRIN) scale
unstructured interview
Variable Response Inconsistency (VRIN) scale
8
Personality Assessment
by Bradley T. Erford, Kathleen McNinch, and Carol Salisbury
This chapter addresses the basic knowledge and skills required for personality
assessment. Attention is given to trait approaches, especially the five-factor
model, and to personality instruments based on trait approaches. In addition,
an introduction to projective assessment is provided. Commonly used projective as-
sessments are discussed from a classification framework, including association, pic-
ture-story construction, verbal completion, choice arrangement, and production-
expression techniques.
WHAT IS PERSONALITY?
Some people are described as having so much personality that they "ooze" with it,
others as having "no personality at all." Still others are diagnosed with a "personality
disorder." So what is this thing that appears to be so important to people that the
services of professional counselors are sought to help assess, understand, and some-
times even restructure it? You may not find it hard to imagine that experts do not
agree on a definition of personality, what comprises it, or how best to measure it.
Some believe personality is an all-encompassing construct that accounts for all of an
individual's thoughts, feelings, and behaviors. Others view personality with a much
narrower focus. The unfortunate (or fortunate) thing about science is that in order
to study something, one needs to be able to define it. Since few agree on any one
definition, the authors have chosen one that makes sense and which can serve as a
springboard to a robust discussion on personality and its assessment.
Piedmont (1998) defined personality as an intrinsic, adaptive organizational
structure that is consistent across situations and stable over time. Note the four es-
sential facets of this definition. First, personality is intrinsic, meaning located within
the individual, not imposed on the individual by the environment. Second, personality
is an adaptive, organized structure that allows the individual to adjust (or not adjust)
to environmental, contextual demands. These demands are basically competing
needs and desires that may come from inside or outside of the individual. Third,
personality is consistent across situations; that is, one's personal goals and world view
remain fairly constant from one situation to the next, even though one's behaviors or
thoughts can be adapted in different ways. Finally, personality is stable over time. This
should not be understood to mean that personality does not change over time, for it
certainly does. But there is some lingering connection or thread that ties together
one's functioning during childhood, adolescence, and adulthood — consistent
themes, needs, and motivations.
Importantly, personality should not be viewed as being good or bad, because its
basic purpose is to help the individual adapt and survive in a given context.
Personality is a dynamic structure that is shaped and contoured over time to allow
the individual to adapt to environmental demands and contexts in such a way that
individual needs, desires, and motivations can be expressed. Just as in physical devel-
opment, one is born with an immature personality that grows over time and is in-
fluenced by culture and by environmental events. Personality helps one to perceive
and interpret both the internal and external world and to select goals to pursue.
Importantly, while personality does change over time, most of the change occurs
during childhood, adolescence, and young adulthood. Indeed, there is overwhelm-
ing evidence that one's personality is essentially stable by about the age of 30 years
(Piedmont, 2006), barring major transformative events (e.g., religious conversion,
significant trauma, intensive psychotherapy).
THE PURPOSE OF PERSONALITY ASSESSMENT
In general terms, the purpose of personality assessment is to help the professional
counselor and client understand the client's various attitudes, characteristics, inter-
personal needs, and intrinsic motivations in order to gain insight into current events,
activities, and conflicts and also to generalize this understanding to new situations
clients will encounter on their own, both now and in the future. In more specific
terms, personality assessment has the same purposes as most other types of assess-
ment, as discussed in Chapter 1: screening, diagnosis, placement, treatment plan-
ning, and outcomes evaluation. While diagnosis may seem out of place in the con-
text of personality as defined above, one should bear in mind the existence of
personality disorders. Personality assessment can play a crucial role in identifying in-
dividuals with some personality disorders. Professional counselors must be cognizant
of which purpose is being pursued, because of all the types of assessment instruments
available to professional counselors, structured and unstructured personality instru-
ments have the widest variability in terms of psychometric quality and usefulness;
that is, some are extremely well developed and well studied, while others lack virtually
any empirical support or rigor. As a result, experienced clinicians approach the
task of personality assessment with great seriousness and caution.
The two most common approaches to personality assessment are the (struc-
tured) trait approach and the (unstructured) projective approach. The discussion of
each approach and commonly used tests based on each approach make up the re-
mainder of this chapter.
TRAIT APPROACHES TO PERSONALITY ASSESSMENT
Most personality tests measure traits or states (many measure both, of course), and
it is sometimes helpful to consider traits and states as two ends of the same contin-
uum. Traits are enduring, statistically derived dimensions used to explain personal-
ity characteristics (e.g., introversion, agreeableness), while states are generally more
transient or situation-dependent facets of personal adjustment (e.g., anxiety, self-
confidence). Some measures, such as the State-Trait Anxiety Inventory for Children
(Spielberger, 1973), aim to differentiate between the presence and importance of
these two ends of the continuum. Client states are important for professional coun-
selors to understand. They are often relevant to clinical diagnosis and often serve as
the impetus for clients to actually seek counseling services. For example, many clients
endure a life of anxiety or sadness but will only seek treatment when they experience
a panic attack or major depressive episode. Acute anxious or depressive reactions are
(generally) short-lived occurrences that result from situational events and/or internal
physiology, not long-term conditions that stem from personality characteristics.
Thus states are important, but because of their unpredictability and transience, they
provide little help to clients and professional counselors who seek to understand and
predict a client's likely pattern of cognitive, affective, and behavioral functioning.
Thus most structured personality assessment deals with the identification of the
more enduring personality traits to understand and predict human behavior.
Unfortunately, social scientists who study traits disagree on a standard defini-
tion to about the same degree that they disagree on a definition of personality.
Personality traits are certainly not physical structures, although pseudoscientific ap-
proaches during the past several centuries have espoused just that. For example, phys-
iognomy is the study of personality through determining a person's physical charac-
teristics. Thus the shape of one's nose may be used to determine personality
characteristics: A pointed nose resembling a dog's snout would represent tenacity and
faithfulness, and a large, rounded nose resembling a pig's snout would represent
slovenly, piggish characteristics (Sax, 1997). Phrenology was a 19th-century system
for studying the physical characteristics of the skull (i.e., protrusions or depressions),
which were believed connected to functions within the brain. This theory espoused
that the brain center responsible for a specific ability would "grow out" (i.e., pro-
trude) when highly developed, or "sink in" (i.e., depress) when underdeveloped.
Thus phrenologists of that era were quite confident that they could identify abilities
such as concentration and secretiveness, as well as several dozen other characteristics.
Additional pseudoscientific approaches include numerology, astrology, and palm-
istry. None has received support from the scientific community.
While the study of traits has a long pseudoscientific history, traits have been
studied scientifically for only a little more than half a century. In the historical
evolution of our understanding of traits, Gordon Allport (1937) attempted to un-
derstand traits as rational dimensions that underlie the thousands of words people
use to describe each other. In one study, Allport & Odbert (1936) searched the dic-
tionary for descriptive words and identified more than 18,000 words that could de-
scribe human personality characteristics. They next whittled that list down to about
4,500 by eliminating synonyms and by retaining descriptors of stable characteristics
(remember, traits are enduring). But 4,500 is still a huge number of personality
traits. The advent of new statistical techniques (i.e., factor analysis) and high-speed
computers spurred further attempts to identify and understand the number of di-
mensions, or component traits, that underlie personality. Today, there are hundreds
of personality tests that purport to measure one or more personality traits. But until
recently, there was little agreement over the number of factors or traits that explained
human personality. For example, Cattell, Cattell, and Cattell (1993) developed the
16 Personality Factors inventory (16PF). Others have determined that more than 100
personality traits may exist.
However, recent well-designed research and instrumentation by Costa &
McCrae (1990, 1992) have helped to integrate much of the disparate research on
personality traits conducted over the past half century into a model with substantial
empirical support: The five-factor model (FFM). Costa & McCrae (1990, p. 23) de-
fined traits as "dimensions of individual differences in tendencies to show consistent
patterns of thoughts, feelings, and actions." There are two key parts to this defini-
tion. First, traits are dimensions, which are empirically verifiable concepts organiz-
ing human behavior along a continuum. Second, individuals differ or vary accord-
ing to how much or how little of a particular trait they may possess. It is these
differences in traits, then, that describe an individual's "personality." Costa and
McCrae identified just five primary traits along which individuals differ, not dozens or
hundreds: Neuroticism, Extraversion, Openness, Agreeableness, and
Conscientiousness. For example, the trait of Extraversion involves the intensity of
interpersonal relationships. An individual can be described as introverted (i.e., shy,
aloof, withdrawn) on one end of the continuum, extraverted (i.e., sociable, outgoing,
adventurous, enthusiastic) on the other end of the continuum, or somewhere in be-
tween (i.e., ambiverted). Most importantly, the amount of the trait an individual
possesses can be measured and compared to some norm group to determine whether
the individual displays an average, significantly higher, or significantly lower amount
of the trait than other individuals with like characteristics (e.g., age, sex). The
amount of a trait a client exhibits helps professional counselors understand and pre-
dict client actions now and in the future.
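The norm-group comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name compare_to_norm is hypothetical, the one-standard-deviation threshold for labeling a score as higher or lower than average is an arbitrary choice for the example rather than a rule from any test manual, and the percentile assumes that scores in the norm group are normally distributed.

```python
from statistics import NormalDist


def compare_to_norm(score, norm_mean, norm_sd, threshold_sd=1.0):
    """Locate a trait score relative to a norm group.

    Returns (z, percentile, label). The threshold_sd default is an
    illustrative assumption, not a published decision rule.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (score - norm_mean) / norm_sd
    # Percentile rank under an assumed normal distribution of norm scores.
    percentile = NormalDist().cdf(z) * 100
    if z >= threshold_sd:
        label = "higher than average"
    elif z <= -threshold_sd:
        label = "lower than average"
    else:
        label = "average"
    return z, percentile, label
```

In practice the norm mean and standard deviation would come from the subgroup that matches the client's characteristics (e.g., age, sex), as the paragraph above notes.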
Costa and McCrae and other researchers have accumulated substantial evidence
that these factors can be found on most multifaceted personality inventories available
today (see Piedmont, 2006). The FFM has emerged as a fairly comprehensive taxon-
omy, useful in classifying and understanding personality traits. The FFM traits and
facets are closely aligned with those of the Revised NEO Personality Inventory (NEO-PI-R)
(Costa & McCrae, 1992), which will also be reviewed later in this chapter.
Because traits are often described as existing on a continuum (e.g., introversion-
extraversion, agreeable-disagreeable, conscientiousness-carelessness), some researchers
and test developers have found it helpful to juxtapose these continua in order to
categorize or label people according to some typology. For example, juxtaposing the
Extraversion and Neuroticism traits results in four "types" of clients. A client who is
high on both traits (i.e., high extraversion, high neuroticism) may be hot tempered,
impulsive, or easily influenced. Someone who is low on both traits (i.e., low extraver-
sion, low neuroticism) may be calm, impassive, and reliable. One who is high on ex-
traversion and low on neuroticism may be easygoing, talkative, and optimistic. One
who is low on extraversion and high on neuroticism may be pessimistic, sad, and
sober. Note the consistent use of the phrase "may be," for these characteristics are cer-
tainly not representative of all individuals of a given type under all circumstances. Still,
research (and common sense) indicates that the more of a given trait one possesses, the
more stable the categorization, and the greater the predictive validity.
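The four-way typology described above amounts to a two-by-two lookup. The sketch below assumes, purely for illustration, that both traits are expressed as T scores and that "high" means at or above a cut of 50; neither the cut nor the function name two_trait_type comes from any published instrument. The descriptions are the ones given in the text, each deliberately phrased with "may be."

```python
def two_trait_type(extraversion_t, neuroticism_t, cut=50):
    """Map two trait scores onto the four Extraversion x Neuroticism types.

    The T-score inputs and the cut of 50 are illustrative assumptions.
    """
    high_e = extraversion_t >= cut
    high_n = neuroticism_t >= cut
    descriptions = {
        (True, True): "may be hot tempered, impulsive, or easily influenced",
        (False, False): "may be calm, impassive, and reliable",
        (True, False): "may be easygoing, talkative, and optimistic",
        (False, True): "may be pessimistic, sad, and sober",
    }
    return descriptions[(high_e, high_n)]
```

A single cut treats a score of 51 the same as a score of 75, which is one reason the text stresses that categorization becomes more stable the farther a score lies from the middle of the continuum.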
While juxtaposing two or more continua can be done with virtually any set of
traits, some tests and theories are predicated on such a system. For example, the
Myers-Briggs Type Indicator Form M (MBTI) (Myers, McCaulley, Quenk &
Hammer, 1998), a very commonly used personality inventory, was based upon the
theory of Carl Jung (1923). With the exception of the MBTI, the development and
use of tests based on typologies has been on the decline over the past several decades,
ostensibly due to increased societal sensitivity to stereotyping of people. Likewise,
numerous cautionary chimes have been sounded regarding the potential dangers of
using personality instruments with clients from culturally diverse backgrounds
(Anderson, 1995; Campos, 1989; Hinkle, 1994). In the final analysis, the focus
among structured personality assessment today is firmly on the objective measure-
ment and analysis of personality traits for their descriptive and predictive value.
Strengths and Limitations of the Trait Approach
Traits have substantial potential value when used judiciously by professional counselors.
Piedmont (2006) suggested that professional counselors can use trait approaches
in six primary ways: (1) understanding the client; (2) making differential
diagnoses; (3) establishing empathy and rapport; (4) giving feedback and insight; (5)
anticipating the course of therapy; and (6) matching treatments to clients.
Structured trait approaches to personality assessment have several noteworthy
strengths. Trait inventories are relatively easy to administer, score, and interpret, ei-
ther by hand or by computer. Most trait inventories are also norm referenced, allow-
ing comparison of an individual's scores to a norm group. This allows examiners to
determine whether clients have an average amount of a given trait, higher than av-
erage amounts, or lower than average amounts. Remember that knowing how much
of a given trait an individual possesses is often useful in predicting client actions and
outcomes.
Perhaps the greatest strength of trait approaches to personality assessment is that
they focus on normal, healthy personality functioning, not just the clinical or patho-
logical aspects of personality. In this way, they help us to understand a client's
strengths and protective factors, rather than providing a myopic focus on a client's
weaknesses and vulnerabilities.
Because traits are empirically derived constructs, they actually do exist in nature
and can be observed and measured reliably. Traits also usually have robust predictive
validity that can be empirically verified. In fact, research on the FFM has shown
traits can predict a significant amount of variance across a wide range of clinical out-
comes. Thus professional counselors can rely on knowledge of client traits to develop
rapport, communicate in the most effective therapeutic manner, and, in general,
structure treatment in the most efficacious manner.
Trait inventories are also amenable to computer scoring and interpretation,
which can save professional counselors time and clients money. The standardized
programming of computerized reports also tends to minimize scoring errors and ex-
aminer bias in judgment and interpretation. In addition, predictions and narrative
written into the program usually are based on empirical evidence. This is in contrast
to constructed commentary by examiners who vary substantially in experience and
expertise. On the flip side, computer programs are frequently criticized for promot-
ing a loss of individuation (i.e., every report sounds the same). Because examiners
almost never have access to the programming language, it is usually impossible to
evaluate the source and veracity of narrative statements generated by the report, or
even the standard scores derived by internal scoring and conversion programs (Note:
Fortunately, norm tables for most computerized interpretive tables are still published
in hard-copy formats so clinicians can verify score accuracy by hand if necessary).
Finally, given the boilerplate statements generated by many computerized programs,
some professional counselors may question the accuracy of interpretive statements
for the actual client being assessed. Several of the tests reviewed below and in the pre-
vious section have examples of computer-generated reports.
While very helpful, trait approaches do not escape substantial criticism. Some of
the criticism is more theoretical or philosophical, while some involves more practi-
cal aspects. In regard to the theoretical and practical issues, some question how use-
ful and helpful descriptions of personality can possibly be without some overriding
theory to hold them together and bring meaning in some holistic manner. Indeed,
little explanation or rationale has been offered as to why the traits even exist, how
they develop and become differentiated over time, or even the degree to which each
is genetically determined or environmentally influenced. On a more philosophical
level, trait approaches are sometimes criticized for being tautological (redundant) in
nature; that is, we know that outgoing, energetic, and sociable people are extraverted
because extraverted people are outgoing, energetic, and sociable (Piedmont, 2006).
Another criticism is that different models predict different numbers of primary
traits. While this may be expected on the basis of one's theoretical orientation, please
recall that there is no theoretical orientation. These models are statistically derived
and subjected to empirical validation (i.e., "I exist (statistically); therefore I am"). Much
of the recent evidence supports the five-factor model. But are there more than five
factors? Costa and McCrae do not deny the possibility, and a research associate of
theirs, Ralph Piedmont (2006), has identified a sixth factor, spirituality, using the
same methodology that Costa and McCrae used to derive the original five factors. A
holistic, integrative explanation based in theory is a critical next step in making trait
approaches more explanatory (note the tautological emphasis).
There are several criticisms of trait approaches grounded more in the realm of
pragmatics. First, self-report instruments usually measure only superficial portions of
personality functioning that a client or observer of the client could also readily
identify through an effective interview process. In a related criticism, trait approaches often
lack the explanatory depth of projectives (psychoanalysis) and provide less insight into
the client's internal world. Relatedly, professional counselors must ensure that all per-
sonality assessment is conducted according to the highest degree of ethical practice
and guard against an invasion of privacy or inappropriate disclosure of information to
others who may misunderstand or misuse the results (e.g., discriminate against clients
with "undesirable" characteristics by limiting their opportunities).
Finally, a primary criticism continues to be that self-report trait inventories are
a relatively transparent means of obtaining information about clients. As such, trait-
based inventories are susceptible to client response sets and faking (e.g., acquies-
cence, nonacquiescence, malingering, socially desirable responses). It is inevitable
that some clients will answer in a guarded manner, while others will be too self-crit-
ical. More and more structured inventories are including validity scales to allow pro-
fessional counselors to identify clients who may be presenting with a response set
that could invalidate interpretations.
SOME COMMONLY USED STRUCTURED PERSONALITY
ASSESSMENT INVENTORIES
Revised NEO Personality Inventory (NEO-PI-R)
The Revised NEO Personality Inventory (NEO-PI-R) (Costa & McCrae, 1992) is a
240-item inventory designed to measure the five major dimensions of personality
and is best used as a basic research instrument (Botwin, 1995; Digman, 1990;
Goldberg, 1992; Piedmont, 2006). The NEO-PI-R usually requires about 25 to 35
minutes for an adult to complete, and hand scoring can be done quickly. Scale items
measure Neuroticism, Extraversion, Openness to Experience, Agreeableness, and
Conscientiousness, and each of these scales has six subscales (Botwin, 1995). Table
8.1 contains factor facets and descriptions from the NEO-PI-R (Costa & McCrae,
1992). These scales use both a self-report and an observer-rater form and can be in-
dividually or group administered. Scores are derived from a 5-point Likert scale
ranging from Strongly Agree (1) to Strongly Disagree (5), and are translated into T
scores for interpretation. Sample items include "Watching sports bores me," "I often
feel calm and relaxed," and "It is easy for me to take charge of situations."
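The translation of raw scores into T scores can be illustrated with a brief sketch. T scores are standard scores with a mean of 50 and a standard deviation of 10, so a raw score is first expressed as a z score relative to the normative group. The function name and the normative mean and standard deviation used below are hypothetical and are not taken from the NEO-PI-R manual, which supplies its own norm tables.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw scale score to a T score (mean 50, SD 10).

    norm_mean and norm_sd describe the normative sample; the values
    used in the example below are hypothetical, not published norms.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return 50 + 10 * z


# Hypothetical example: a raw score one standard deviation above the norm mean.
print(t_score(100, 80, 20))  # 60.0
```

A client whose raw score equals the normative mean receives a T score of 50, and each standard deviation above or below the mean adds or subtracts 10 points.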
The self-rating, stratified sample consisted of 500 men and 500 women
(screened from a larger pool of 2,273 people) and was selected demographically to
match 1995 U.S. Census projections. The attention to sample selection is an im-
provement over the NEO-PI (Botwin, 1995). Observer rating norms were obtained
from 143 ratings of 73 men and 134 ratings of 69 women from both spouses and
multiple peer ratings (Costa & McCrae, 1992; Piedmont, 2006). Internal consisten-
cies for individual facet scales ranged from r = 0.56 to r = 0.81 in self-reports and
from r = 0.60 to r = 0.90 in observer ratings (Costa & McCrae, 1992). Test-retest
reliabilities for facet scales on the original NEO ranged from r = 0.66 to r = 0.92
(McCrae & Costa, 1983). The NEO-PI-R correlated with similar scales, and con-
struct, convergent and divergent validity were found to be adequate.
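Internal consistency coefficients such as those reported above are commonly estimated with coefficient alpha, computed as k/(k - 1) times (1 minus the sum of the item variances divided by the variance of the total scores), where k is the number of items. The following is a minimal sketch of that formula; the function name and the response data are hypothetical, and the sketch is not the procedure documented in any test manual.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents' item-score lists.

    responses[i][j] is respondent i's score on item j (hypothetical data).
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    items = list(zip(*responses))                      # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses from five people to a four-item scale.
data = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(data), 2))
```

Alpha rises as items covary more strongly with one another, which is why short or heterogeneous facet scales tend to produce lower coefficients than full domain scales.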
Table 8.1 NEO-PI-R descriptions of traits and facets
Domains
N: Neuroticism             General tendency to experience negative affects
E: Extraversion            Sociability, assertiveness, activeness, talkativeness
O: Openness                Active imagination, aesthetic sensitivity, attentiveness to inner feelings, preference for variety, intellectual curiosity, independence of judgment
A: Agreeableness           Interpersonal tendencies, altruism, sympathy, eagerness to help
C: Conscientiousness       Control of impulses, management of desires
Neuroticism facets
N1: Anxiety                Apprehensive, fearful, prone to worry, nervous, tense, jittery
N2: Angry Hostility        Tendency to experience anger and related states
N3: Depression             Tendency to experience depressive affect
N4: Self-Consciousness     Emotions of shame and embarrassment, uncomfortable around others
N5: Impulsiveness          Inability to control cravings and urges
N6: Vulnerability          Vulnerability and inability to cope with stress
Extraversion facets
E1: Warmth                 Issues of interpersonal intimacy
E2: Gregariousness         Preference for other people's company
E3: Assertiveness          Tendency toward dominance, forcefulness, and social ascendancy
E4: Activity               Tendency toward rapid tempo and vigorous movement (energy)
E5: Excitement seeking     Tendency to crave excitement and stimulation
E6: Positive emotions      Tendency to experience positive emotions
Openness facets
O1: Fantasy                Intensity of imagination and fantasy life
O2: Aesthetics             Appreciation for and interest in art and beauty
O3: Feelings               Openness to feelings, receptivity to one's own inner feelings, evaluation of emotion as an important part of life
O4: Actions                Behavioral willingness to try different activities, etc.
O5: Ideas                  Intellectual curiosity, open-mindedness, willingness to consider new things, ideas
O6: Values                 Readiness to reexamine social, political, and religious values
Agreeableness facets
A1: Trust                  Tendency to trust or distrust others
A2: Straightforwardness    Frankness, sincerity, and ingenuousness relative to others
A3: Altruism               Concern for others' welfare, generosity, consideration of others
A4: Compliance             Characteristic reactions to interpersonal conflict
A5: Modesty                Humbleness, self-efficacy
A6: Tender-mindedness      Attitudes of sympathy and concern for others
Conscientiousness facets
C1: Competence             Sense that one is capable, sensible, prudent, and effective
C2: Order                  Tidiness, level of organization
C3: Dutifulness            Governed by conscience
C4: Achievement striving   Levels of aspiration and hard work toward goals
C5: Self-discipline        Ability to begin tasks and carry them through to completion
C6: Deliberation           Tendency to think carefully before acting
Source: Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual by P. T. Costa Jr. & R. R.
McCrae (1992). Odessa, FL: Psychological Assessment Resources.
Think About It 8.1 Using Table 8.1, describe your own personality
using the five-factor model.
16 Personality Factors (16PF) Questionnaire
The 16PF Questionnaire (Cattell et al., 1993) is a 185-item self-report inventory for
clients ages 16 years to adult and is designed to measure normal personality character-
istics, problem-solving abilities, and preferred work activities and to identify problems
in areas known to be problematic to adults. Items of the 16PF measure Anxiety,
Extraversion, Independence, Self-Control, and Tough-Mindedness (Erford, 2006)
and can be used to predict vocational interest as classified by Holland's occupational
typology (Conn & Rieke, 1994). The 16PF may prove helpful as a career counseling
tool and as a work behavior and work attitude device (Vansickle & Conn, 1996).
Administration of the 16PF requires a 5th-grade reading level and can be conducted
for individuals or groups by paper and pencil in 30 to 50 minutes, or in 25 to 35 min-
utes by computer (Russell & Karol, 1994). Scoring can be done by hand using four
scoring keys, a norm table, and an Individual Record form, or by computer through
a mail-in scoring service or the Institute for Personality and Ability Testing's (IPAT)
OnSite System software. Raw scores are converted into standardized (sten) scores that
are based on a 10-point scale (M = 5.5; SD = 2) (Russell & Karol, 1994). Sample items
include "I often like to watch team games, a) true; b) false," and "I prefer friends who
are: a) quiet; b) ?; c) lively." A portion of a sample computerized 16PF Basic
Interpretive Report from IPAT is provided in Table 8.2. Professional counselors may
also be interested in the Karson Clinical Report (KCR) and Cattell Comprehensive
Personality Interpretation (CCPI). Sample reports can be viewed at www.ipat.com.
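Because sten scores have a mean of 5.5 and a standard deviation of 2, the conversion from a raw score can be sketched as sten = 5.5 + 2z, rounded to a whole number and limited to the range 1 to 10. The sketch below is illustrative only: the function name, the normative values, and the handling of scores that fall exactly on a boundary are assumptions, and actual 16PF scoring relies on the published norm tables rather than on a formula.

```python
import math


def sten_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a sten (standard ten) score, 1 through 10.

    Uses sten = 5.5 + 2z rounded half up, which equals floor(2z + 6).
    norm_mean and norm_sd are hypothetical normative values.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    sten = math.floor(2 * z + 6)
    return max(1, min(10, sten))  # clip extreme scores to the 1-10 scale


# Hypothetical scale with a norm mean of 30 and SD of 5.
print(sten_score(30, 30, 5))  # 6
print(sten_score(20, 30, 5))  # 2
```

Each sten spans half a standard deviation, so stens of 5 and 6 straddle the normative mean.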
The stratified normative sample (n = 2,500) consisted of approximately equal
numbers of males and females from every U.S. state and the District of Columbia,
closely representing the demographic variables of gender, race, age, and education in
the 1990 U.S. census. Reliability reports of scores on the 16PF are low, with only
the Social Boldness scale consistently above r = 0.80 (Erford, 2006). Clinicians
should be cautious when using this inventory for high school graduates and people
over age 65, because these were underrepresented in the normative sample
(McLellan, 1995). While the 16PF may prove helpful in developing or confirming
hypotheses about client personality characteristics, score reliability and validity are
generally inadequate for decision-making purposes, unless used in conjunction with
multiple sources of information.
One of the primary criticisms of the 16PF continues to be the identification of
too many primary factors (Chernyshenko, Stark, & Chan, 2001; Digman & Inouye,
1986), and second-order factor analytic studies indicate that about 4 to 6 factors explain
the items' variance to a more substantial degree; after all, many of the 16 factors
are highly intercorrelated. The addition of impression management scales is a
benefit in interpretation (Schueger, 1992).
Table 8.2 16PF Basic Interpretive Report for a 33-year-old female.
RESPONSE STYLE INDICES
Index                    Raw Score
Impression Management    19           within expected range
Infrequency                           within expected range
Acquiescence             51           within expected range
All response style indices are within the normal range.
16PF PROFILE
Sten scale: 1 to 3 = Low; 4 to 7 = Average; 8 to 10 = High
Sten   Factor                     Left meaning          Right meaning
6      Warmth (A)                 Reserved              Warm
9      Reasoning (B)              Concrete              Abstract
7      Emotional Stability (C)    Reactive              Emotionally Stable
6      Dominance (E)              Deferential           Dominant
5      Liveliness (F)             Serious               Lively
6      Rule-Consciousness (G)     Expedient             Rule-Conscious
8      Social Boldness (H)        Shy                   Socially Bold
7      Sensitivity (I)            Utilitarian           Sensitive
4      Vigilance (L)              Trusting              Vigilant
7      Abstractedness (M)         Grounded              Abstracted
4      Privateness (N)            Forthright            Private
6      Apprehension (O)           Self-Assured          Apprehensive
9      Openness to Change (Q1)    Traditional           Open to Change
4      Self-Reliance (Q2)         Group-Oriented        Self-Reliant
4      Perfectionism (Q3)         Tolerates Disorder    Perfectionistic
6      Tension (Q4)               Relaxed               Tense
GLOBAL FACTORS
Sten   Factor              Left meaning     Right meaning
7      Extraversion        Introverted      Extroverted
5      Anxiety             Low Anxiety      High Anxiety
2      Tough-Mindedness    Receptive        Tough-Minded
7      Independence        Accommodating    Independent
5      Self-Control        Unrestrained     Self-Controlled
TOUGH-MINDEDNESS
Tough-Mindedness is low. Ms. Female tends to value breadth and variety of experience, including openness to different ideas,
people, or situations. When approaching problems, she may focus on subjective or emotional considerations rather than cold,
hard facts.
■ Ms. Female can be sensitive to emotional and aesthetic considerations.
■ She often gets absorbed in ideas and thoughts.
■ She is open to change and enjoys pursuing new ideas, opinions, and experiences.
EXTRAVERSION
Extraversion is high-average. Ms. Female is socially participative and probably enjoys activities involving others. Her attention is
generally directed toward other people.
■ Because this person is often socially bold, she is unlikely to feel intimidated in group settings. She may be relatively unaffected
by insults or threats.
■ When Ms. Female chooses to reveal personal matters to others, she tends to be forthright and genuine.
■ Ms. Female shows a tendency to do things and make plans with others rather than alone.
INDEPENDENCE
Independence is high-average. Generally, Ms. Female prefers to lead an independent and self-directed life. Although she can
sometimes be accommodating to others' wishes, she may often assert control or be persuasive.
■ This person is venturesome and expressive, especially in front of others. Extreme boldness sometimes can be associated with a
high desire for influence and attention.
■ Vigilance does not appear to shape her stance on influencing or persuading others. She tends to trust other people's
motivations rather than to question them.
■ She is experimenting and has an inquiring, critical mind. She tends to question traditional methods and to press for new
approaches.
ANXIETY
At the present time, Ms. Female presents herself as no more or less anxious than most people.
■ Usually, Ms. Female meets challenges with calm and inner strength.
■ She shows a tendency to be trusting and accepting of other people and their motives.
SELF-CONTROL
Self-Control is average. At times, Ms. Female may show the self-discipline and conscientiousness needed to meet her
responsibilities. At other times, she may be less restrained, following her own wishes.
■ Because this individual tends to be preoccupied with ideas, she may disregard the practical aspects of a situation.
■ This individual seems to balance casualness and a tolerance for disorder with the need for organization and structure. She may
function best in an unexacting, flexible setting rather than in a rigid system.
SELF-ESTEEM AND ADJUSTMENT
Overall, this individual tends to view herself positively, having a strong sense of self-worth and competence. She is likely to be
capable of obtaining most of her personal goals. Self-Esteem is high-average (7).
The degree of emotional stability shown by Ms. Female is typical of most adults. That is, most of the time she tends to be
calm and relaxed, but in demanding situations, she may be reactive or upset. Emotional Adjustment is average (6).
Not only is Ms. Female likely to feel quite comfortable in social gatherings, but she may initiate contact, lead conversations,
and draw attention to herself. She probably will not hesitate to express what she needs from others. Social Adjustment is high (8).
SOCIAL SKILLS
The following six scales pertain to the ways in which information is communicated in social environments. The scales are broadly
divided into two categories: nonverbal communication (Emotional Scales) and verbal communication (Social Scales). Within
each category, communication skills are discussed at three more specific levels: the ability to send information (Expressivity), to
receive and interpret messages (Sensitivity), and to control information (Control). Although a person may be more or less skilled
in certain areas, overall social competence is reflected in a general balance among the six scales below.
Ms. Female's communication is predicted to be demonstrative and forceful. That is, her emotional displays are probably
uninhibited and genuine. Her emotions are likely to be easily perceived by others, and thus are likely to influence the emotional
states of those around her. Emotional Expressivity is high (8).
This person may enjoy observing other people's gestures, moods, and nonverbal interactions. Thus, she may feel comfortable
interpreting people's emotional and other nonverbal messages. Emotional Sensitivity is high-average (7).
At times, Ms. Female may adapt her emotional displays to the given situation. At other times, she may be unable to suppress
a strongly felt emotion. Emotional Control is average (5).
This person is probably outgoing and articulate and would often make a good first impression. She may feel comfortable
with verbal disclosure and could probably join in most discussions with relative ease. Social Expressivity is high-average (7).
Ms. Female may not be very concerned about monitoring or interpreting others' social behavior or mannerisms. Ms.
Female's self-comfort may mean that she is not overly concerned about the appropriateness of her own actions. Social Sensitivity
is low-average (4).
This person projects a comfortable social presence. That is, she probably presents herself well in just about any type of social
situation and is likely to participate with any social group. She may consider the appropriateness of when to speak up and when
to withhold comment according to the demands of a given situation. Social Control is high (9).
This person is attentive to other people and is likely to be sensitive to their feelings. She is probably willing to consider
another person's point of view. As a consequence, others may seek her out for sympathy and support. Ms. Female should be
careful not to allow the problems of others to override her own. Empathy is high (8).
LEADERSHIP AND CREATIVITY
In a group of peers, potential for leadership is predicted to be average (6).
At the client's own level of abilities, potential for creative functioning is predicted to be high (8). She probably has the sense
of adventure, assertiveness, and orientation toward ideas that are necessary for pursuing creative interests.
Ms. Female shows characteristics somewhat similar to persons who invest a lot of time producing novel or original works.
Should this individual choose to pursue creative endeavors, her rate of output is predicted to be above average (7).
VOCATIONAL ACTIVITIES
Different occupational interests have been found to be associated with different personality qualities. The following section
compares Ms. Female's personality to these known associations. The information below indicates the degree of similarity between
Ms. Female's personality characteristics and each of the six Holland Occupational Types (Self-Directed Search; Holland, 1985).
Those occupational areas for which Ms. Female's personality profile shows the highest degree of similarity are described in greater
detail. Descriptions are based on item content of the Self-Directed Search as well as the personality predictions of the Holland
types as measured by the 16PF.
Remember that this information is intended to expand Ms. Female's range of career options rather than to narrow them. All
comparisons should be considered with respect to other relevant information about Ms. Female, particularly her interests,
abilities, and other personal resources.
HOLLAND THEMES
Sten   Factor
9      Artistic
7      Investigative
7      Social
6      Enterprising
5      Realistic
4      Conventional
Artistic = 9
Ms. Female shows personality characteristics similar to Artistic persons, who are self-expressive, typically through a particular
mode such as art, music, design, writing, acting, composing, etc. Like Artistic persons, Ms. Female may be venturesome and open
to different views and experiences. Sometimes she may be preoccupied with thoughts and ideas, which may relate to the overall
creative process. She may do her best work in an unstructured, flexible environment. It may be worthwhile to explore whether
Ms. Female appreciates aesthetics and possesses artistic, design, or musical talents.
Occupational Fields: Art
Music
Design
Theater
Writing
Investigative = 7
Ms. Female shows personality characteristics similar to Investigative persons. Such persons typically have good reasoning ability
and enjoy the challenge of problem solving. They tend to have critical minds, are curious, and are open to new ideas and
solutions. Investigative persons tend to be reserved and somewhat impersonal; they may prefer working independently. They tend
to be concerned with the function and purpose of materials rather than aesthetic principles. Ms. Female may enjoy working with
ideas and theories, especially in the scientific realm. It may be worthwhile to explore whether Ms. Female enjoys doing research,
reading technical articles, or solving challenging problems.
Occupational Fields: Science
Math
Research
Medicine and Health
Computer Science
Social = 7
Ms. Female shows personality characteristics similar to Social persons, who indicate a preference for associating with other
people. Such interactions are distinguished by a nurturing, sympathetic quality. Ms. Female may find it very easy to relate to all
kinds of people. In addition to being warm and friendly, Social persons are typically receptive to different views and opinions.
They feel most comfortable in positions that allow for regular social interaction. It might be worthwhile to explore whether Ms.
Female enjoys working with others and having them seek her out for advice or comfort.
Occupational Fields: Teaching
Counseling
Psychology
Social Work
Health Services
Source: Copyright © 1994, The Institute of Personality and Ability Testing, Inc., Champaign, IL. All rights reserved. Reproduced with permission
of the Institute of Personality and Ability Testing, Inc.
Note: The original 16PF Basic Interpretive Report included graphical score displays for each interpreted factor. These graphs have been removed to
conserve space. The 16PF Basic Interpretive Report usually generates a 10-page report.
Myers-Briggs Type Indicator-Form M (MBTI)
The Myers-Briggs Type Indicator-Form M (MBTI) (Myers, McCaulley et al., 1998) is
a 93-item self-report inventory for clients ages 14 years and older. Based on Jungian
theory, items measure four different bipolar continua: Extraversion-Introversion
(E-I), Sensing-Intuition (S-N), Thinking-Feeling (T-F), and Judging-Perceiving
(J-P). These scales result in four-letter combinations that identify and describe 16
personality types (see Table 8.3). Sample items include "Are you: easy to get to know, or
hard to get to know?" and "Can you: talk easily to almost anyone for as long as you
have to, or find a lot to say only to certain people or under certain conditions?" The
MBTI requires a 7th-grade reading level and takes about 15 to 25 minutes to administer.
This inventory can be hand-scored or computer-scored. Forced-choice items
produce responses that are weighted in points. The normative sample (n = 3,009)
consisted of U.S. adults ages 18 years and older, generally representing sex and ethnicity
consistent with the 1990 U.S. Census, although White women were overrepresented
and Black men were underrepresented (Myers, McCaulley, et al., 1998).
Table 8.3 Examples of associated traits with MBTI typologies
Example Typology 1: Introverted-Intuition-Thinking-Judging (INTJ)
Have original minds and great drive for implementing their ideas and achieving their goals.
Quickly see patterns in external events and develop long-range explanatory perspectives. When
committed, organize a job and carry it through. Skeptical and independent, have high standards
of competence and performance - for themselves and others.
Example Typology 2: Extroverted-Sensing-Feeling-Perceiving (ESFP)
Outgoing, friendly, and accepting. Exuberant lovers of life, people, and material comforts. Enjoy
working with others to make things happen. Bring common sense and a realistic approach to
their work, and make work fun. Flexible and spontaneous, adapt readily to new people and
environments. Learn best by trying a new skill with other people.
Source: Introduction to type (6th ed.) by I. B. Myers, L. K. Kirby, & K. D. Myers, (1998), p. 13. Palo Alto,
CA: Consulting Psychologists Press.
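The way four two-pole dichotomies combine into 16 four-letter types can be shown with a short sketch. The code below enumerates every combination and derives a type code by taking the pole with more points on each dichotomy. The function names and the point values are hypothetical, and the rule of rejecting ties is a simplifying assumption rather than the scoring procedure of the MBTI itself.

```python
from itertools import product

# The four dichotomies described in the text, one letter per pole.
DICHOTOMIES = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def all_types():
    """Return every four-letter combination (2 x 2 x 2 x 2 = 16 types)."""
    return ["".join(letters) for letters in product(*DICHOTOMIES)]


def type_code(points):
    """Pick the higher-scoring pole on each dichotomy.

    points maps each pole letter to hypothetical preference points.
    Ties are rejected here as a simplifying assumption.
    """
    letters = []
    for first, second in DICHOTOMIES:
        if points[first] == points[second]:
            raise ValueError(f"tie on {first}-{second}; no rule assumed here")
        letters.append(first if points[first] > points[second] else second)
    return "".join(letters)


print(len(all_types()))  # 16
example = {"E": 5, "I": 16, "S": 8, "N": 18, "T": 17, "F": 7, "J": 15, "P": 7}
print(type_code(example))  # INTJ
```

Reducing each continuum to a single letter discards how strong each preference is, which is one reason a continuous trait score and a type code can describe the same client quite differently.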
Split-half reliability exceeds 0.90 for the national sample, which is within the acceptable
range. Test-retest reliability (4-week interval) ranged from r = 0.83 to r = 0.97, and
internal consistency (coefficient alpha) for males and females ranged from r = 0.90 to
r = 0.93 (Myers, McCaulley et al., 1998). Validity of the MBTI is moderate to high
when correlated with the five-factor model as portrayed in the NEO-PI-R (Erford,
2006). Construct validity was found for each of the four dichotomies (Erford, 2006;
Myers, McCaulley et al., 1998). More than 3 million people are administered the
MBTI each year (Michael, 2003). This inventory can be used to increase insight
(Fleener, 2001), to assist in career counseling in conjunction with human resource
issues (Capraro & Capraro, 2002), and to identify obstacles to career development
(Healy & Woodward, 1998). Clinicians should note that the artificial manner with
which the MBTI types people may not lead to meaningful descriptions (Vacha-
Haase & Thompson, 1999), and clients may feel restricted by reporting specific be-
haviors, attitudes, career choices, or interests (Watkins & Campbell, 2000) because
of the forced-choice test construction. While the MBTI does appear to measure at
least four important personality dimensions, the evidence does not support the es-
tablishment of 16 unique personality types (Johnson, Mauzey, Johnson, Murphy, &
Zimmerman, 2002). Finally, as with all self-report instruments, it is difficult to con-
firm the accuracy of self-perceptions constituting an MBTI client typology
(Gailbreath, Wagner, Moffett, & Hein, 1997; Gardner & Martinko, 1996), especially
when no response validity measures are provided.
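A split-half coefficient is the correlation between scores on two halves of a test, and it is commonly stepped up to estimate the reliability of the full-length test with the Spearman-Brown formula, r_full = 2r_half / (1 + r_half). The sketch below illustrates that general formula; the function name and the input value are hypothetical, and nothing here describes how the MBTI coefficients reported above were actually computed.

```python
def spearman_brown(r_half, length_factor=2):
    """Estimate reliability after lengthening a test by length_factor.

    With length_factor=2 this is the usual split-half correction:
    r_full = 2 * r_half / (1 + r_half).
    """
    denominator = 1 + (length_factor - 1) * r_half
    if denominator <= 0:
        raise ValueError("r_half is too low for this length_factor")
    return length_factor * r_half / denominator


# Hypothetical correlation of 0.82 between two half-tests.
print(round(spearman_brown(0.82), 2))
```

The correction is needed because each half contains only half the items, and shorter tests generally yield lower reliability coefficients than longer ones.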
Millon Index of Personality Styles Revised (MIPS Revised)
The Millon Index of Personality Styles Revised (MIPS Revised) (Millon, 2003) is a 180-item
true-false Level B self-report instrument for adults ages 18 years and older and is
designed to measure personality styles of normally functioning adults. Scale names
and the profile display of the original MIPS were updated to provide administrators
with a better, more intuitive approach to interpreting test results. This inventory
measures three dimensions of normal personality using 6 Motivating Style scales
(Pleasure-Enhancing, Pain-Avoiding, Actively Modifying, Passively Accommo-
dating, Self-Indulging, Other-Nurturing); 8 Thinking Style scales (Externally
Focused, Internally Focused, Realistic/Sensing, Imaginative/Intuitive, Thought-
Guided, Feeling-Guided, Conservation-Seeking, Innovation-Seeking); 10 Behaving
Style scales (Asocial/Withdrawing, Gregarious/Outgoing, Anxious/Hesitating,
Confident/Asserting, Unconventional/Dissenting, Dutiful/Conforming, Submissive/
Yielding, Dominant/Controlling, Dissatisfied/Complaining, Cooperative/Agreeing);
and 4 Validity Indices that provide information about Positive Impression, Negative
Impression, Consistency, and Clinical Index. The MIPS Revised takes about 30 min-
utes to complete using either the paper-and-pencil or computer format. An 8th-grade
reading level is required, and it is important to designate age and gender to obtain an
accurate report. The MIPS Revised can be scored by hand, computer, mail-in, or op-
tical scanning methods.
The MIPS Revised test offers separate norms for adults and college students, and
for both separate and combined genders. The adult sample consisted of 1,000 indi-
viduals (500 females, 500 males) ages 18-65 years and is stratified according to the
U.S. population by age, race or ethnicity, and education level (Millon, 2003). The
college sample consisted of 1,600 students (800 males, 800 females) selected from 14
colleges and universities to be representative of a college student population in terms
of ethnicity, age, year in school, major area of study, region of the country, and type
of institution. The MIPS Revised can be used as a screening tool in employee selec-
tion; for employee assistance programs and leadership and employee development
programs; in career planning for high school and college students; in the curriculum
for college courses in psychological testing; and in relationship, premarital, marriage,
and individual counseling.
Personality Assessment Inventory (PAI)
The Personality Assessment Inventory (PAI) (Morey, 1991) is used to assess behaviors
related to psychopathology as well as to provide information for screening, clinical
diagnosis, and treatment. It can be administered in individual or group formats to
clients ages 18 years to adult in about 40 to 50 minutes. There are 344 items on this
self-report inventory, and responses are based on a 4-point scale (Not at All True,
Slightly True, Mainly True, and Very True). The PAI requires a 4th-grade reading
level. There are 22 nonoverlapping scales, including 4 validity scales (Inconsistency,
Infrequency, Negative Impression, Positive Impression); 11 clinical scales (Somatic
Complaints, Anxiety, Anxiety-Related Disorders, Depression, Mania, Paranoia,
Schizophrenia, Borderline Features, Antisocial Features, Alcohol Problems, Drug
Problems); 5 treatment scales (Aggression, Suicidal Ideation, Stress, Nonsupport,
Treatment Rejection); and 2 interpersonal scales (Dominance, Warmth). Answers
can be scored by hand or by optical scanning, and raw scores can be converted into
T scores (Boyle, 1995).
Standardization samples conformed to U.S. population demographics with re-
spect to the test's diagnostic groups (Kavan, 1995). Reliability of scores seems ques-
tionable based on the wide range of coefficients for different variables. Internal con-
sistency coefficients for the 22 scales ranged from r = 0.45 to r = 0.90, with a median
of 0.81 (normative sample); from r = 0.22 to r = 0.89, with a median of 0.82 (col-
lege sample); and from r = 0.23 to r = 0.94, with a median of 0.86 (clinical sample).
Median alphas were consistent between various races, ages, and genders in the mid
to high 0.70s. Test-retest reliability coefficients (3- to 4-week interval) ranged from
r = 0.31 to r = 0.92, with a median of 0.82 (Boyle, 1995). Correlation studies with
the Minnesota Multiphasic Personality Inventory (MMPI) and the Marlowe-Crowne
Social Desirability Scale yielded mixed validity results. Even with the disputed reliability
and validity information, Kavan (1995) viewed the PAI as a competitor of the
MMPI-2 that is easier to administer, score, and interpret.
California Psychological Inventory (CPI)
The California Psychological Inventory (CPI) (Gough & Bradley, 1996) is a 434-item
inventory designed to assess personality characteristics and to predict what people
will say and do in specified contexts. The CPI has numerous questions that overlap
with the original MMPI but was designed for a different population and purpose
than the MMPI (i.e., personality descriptions of a nonclinical population). Scale
items measure 20 Folk scales (Dominance, Capacity for Status, Sociability, Social
Presence, Self-Acceptance, Independence, Empathy, Responsibility, Socialization,
Self-Control, Good Impression, Communality, Well-Being, Tolerance, Achievement
via Conformity, Achievement via Independence, Intellectual Efficiency,
Psychological-Mindedness, Flexibility, and Femininity-Masculinity); 3 Vector scales
(Internality-Externality, Norm-Questioning-Favoring, and Self-Realization); and 13
Special Purpose scales. These scales are for clients ages 13 years and older, are writ-
ten at a 5th-grade reading level, and take about 45 to 60 minutes to administer
(Atkinson, 2003). The CPI is self-administered and can be done using either pencil
and paper or a computer. Forms are scanned for automated data entry. Using the
scores from the three Vector scales, a cuboidal personality typology is developed,
which helps to classify individuals into four categories (Atkinson, 2003).
The normative sample (n = 6,000; 3,000 of each gender) was reported as not
being representative or random because of use of primarily high school students
(50%) and undergraduate students (16.7%), so these are probably the best popula-
tions for which to use the instrument, though the manual provides useful reference
tables for comparing students of various ages (Hattrup, 2003). The test produced internal
consistency Cronbach's alpha estimates on the 20 Folk scales ranging from
r = 0.43 to r = 0.85, with a median of 0.76. For the three Vector scales, the internal
consistency estimates ranged from r = 0.77 to r = 0.88. Cronbach's alpha for the 13
specialty scales ranged from r = 0.45 to r = 0.88. Alpha reliabilities of the CPI scales
ranged from r = 0.62 to r = 0.84 in the total sample, with a median of 0.77. Test-retest
reliabilities were based on samples of 108 males and 129 females who were
retested after a 1-year interval, and samples of 91 females and 44 males who were
retested after 5- and 25-year intervals, respectively. For the 1-year retest, scale relia-
bilities ranged from r = 0.51 to r = 0.84, with a median of 0.68. For the 5-year and
25-year retest, reliabilities ranged from r = 0.36 to r = 0.73, and r = 0.37 to r = 0.84,
respectively. Test-retest reliability estimates among high school students were be-
tween 0.60 and 0.80 for a 1-year period. The Folk and Vector scales had moderate
to strong construct validity correlation scores (0.40 to 0.80), but the predictive
power regarding individual behavior in a given situation was weak.
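Test-retest reliability coefficients such as those above are Pearson correlations between the scores the same people obtain on two administrations of a scale. The following is a minimal sketch of that computation; the function name and the scores are hypothetical and do not come from the CPI manual.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need two lists of equal length with 2+ scores")
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("scores must vary on both occasions")
    return cross / sqrt(ss_1 * ss_2)


# Hypothetical scale scores for six clients tested one year apart.
time_1 = [22, 30, 18, 27, 25, 33]
time_2 = [24, 28, 20, 23, 27, 31]
print(round(pearson_r(time_1, time_2), 2))
```

Coefficients typically shrink as the retest interval lengthens, because real change in the people tested adds to measurement error; the lower 5-year and 25-year values reported above are consistent with that pattern.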
Jackson Personality Inventory-Revised (JPI-R)
The Jackson Personality Inventory-Revised (JPI-R) (Jackson, 1994) is an inventory
consisting of 300 true-false statements designed to produce "a set of measures of per-
sonality reflecting a variety of interpersonal, cognitive, and value orientations"
(Jackson, 1994, p. 1). Scale items represent 15 separate personality traits: Analytical
(Complexity, Breadth of Interest, Innovation, Tolerance); Emotional (Empathy,
Anxiety, Cooperativeness); Extroverted (Sociability, Social Confidence, Energy
Level); Opportunistic (Social Astuteness, Risk Taking); and Dependable (Organiza-
tion, Traditional Values, Responsibility). This inventory is used for adolescents and
adults and takes approximately 35 to 45 minutes to administer. Raw scores range
from 0 to 20 and are converted to a profile sheet that references gender-specific
norms using a vertical grid. Scoring can be done by hand in 3 minutes or can be
done by mail, computer, or online to produce a comprehensive client report. Sample
items include "I usually read several books at the same time," "I enjoy taking risks,"
and "I am seldom at a loss for words." The JPI-R is a Level B instrument.
Internal consistency reliability estimates for the JPI-R were obtained from four
college volunteer samples using the Cronbach alpha estimate (Jackson, 1994). In the
largest college normative sample (n = 1,107), alpha estimates ranged from r = 0.66
for the Complexity, Tolerance, and Social Astuteness scales to r = 0.87 for the
Innovation scale. In all four samples, the reliability estimates range from r = 0.62 for
Social Astuteness to r = 0.88 for Social Confidence. In two studies, median internal
consistency reliabilities (Bentler's Theta) were 0.90 and 0.93. Tables in the manual
provide validity correlations for the JPI-R with other psychological variables and
scales, including the Minnesota Multiphasic Personality Inventory (MMPI), the Survey
of Work Styles (SWS), and the Jackson Vocational Interest Survey (JVIS). Counselors
will find the manual instructions for administration and scoring easy to follow and
are cautioned that the JPI-R cannot be used to diagnose pathology (Pittenger, 1998).
The JPI-R is a helpful measure of client dispositions and can be used to help clients
develop insight and understand sources of resiliency. Table 8.4 provides a sample
computerized interpretive report for the JPI-R.
Table 8.4 Jackson Personality Inventory-Revised (JPI-R) Basic Report for Sam Sample, a 30-year-old male
Your JPI-R Scale Profile
The profile below is based on your responses to the JPI-R. For a better understanding of your scores, study the definitions and
scale descriptions that follow the profile.
Scale                  Raw   Combined %ile   Female %ile   Male %ile
Complexity              14         90             88           92
Breadth of Interest     19         96             96           96
Innovation              18         86             90           84
Tolerance               17         93             93           95
Empathy                 11         38             24           54
Anxiety                  2          2              1            4
Cooperativeness          1          4              3            4
Sociability             12         69             66           73
Social Confidence       18         86             86           86
Energy Level            18         92             96           88
Social Astuteness       12         73             76           69
Risk Taking             17         97             99           95
Organization            14         66             66           69
Traditional Values       3          4              3            4
Responsibility          14         50             38           58
[Male percentile bar graph: horizontal axis from 10 to 100, one bar per scale]
RAW SCORE Your raw score for each scale is based on your responses to the statements that make up that
scale. A high raw score indicates that you endorsed many of that scale's statements.
COMBINED PERCENTILE This score is determined by comparing your raw score for each scale with the corresponding
scores of a representative group consisting of both men and women. Your score is the
percentage of the people in the representative group who received a score equal to or less than
your score.
FEMALE PERCENTILE This score is the percentage of women in the representative group who received a raw score
equal to or less than your score. Use this score to determine how you compare to members of
the opposite sex.
MALE PERCENTILE This score shows how you compare to members of your own sex. Your score is the percentage
of men in the representative group who received a raw score equal to or less than yours. The
bar graph at the right of your profile is based on this score.
[Examples of Selected Scale Descriptions]
COMPLEXITY
Your percentile rank on the Complexity scale is 92, placing you in the extremely high range.
Higher Scorer Seeks intricate solutions to problems; is impatient with oversimplifications; is interested in pursuing
topics in depth regardless of their difficulty; enjoys abstract thought; enjoys intricacy.
Low Scorer Prefers concrete to abstract interpretations; avoids contemplative thought; uninterested in
probing for new insight.
ANXIETY
Your percentile rank on the Anxiety scale is 4, placing you in the extremely low range.
Higher Scorer Tends to worry over inconsequential matters; more easily upset than the average person;
apprehensive about the future.
Low Scorer Remains calm in stressful situations; takes things as they come without worrying; can relax in
difficult situations; usually composed and collected.
Your JPI-R Cluster Profile
Cluster          Raw   Combined %ile   Female %ile   Male %ile
Analytical        68         97             97           97
Emotional         14          4              1            7
Extroverted       48         90             92           88
Opportunistic     29         96             99           90
Dependable        31         24             18           31
[Male percentile bar graph: horizontal axis from 10 to 100, one bar per cluster]
JPI-R Cluster Descriptions
The following cluster descriptions list the JPI-R scales that make up each cluster, as well as some of the traits found in high and
low scorers. Also listed is the range into which your cluster score falls. Use this range to determine how strongly the high and/or
low score traits apply to you. For more information on the scale scores that make up each of your cluster scores, refer back to the
profile at the beginning of this report.
ANALYTICAL
Your percentile rank on the Analytical cluster is 97, placing you in the extremely high range.
Your score on this cluster is derived from your scores on the JPI-R COMPLEXITY, BREADTH OF INTEREST,
INNOVATION, and TOLERANCE scales. If you score high on this cluster of four scales, you might be expected to consider
arguments from multiple points of view and may be inclined towards drawing distinctions among otherwise related elements of
information. On the other hand, if you score low on this cluster, you might be expected to think of things in more black-and-
white terms and to prefer straightforward, linear interpretations of events.
EMOTIONAL
Your percentile rank on the Emotional cluster is 7, placing you in the extremely low range.
This second cluster includes the JPI-R EMPATHY, ANXIETY, and COOPERATIVENESS scales. A high score on this cluster
indicates that you may express your feelings readily and that you may have difficulty hiding your emotions, especially under
stressful conditions. If your score is low, you may be relatively unaffected by emotionally arousing situations and by social
pressure.
EXTROVERTED
Your percentile rank on the Extroverted cluster is 88, placing you in the very high range.
The JPI-R SOCIABILITY, SOCIAL CONFIDENCE, and ENERGY LEVEL scales make up this cluster. A high score on this
cluster suggests that you are outgoing, sociable, and active. A low score indicates that you may be more introverted and less active.
OPPORTUNISTIC
Your percentile rank on the Opportunistic cluster is 90, placing you in the very high range.
Your score on this cluster is based on your scores on the JPI-R SOCIAL ASTUTENESS and RISK TAKING scales. If you scored
high on this cluster, you may be described as diplomatic, persuasive, skeptical, worldly, and charming. A low score suggests that
you may be more direct, less adventurous, and less uncritical of the self-serving intentions of others.
DEPENDABLE
Your percentile rank on the Dependable cluster is 31, placing you in the low range.
This cluster includes the JPI-R ORGANIZATION, TRADITIONAL VALUES, and RESPONSIBILITY scales. If your score on
this cluster is high, you may tend to be methodical, predictable, systematic, conservative and mature in your attitudes. Should
you score low, you may be considered to be more liberal-minded and flexible in your thinking, but less organized in your work
habits.
Source: Reproduced by permission of Sigma Assessment Systems, Inc., P.O. Box 610984, Port Huron, MI 48061-0984.
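The report in Table 8.4 defines each percentile as the percentage of people in the representative group who received a raw score equal to or less than the client's score. A minimal Python sketch of that definition follows. The function name and the ten-person norm group are invented for illustration; they are not JPI-R normative data, and the publisher's actual conversion tables may involve smoothing or other adjustments.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring at or below raw_score."""
    if not norm_scores:
        raise ValueError("norm group is empty")
    at_or_below = sum(1 for s in norm_scores if s <= raw_score)
    return 100.0 * at_or_below / len(norm_scores)


# Hypothetical norm group of ten raw scores on a 0-to-20 scale
norm_group = [2, 5, 8, 11, 11, 12, 14, 14, 17, 19]
rank = percentile_rank(14, norm_group)
```

Computing the same raw score against a male-only, female-only, or combined norm group is what would produce the three different percentile columns shown in the profile.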
Piers-Harris Children's Self-Concept Scale-Second Edition
(Piers-Harris-2)
The Piers-Harris Children's Self-Concept Scale, Second Edition (Piers-Harris-2) (Piers
& Herzberg, 2002) is a 60-item self-report inventory used for children ages 7-18
years who are able to read at a 2nd-grade reading level. The Piers-Harris-2 is designed
to aid in the assessment of self-concept in children and adolescents. This inventory
measures six cluster scales of Behavioral Adjustment (BEH), Intellectual and School
Status (INT), Physical Appearance and Attributes (PHY), Freedom from Anxiety
(FRE), Popularity (POP), and Happiness and Satisfaction (HAP). Sample items in-
clude "I am smart," "I feel left out of things," and "I think bad thoughts." The Piers-
Harris-2 takes about 10 to 15 minutes to complete using paper and pencil or com-
puter and is available in Spanish. The inventory requires children to circle either Yes
or No to indicate whether the statement describes the way they feel about them-
selves. Raw scores (total number of responses marked in the positive direction) can
be converted to percentiles, stanines, and T scores and are available in the form of an
overall self-concept score or as a profile of six cluster scores. Scoring can be accom-
plished by mail, fax, or computer (Piers & Herzberg, 2002).
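The conversions mentioned above (raw scores to T scores and stanines) can be sketched with the conventional linear definitions: a T score has a mean of 50 and a standard deviation of 10, and a stanine has a mean of 5 and a standard deviation of 2, rounded to a whole number from 1 to 9. The norm mean and standard deviation below are hypothetical, not Piers-Harris-2 values, and published manuals often use normalized lookup tables instead of these linear formulas.

```python
import math


def z_score(raw, norm_mean, norm_sd):
    """Standard score: distance from the norm mean in SD units."""
    return (raw - norm_mean) / norm_sd


def t_score(raw, norm_mean, norm_sd):
    """Linear T score (mean 50, SD 10)."""
    return 50 + 10 * z_score(raw, norm_mean, norm_sd)


def stanine(raw, norm_mean, norm_sd):
    """Linear stanine (mean 5, SD 2), rounded half up and clipped to 1..9."""
    value = math.floor(2 * z_score(raw, norm_mean, norm_sd) + 5.5)
    return max(1, min(9, value))


# Hypothetical norms for a 60-item total score
NORM_MEAN, NORM_SD = 45, 10
t = t_score(55, NORM_MEAN, NORM_SD)
s = stanine(55, NORM_MEAN, NORM_SD)
```

A raw score one standard deviation above the norm mean therefore corresponds to a T score of 60 and a stanine of 7 under these assumptions.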
Restandardization of the Piers-Harris-2 utilized a sample of 1,387 students ranging
from 7 to 18 years of age. These students were recruited from school districts all
across the United States closely representing the ethnic composition of the U.S. pop-
ulation according to the 2001 Bureau of the Census. Alpha coefficients for the Piers-
Harris-2 cluster scale restandardization sample ranged from r = 0.74 for the
Popularity scale to r = 0.81 for the three scales of Behavioral Adjustment, Intellectual
and School Status (INT), and Freedom from Anxiety (FRE) (Piers & Herzberg,
2002). Although test-retest reliability for the Piers-Harris-2 is not available, data for
the original 80-item Piers-Harris reported reliability of r = 0.77 (2-month interval)
and r = 0.77 (4-month interval) (Piers & Herzberg, 2002). Hattie (1992) reported
a test-retest study (4-week interval) for the Piers-Harris total score and the six clus-
ter scales using a sample of 135 Australian students in grades 10 through 12.
Reliability coefficients ranged from r = 0.65 for the Happiness and Satisfaction scale
to r = 0.88 for the Physical Appearance and Attributes scale (Piers & Herzberg,
2002). The psychometric properties of the original Piers-Harris were also reviewed
favorably (e.g., Chiu, 1988; Epstein, 1985; Jeske, 1985). The self-report feature was
also viewed as a positive (Gans, Kenny, & Ghany, 2003; Riddle & Bergin, 1997).
Professional counselors should note that the Piers-Harris-2 is not recommended for
children who are unwilling or unable to cooperate in completing the questionnaire.
It is also not recommended for children who are overtly hostile, uncooperative, un-
communicative, prone to exaggeration or other distortions, or disorganized in their
thinking. Children with poor English-language verbal ability will have difficulty
completing the scale. Spanish-speaking children should use the Spanish version of
the Piers-Harris-2.
Factor analysis of the Piers-Harris basically confirmed the original factor struc-
ture (Alexopoulos & Foudoulaki, 2002). Lower subscale reliabilities mean interpre-
tation of profile strengths and weaknesses should be undertaken with caution
(Cooley & Ayres, 1988; Erford, 2006). The scale's question-and-response format has
been criticized by Strein (1995) because a Yes and No response format does not allow
a child to indicate the degree of agreement or disagreement. Marsh and Holmes
(1990) noticed many children struggling to respond accurately to questions that
were scored in the negative (e.g., "My family is disappointed in me"), thus throwing
into question the validity of some scores.
The Piers-Harris-2 is cost-effective, time-efficient, and easy to use and yields re-
liable and valid scores in the measurement of children's self-concept (Erford, 2006).
Jeske (1985, p. 1170) indicated the original Piers-Harris "appears to be the best children's
self-concept measure currently available." This has not changed in the interim,
as verified by Kelley (2005).
Coopersmith Self-Esteem Inventories
The Coopersmith Self-Esteem Inventories (Coopersmith, 1981) are individual- or
group-administered questionnaires used to determine personal valuation of self
(Peterson, 1985). The two forms (School Form and Adult Form) were developed
based on the assumption that self-esteem is associated with effective functioning
(Sewell, 1985). The School Form is a 58-item form used with students ages 8-15
years. Built into the form is a Lie scale, which consists of eight questions that are
scored separately from the self-esteem inventory. The Lie scale is used to determine
defensiveness in the client's responses (Coopersmith, 1989). There is also a School
Short Form that consists of 25 questions, on which the Adult Form is based. The
Adult Form is used for clients over 15 years of age. The standardization sample information
is not adequate, but several researchers have collected supplemental samples
since the original inventory was standardized (Coopersmith, 1989). The reliability
information indicates internal consistency coefficients ranged from 0.87 to
0.92 for 4th- through 8th-graders for the total score (Sewell, 1985). Validity was re-
ported as being sufficient, but conclusive evidence was not presented, and very little
reliability or validity information is presented for the Adult Form. The Adult norm
sample was composed of 226 college students from northern California, and the re-
liability scores ranged from 0.78 to 0.85, but no further information was provided
(Coopersmith, 1989). While internal consistency estimates appear to indicate the
two forms may have some value as screening-level tests, the difficulty in defining and
measuring the concept of self-esteem remains problematic. For example, according
to the manual, there are no clearly defined criteria for determining low, medium, or
high levels of self-esteem, although higher scores are indicative of higher self-esteem.
The manual has a section for building self-esteem in students and provides some sug-
gestions and techniques. Researchers are divided about whether to recommend the
use of the inventory, but it is one of the most widely used measures of its kind
(Peterson, 1985; Sewell, 1985).
Tennessee Self-Concept Scale-Second Edition (TSCS-2)
The Tennessee Self-Concept Scale — Second Edition (TSCS-2) (Fitts & Warren, 1996)
is one of the most commonly used self-report measures of self-concept and can be
used for children and adults. The test was standardized on 3,000 subjects, ages 7-90
years, and can be administered to individuals or groups in about 10 to 20 minutes.
Table 8.5 Scales on the Tennessee Self-Concept Scale-Second Edition
Self-concept scores: Physical, Moral, Personal, Family, Social, Academic/Work
Supplementary scores: Identity, Satisfaction, Behavior
Validity scores: Inconsistent Responding, Self-Criticism, Faking Good, Response Distribution
Summary scores: Total Self-Concept, Conflict
The Adult Form is designed for clients ages 13 years or older and has 82 items. The
Child Form is designed for students ages 7-14 years and has 76 items. A Short Form
consisting of the first 20 items of either form can be used as well. Items comprise 15
subscales and a total Self-Concept score (see Table 8.5). The items are rated on a
5-point Likert scale ranging from Always False to Always True. The TSCS-2 can be
hand-scored in approximately 10 minutes, or computer-scored (Western
Psychological Services, 2003c). Reliability is adequate, with lower internal consisten-
cies on subscales than Total Self-Concept, ranging from r = 0.73 to r = 0.93. Test-
retest reliability scores ranged from r = 0.47 to r = 0.83 (Brown, 1998). Fitts and
Warren (1996) reported acceptable levels of score validity for the TSCS-2.
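Test-retest reliability coefficients such as those cited above are typically Pearson correlations between scores from the same people on two administrations. The following is a minimal sketch under that assumption; the function name and the five pairs of scores are invented for illustration and do not come from the TSCS-2 manual.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five clients tested twice
time1 = [10, 12, 14, 16, 18]
time2 = [11, 11, 15, 15, 19]
retest_r = pearson_r(time1, time2)
```

A coefficient near 1.0 indicates that clients kept nearly the same rank order across the two administrations; longer retest intervals generally yield lower values, as the ranges reported in this chapter suggest.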
Think About It 8.2 Using the self-concept scales from Table 8.5, discuss
with an acquaintance his or her levels of self-concept in each category.
Notice whether there is consistency among the categories. What causes these
consistencies or inconsistencies?
PROJECTIVE APPROACHES TO ASSESSMENT
In contrast to structured assessments of personality, which limit possible client re-
sponses, projective assessments present clients with unstructured, ambiguous stim-
uli and allow a virtually unlimited range of potential responses. Personality assess-
ment using projective techniques is based on the projective hypothesis, the
assumption that essential information about a client's personality characteristics,
needs, conflicts, and motivations will be transferred onto ambiguous stimuli.
Projective techniques are disguised and vague by design and provide clients only
minimal instructions in order to reduce external structure and force clients to im-
pose structure according to internal (intrapersonal) characteristics.
Projective personality assessment is based on the psychoanalytic notion of the
unconscious, that portion of one's personality that is beyond awareness and control.
According to Freud (1961, 1923, 1924), valuable understanding of one's true nature
is obtained from the dark recesses of one's unconscious emotional and thought
processes, not what is present or spoken from one's conscious mind. Freud also be-
lieved in the prominence of drive and instinct, which lead one to gratify needs while
reducing tension over unfulfilled needs. Freud's concept of psychic determinism — that
every action undertaken is done so for a reason or particular purpose — is also a key
to understanding personality. Altogether, then, Freud's psychoanalytic theory pro-
poses that when a client is presented with ambiguous stimuli and asked to respond
to the stimuli in some way (and there is not necessarily a right or wrong way to re-
spond), the client cannot help but exhibit actions and responses driven by uncon-
scious processes that reveal internal emotional or thought processes, representing
needs and desires requiring expression and gratification. Therefore, the key is to de-
velop techniques that will help clinicians gain access to a client's unconscious, allow-
ing inferences to be made about the client's personality and personal adjustment.
Such techniques are called projective techniques.
If a professional counselor places a client in an unstructured, ambiguous cir-
cumstance, the client will attempt to bring order and meaning to chaos. And how
the client brings structure to the disorder yields valuable insights into the client's
unconscious processes and serves as an indirect glimpse into the client's inner
world. There are many projective techniques available for use by professional coun-
selors, depending upon education, licensure, and professional training and expe-
rience. These techniques vary in degree of standardization, with some having rather
specific directions for administration and scoring. Often the interpretation of these
techniques is less standardized, leading to subjective judgments based upon the
professional counselor's theoretical orientation and clinical experience. Projective
techniques are classified according to the nature of the ambiguous task and how
clients are required to respond. The following five types of projective techniques
represent a comprehensive categorization: (1) association techniques; (2) picture-
story construction techniques; (3) verbal completion techniques; (4) choice
arrangement techniques; and (5) production-expression techniques.
The Rorschach Inkblot Test is an example of an association technique and is quite
possibly the best-known projective test ever developed. The Rorschach is reviewed in
greater detail below, but Figure 8.1 presents a sample inkblot of the type included on
the Rorschach. Proponents of association techniques propose that such procedures
reveal details of the unconscious realm, similar to the way x-rays reveal the inner
realm of the body. Clients project their inner organization onto the inkblot, and ex-
aminers interpret these attempts to organize the vague stimuli. A second example of
an association task is word association. For this task, examiners present a list of neu-
tral (e.g., wood, spoon) or emotionally laden (e.g., father, sex) words one at a time,
and the client responds with the first idea, image, or word that comes to mind.
Examiners generally record the response; the amount of time required to respond
(i.e., latency effects, with lengthier time periods supposedly revealing the degree of
inner conflict/turmoil); and expressions of emotion while responding (e.g., anger,
embarrassment). Responses to association technique stimuli are usually compared
Figure 8.1 Sample inkblot
with responses of nonclinical individuals to determine whether responses are "nor-
mal" or pathological. Interpretation of themes and content categorizations is then
conducted to reveal insights into personality functioning, inner needs, and conflicts.
Picture-story construction techniques usually involve showing a client a pic-
ture or other visual stimulus and requiring the client to construct a story about the
picture. The stimulus pictures vary in terms of scenery, people, and social situations.
The most commonly used construction technique is the Thematic Apperception Test
(TAT). A sample picture stimulus similar to a TAT card is presented in Figure 8.2.
The Children's Apperception Test (CAT) and Roberts Apperception Test for Children
(RATC) are examples of picture-story construction techniques commonly used with
children and adolescents. For Hispanic clients, another example would be the Tell
Me a Story (TEMAS). Each of these tests is reviewed in greater detail below. The
common strand through picture-story construction techniques is that the client is
shown a stimulus picture and then asked to tell a good story about the picture. The
story should describe what led up to the depicted scene, what is currently happen-
ing, and what the likely outcome of the story will be. While some of the pictures
may "pull" for different content and emotion, most are neutral and simply reflect
the unconscious process of the client. In other words, the client is given no reason to
tell a particular story about a given card in a particular manner. The assumption is
that the story the client tells, and the manner in which the client tells it, reflect some
inner need that surfaces in response to that given stimulus picture. In this way,
Figure 8.2 Sample picture-story card
clients convey inner thoughts and emotions and provide the content for clinicians to
interpret and contextualize.
Verbal completion techniques consist of verbal content presented in an incom-
plete format, requiring the client to complete the stimulus. Sentence and story com-
pletion tasks are among the more commonly used completion techniques. For ex-
ample, a client may be presented with a sentence stem (e.g., "I think . . ." or "Other
people treat me like . . .") and be asked to complete the stem. As with any projective
technique, the client is given no reason to provide any specific response. The as-
sumption is that some internal need, emotion, or thought is being expressed in the
face of a vague, ambiguous stimulus (e.g., "I think dogs are cute" versus "I think men
are horrid creatures," or "Other people treat me like a princess" versus "Other peo-
ple treat me like I am invisible"). The Forer Structured Sentence Completion Test is a
good example of this type of projective assessment and is reviewed in greater detail
below. A story completion test presents the client with the start of a story and requires
the client to finish the story. For example, the professional counselor may begin by
saying, "A woman leans over to kiss a man on the cheek. The man suddenly pulls
away and looks angry. Why?" The content of client responses is recorded verbatim
and thematically analyzed. An example of a story completion task that also uses pictures
is the Rosenzweig Picture-Frustration Study (Rosenzweig, 1949), in which 24
cartoons depicting a potentially frustrating situation are presented to a client. Each
cartoon has a situation written in one of the "thought bubbles," and the other bub-
ble is blank. The client indicates a verbal reaction (orally or in writing) to each stim-
ulus. Responses are scored in one of three ways: (1) evasion of frustration; (2) frus-
tration directed at other people or objects; or (3) frustration directed at self.
Think About It 8.3 Construct 5 to 10 incomplete sentences and "administer"
them to several associates. What themes or patterns emerged? Do
statements phrased in certain ways lead to more predictable results?
How could you use projective techniques in your practice as a professional
counselor?
Choice arrangement techniques make up a diverse category, the commonality
being that clients are given several to numerous options to rank-order or select from.
Young children are often given the choice of which toys or dolls to play with in therapy.
Again, the child is given no reason to choose any given puppet, doll, or other toy,
or to play with or tell stories using it in the particular manner he or she does. It is as-
sumed that the child's selection and ensuing actions and verbalizations are the expres-
sion of some inner motivation. Alternative choice arrangement projective techniques
include arranging pictures or words along a like-dislike continuum or a multiple-choice
response format designed for a Rorschach-like inkblot test. Of course, when an
examiner uses a choice arrangement format, the examinee's potential range of choices
becomes restricted. In some ways, this defeats the purpose of a projective technique,
which is to allow clients maximum leeway to respond from the unconscious.
Importantly, research supporting the use of choice techniques for assessment is very
sparse when compared with that available for other types of projective assessment.
Production-expression techniques require clients to actively participate in the
assessment by creating some product that can be analyzed and interpreted to reveal
facets of the client's personality. Commonly used techniques include drawings (e.g.,
House-Tree-Person, Human Figure Drawing, Kinetic Family Drawing, Kinetic School
Drawing), painting or coloring, or a dramatic performance (e.g., psychodrama).
Drawing techniques are by far the most commonly used assessment devices from
this category. Importantly, how clients act and respond to verbal queries while engag-
ing in this task is just as important as any characteristics of the final product, and
professional counselors using these techniques are strongly encouraged to observe,
and ask follow-up questions of, clients creating expression products. When using a
drawing technique, such as the Human Figure Drawing, clients are usually given a
blank sheet of paper and pencil (or pens, colored pencils, crayons, etc.) and asked to
draw a picture of a person. Interpretation of these drawings varies widely, depending
on the professional counselor's theoretical orientation, training, and focus.
Some test manuals and textbooks offer specific guidance for interpreting draw-
ing characteristics, or even specific objects within a drawing. For example, aggres-
sion may be indicated by heavy, dark lines; low self-esteem may be indicated by a
small drawing. Handler (1996) suggested that particular attention be paid to era-
sures, placement of the figure on the paper, too much or too little detail, shading and
heavy or pressured lines, among other things. Of critical importance is that examin-
ers not give too much emphasis to any one sign. Also, the professional counselor
should never rely solely on the drawn product for interpretive insights. It is excellent
practice to query the client about a drawing in order to understand what the drawing
might represent to the client. The best use of drawing characteristics and behaviors
is for generating hypotheses to be tested out using more structured and systematic
methods. Figures 8.3 through 8.5 display examples of various projective drawing
techniques.
Figure 8.3 House-Tree-Person drawings by a self-conscious, perfectionistic
teenage girl
Figure 8.4 Kinetic Family Drawing by a 12-year-old boy with a fine-motor
Coordination Disorder and AD/HD-Predominantly Inattentive Type
Figure 8.5 Kinetic School Drawing by a 12-year-old boy with a fine-motor
Coordination Disorder and AD/HD-Predominantly Inattentive Type
Strengths and Weaknesses of Projective Techniques
Projective techniques have a number of noteworthy positive points and have
remained popular over the past half century (Bellak, 1992; Piotrowski & Zalewski,
1993; Watkins, 1991). Some clinicians believe that projective techniques are great
icebreakers and rapport builders when beginning an evaluation or counseling
relationship with children or adolescents, because these techniques are generally
perceived as nonthreatening, and clients need not worry about whether a particular
answer was right or wrong. Clients generally are not limited in the number or type
of responses they can make. This allows the unconscious processes maximum leeway
in projecting inner needs and motivations onto the stimulus. Also, because clients
are not generally familiar with the scoring and interpretive strategies of projective
techniques, many clinicians believe responses to projective tests are more difficult to
fake than for structured tests, although this is not necessarily the case (Masling,
1960).
Projective techniques may have valuable cross-cultural applications, especially
when the stimulus involves inkblots, drawings, or brief verbal stems. Most projec-
tives require no or very little reading ability, so they may be helpful in the assessment
of young clients and clients with poor literacy skills. Likewise, because some projec-
tive techniques require a minimum of verbal input and output, they may be helpful
techniques for use with young clients, clients from diverse cultures, or clients with
speech and language disorders. Finally, because projective techniques are based on
psychoanalytic theory, complex, multidimensional themes may emerge and provide
valuable insights into the client's personality.
Projective assessment techniques also have numerous limitations. Projective
techniques must be administered individually by highly educated and trained indi-
viduals and therefore are expensive to administer, score, and interpret. Subjective
scoring and interpretive procedures make results difficult to replicate. Interpretation
is often the most subjective part of the process. Indeed, many projective devices ap-
pear to allow wide-ranging judgments on the part of the examiner when scoring and
interpreting a client's results.
Subjectivity in scoring and interpretation inevitably leads to concerns over reli-
ability and validity of scores. Indeed, projective techniques display poor psychomet-
rics. Scorer reliability, test-retest, and internal consistency coefficients tend to be un-
acceptably low. As stated earlier, low reliability leads to low score validity, and the
research on projective score validity is, at best, inconclusive (Anastasi & Urbina,
1997).
Most projective tests have either absent or inadequate norms. When norms are
provided, the samples are often described in vague terms. In addition, often the com-
parison groups are not normal samples, but clinical populations, negating a valuable
comparison group for the determination of potential pathology; that is, if a client's
responses are compared with clinical patients and not "normal" individuals, how can
a clinician decide whether the client's responses are normal? Still, projective tech-
niques help to "flesh out" our understanding of clients in an open-ended manner
that is often missing in objective personality inventories.
Projective techniques have been shown to be susceptible to outside influences,
such as examiner characteristics, examiner bias (i.e., theoretical orientation), or vari-
ations in administration directions. In addition, the validity of the "projective hy-
pothesis" itself has been called into question because responses may reflect state-de-
pendent characteristics rather than enduring personality characteristics. This is a
critical point, because the whole idea behind projective assessment is to access the
unconscious in order to understand the client's psychic determinism. If the client's
"present state of mind" is being measured rather than some enduring personality
structure, the goal of accessing the unconscious processes of the client's personality
is thwarted. The final limitation involves the difficulty (or impossibility) of actually
scientifically studying Freud's psychoanalytic developmental theory, given its empha-
sis on unconscious psychological processes. As psychoanalytic theory forms the basis
of projective testing, this limitation is quite significant.
As a final comment on projective techniques, Anastasi and Urbina (1997) sug-
gested projective techniques are better used as clinical tools rather than as tests per
se. Given the low standard of psychometric rigor, such a guarded approach is war-
ranted. Projectives are quite helpful when used for hypothesis generation and for
helping clients gain insight into unconscious needs and motivations, as well as aids
for qualitative interviewing, but their technical limitations militate against use for
diagnostic purposes.
SOME COMMONLY USED PROJECTIVE TECHNIQUES
Rorschach Inkblot Test
The Rorschach Inkblot Test (Rorschach, 1921/1998), originally developed by Hermann
Rorschach in 1921, is the best-known and most used projective test. The test's pur-
pose is to assess how a client perceives and organizes thoughts about the world. The
test is a Level C instrument and is individually administered to clients ages 5 and
older, in about 20 to 30 minutes (Hess, Zachar, & Kramer, 2001). It consists of 10
plates of bilaterally symmetrical inkblots (Janda, 1998): 5 are black and white; 2 are
black, white, and red; and the remaining 3 are comprised of pastel colors (Hess et al.,
2001). Clients are presented with the cards and asked what they think of the inkblot
or what it might be. In the second part of administration, clients are asked to explain
their original answers. Scoring and interpretation are frequently completed using a
scoring system originally developed by John Exner in the 1970s called the
Comprehensive System for Administering, Scoring, and Interpreting the Rorschach (Exner,
2002). Exner's multifaceted system involves interpretation of three aspects of
responses: Location (W for the entire blot, D for a major portion of the blot, and Dd for
uncommon responses); Determinants (there are nearly two dozen having to do with
shape, activity of humans, chromatic features, etc.); and Content (there are 26 cate-
gories used to interpret the content of the story). A Structural Summary is composed
based on an interpretive rating scale developed by Exner (Janda, 1998).
As with many projective tests, it is often hard to find concrete empirical data on
the Rorschach. Subjectivity is such a part of interpretation, and there can be definite
diversity in administration procedures depending on testing purpose and clinician
training. It has been noted that well-trained users of Exner's scoring system agree on
the major variables over 88% of the time (Hess et al., 2001). Still, there is substan-
tial debate over the interrater reliability of Exner's system. Exner purports that test-
retest reliability estimates are at or above r = 0.70 at both 1-year and 3-year intervals.
According to Hess et al. (2001), validity data of the Rorschach also yield many ques-
tions and concerns. Various questions of subjectivity arise based on administration,
scoring, and interpretation procedures. Still, even with the lack of standardization
and empirical data, the Rorschach used in conjunction with Exner's Comprehensive
System (2002) is a better personality test than most opponents will acknowledge
(Hess et al., 2001). Critics of the Rorschach point out that statistical prediction is usually
more accurate than clinical prediction (i.e., judgment), and the Rorschach relies pri-
marily on clinical prediction to measure personality. Far more psychometric research
needs to be done using the Rorschach, but it has the potential to generate meaning-
ful personality data (Hess et al., 2001).
Thematic Apperception Test (TAT)
The Thematic Apperception Test (TAT) (Murray & Bellak, 1973) is used to measure
various aspects of a client's personality. Clients are presented with 31 picture cards
and are asked to create stories based on the images. There is no time limit for this
assessment, and it can be administered to children and adult clients. Specific scoring
criteria are provided in the scoring protocol and assessment booklet. Many
administrators choose 8 to 12 cards to use with a client. Six elements are considered when
examining stories: (1) the hero; (2) the needs or motives and feelings; (3) presses or
environmental forces; (4) outcomes; (5) recurring themes in the story; and (6) inter-
ests and sentiments (Janda, 1998).
According to Janda (1998), although several clinicians have determined new
scoring criteria for the TAT, most adhere to Murray's original scoring format. Janda
reported that this method can often be unstructured and biased, leading to inade-
quate score reliability and validity.
Children's Apperception Test-1991 Revision (CAT)
The CAT (Bellak & Bellak, 1992) assesses personality by interpreting story responses
to presented picture stimuli. The CAT is administered to children ages 3-10 years in
about 15 to 20 minutes. The child is presented with stimulus cards that show
animals engaged in human relationship-oriented interactions. The client then gives
perceptions, interpretations, and responses, and must solve developmental problems
(Knoff, 1998). The 10 stimulus cards address the following: feeding problems; oral
problems; sibling rivalry; attitudes toward parents; relationships to parents as (sexual)
couples; jealousy toward same-gender parent figures; fantasies about aggression; ac-
ceptance by the adult world; fear and loneliness at night; and toileting behavior and
parents' responses to it. There are 10 variables that are used to analyze responses:
Main Theme; Main Hero; Main Needs and Drives of the Hero; the child's
Conception of the Environment; how the child sees and reacts to the figures in the
cards; Significant Conflicts described; the Nature of the Child's Main Anxieties; the
Child's Main Defenses; the Adequacy of the Superego as Manifested by Punishment
for Crime; and the Integration of the Child's Ego (Knoff, 1998). The assessment
comes with 10 additional cards that can supplement the CAT (Reinehr, 1998).
Specific scoring and interpretive instructions are included in the interpretive manual
(Knoff, 1998).
The authors state that there is no need for standardization or empirical data for
a projective test like the CAT, and few specifics are provided in the manual (Bellak
& Bellak, 1992). Due to the lack of statistical data, clinicians should be careful not
to base any clinical diagnosis or intervention on this assessment (Knoff, 1998).
Reinehr (1998) agreed, finding no basis for the argument that projective assessments
need no empirical data.
Roberts Apperception Test for Children-Second Edition
(Roberts-2)
The Roberts-2 (McArthur & Roberts, 1994) is a projective test designed to measure
children's social perceptions. The test can be administered to children ages 6-15
years in about 20 to 30 minutes. The child is presented with 16 different test pictures
and is asked to tell a story about each one. Scoring criteria for each picture are pre-
sented in the manual and based on the presence or absence of certain characteristics
in the narrative. The three scales measured are Adaptive, Clinical, and Clinical
Indicators (Cosden, 2001). There are seven main constructs on which scoring crite-
ria are based, and each has several subconstructs. The seven main constructs are:
theme overview, problem identification, outcome, available resources, emotion, res-
olution, and unusual or atypical responses. According to the test's publisher, new
standardization studies were conducted and conformed to U.S. population demo-
graphics in terms of gender, ethnicity, and parental education, although specific in-
formation about the sample is not provided (Cosden, 2001), so generalizability is
questionable.
Although minimal information is available online for the Roberts-2, the manual
contends that validity for derived test scores is adequate. However, Waller (2001) as-
serted the original version of the test relies too heavily on doctoral dissertations and
findings are not published in refereed journals, making it difficult to evaluate score
validity. A new version of the Roberts-2 (McArthur & Roberts, 2005) became avail-
able in 2005.
House-Tree-Person (H-T-P) Projective Drawing Technique
The House-Tree-Person (H-T-P) Projective Drawing Technique (Buck, 1964) is a
widely used projective test that is easy to use and time-efficient (Western
Psychological Services, 2003b). It can be used for clients ages 3 and older (see Figure
8.6). The client draws three objects (a house, a tree, and a person) and then
describes, defines, and interprets the drawings. House-Tree-Person is often used as the
first test in an assessment for a counseling session, because drawing tends to reduce
tension. It is useful for assessing personality in people from different cultures, those
deprived of educational opportunities, and those developmentally delayed or
non-English-speaking; in addition, it is highly sensitive to the early presence of
psychopathology (Western Psychological Services, 2003b). Examiners must always be
careful to validate observations from projective techniques through other assessment
methods and not to overinterpret meanings of specific objects or designs drawn in a
picture.
Figure 8.6 House-Tree-Person drawings by a teenager with AD/HD and fine-motor
coordination difficulties
Kinetic Drawing System for Family and School (KDS)
The Kinetic Drawing System for Family and School (KDS) (Knoff & Prout, 1985) is
designed to individually assess the frequency of a child's difficulties in the home and
school settings. The format allows the examiner to understand the overlap of behav-
iors and attitudes in both settings as well as to assess the source of certain attitudes
and behaviors. The KDS can be administered to clients ages 5-20 years. Clients are
asked to draw separate pictures of both family and school situations. Examiners are
asked to stress that each person in the picture should be doing something. There is
no time limit for this task but most complete the task in 20 to 40 minutes. Pictures
are assessed based on five categories: (1) actions of and between figures; (2) figure
characteristics; (3) position and distance of figures, and barriers between them; (4)
style; and (5) symbols (see Figures 8.7 and 8.8).
In a review of the manual, Cundick (1989) concluded reliability and validity
data are inadequate and that the studies provided are not related to the test protocol.
Weinberg (1989) stated that if administrators are well trained and scoring criteria
are clearly defined, good interrater reliability coefficients can be attained; however,
test-retest reliability coefficients are low. Weinberg concluded that although this test
is a wonderful icebreaker and rapport-building tool, one cannot recommend this as
an interpretive assessment yielding reliable and valid scores.
Forer Structured Sentence Completion Test (FSSCT)
The Forer Structured Sentence Completion Test (FSSCT) (Forer, 1967) is a 100-item
test used to determine a client's attitudes and views of the world by finding out
information about a client's relationships and dynamics, and the client's use of
evasiveness, individual differences, and defense mechanisms (Western Psychological
Services, 2003a). Separate forms are available for men, women, adolescent girls, and
adolescent boys. Administration of the test takes about 15 to 20 minutes and re-
quires a Level B qualification. A Checklist and Clinical Evaluation Form provides
evaluation tools that help the examiner to group clients into one of four categories:
(1) Interpersonal Figures; (2) Wishes; (3) Causes of Own (feelings and behaviors);
and (4) Reactions (to others) (Benet, 2005). Reliability, validity, and normative
information are not given in the manual. Example prompts might include "My father
makes me feel ____"; "I like to talk to my friends about ____"; and "Others often
think that I ____."
Figure 8.7 Kinetic Family Drawing by a nonclinical teenage girl
Figure 8.8 Kinetic School Drawing by a nonclinical teenage girl
SUMMARY/CONCLUSION
This chapter has provided an introduction to the information that professional
counselors need to engage in personality assessment. Both objective and projective
personality assessment were addressed. Objective methods typically involve trait ap-
proaches, and the five-factor model of Costa & McCrae currently enjoys popularity
among personality researchers. Numerous structured personality inventories are
available for use by professional counselors, including the NEO PI-R, CPI, PAI, and
MBTI.
Projective assessments present clients with ambiguous stimuli, and professional
counselors observe and assess how clients construct meaning and respond to these
stimuli. Projective techniques generally yield lower score reliability and validity than
objective personality measures. Projective techniques can be classified as association,
picture-story, verbal completion, choice arrangement, and production-expression
techniques.
KEY TERMS
association technique
choice arrangement techniques
drawing technique
personality
personality assessment
picture-story construction techniques
production-expression techniques
projective assessment
projective hypothesis
traits
verbal completion techniques
CHAPTER 9
Behavioral Assessment
by Carl J. Sheperis, R. Anthony Doggett, Masanori Ota,
Bradley T. Erford, and Carol Salisbury
This chapter provides a general understanding of behavioral assessment proce-
dures for professional counselors. More specifically, the chapter provides a gen-
eral definition of behavioral assessment as well as specific guidelines for con-
ducting behavioral assessment; details the two kinds of behavioral assessment (direct
behavioral assessment and indirect behavioral assessment) and common
techniques used within these two assessment categories; and gives a brief overview of the
most commonly used behavioral assessment instruments.
WHAT IS BEHAVIORAL ASSESSMENT?
When children talk out loud during a class or see others become aggressive and rush
to fight, professional counselors may raise the following questions: Why does this
behavior occur? How can the behavior be changed? Behavioral assessment is a use-
ful methodology to clearly answer these questions.
Behavioral assessment is generally defined as "the identification of meaningful
response units and their controlling variables for the purposes of understanding and
of altering behavior" (Nelson, 1985, p. 45). Because a behavior occurs through an
interaction between an individual and the person's environment, professional coun-
selors use behavioral assessment to evaluate a particular behavior and the context in
which it occurs (e.g., stimuli or events affecting the behavior). Behavioral assessment,
along with other traditional assessment approaches (e.g., intelligence tests, personal-
ity tests), is widely used in various applied settings, such as schools, counseling cen-
ters, and other clinical venues.
Defining Behavior
From a behavioral standpoint, all behaviors are seen as a direct result of external and
environmental stimuli. Although behaviors can be indicators of internal difficulties,
the professional counselor cannot readily measure or see those internal struggles.
Thus a key concept in behavioral assessment is that the target behavior (i.e., the be-
havior the client is trying to change) must be directly observable. For example, mil-
lions of people struggle to lose weight each year, and new diets emerge on the best-
seller list all the time. While the professional counselor may personally know what it
is like to have an internal battle over whether to eat a certain dessert, it would be
hard for a bystander to see or measure that internal struggle in a client. However,
through behavioral assessment, the professional counselor can identify a certain be-
havior that the client is trying to change (i.e., snacking on high-fat foods), measure
the number of times that the client snacks, the amount of food that is consumed,
and the amount of weight that is gained or lost. The professional counselor can then
develop an intervention that is clearly tied to the target behavior and accurately
measure changes in the behavior.
To obtain a clear picture of what the professional counselor and client are try-
ing to accomplish, an operational definition of a target behavior is addressed at the
beginning of behavioral assessment, using observable and measurable terms. A well-
developed operational definition contains an objective, concrete, and quantitative
description, with which anyone can clearly identify the observed behavior. In other
words, an operational definition must pass the "stranger test" — that is, any behavior
that one defines should be clearly understandable to a stranger. That stranger should
be able to pick up the definition and be able to observe someone without difficulty.
For example, it is not observable or measurable to state, "Sam continually snacks on
inappropriate foods," because it is not clear what "inappropriate" and "continually"
specifically mean in this situation. However, it is much clearer if an inappropriate
food is defined as "any food item containing more than 10 grams of carbohydrates,"
or "any food item containing more than 5 grams of fat." A good operational defini-
tion must also pass the "dead man test" — that is, the target behavior should not be
something that only a dead man could do. If a professional counselor developed an
intervention plan with the goal that Sam would not eat, that counselor would prob-
ably lose his or her license or be sued. It is impossible to ask someone not to eat. The
person would have to be dead to follow this guideline. In short, behavioral goals and
objectives should be MOP&D: measurable, observable, positive, and doable. Thus
an operational definition is crucial to minimize inferences during observation
(Sattler, 2002). To obtain reliable and valid data, it is important to maintain the same
operational definition throughout the assessment process.
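The snacking example above can be made concrete with a short sketch. It assumes, for illustration only, that the counselor adopts the fat-based definition ("any food item containing more than 5 grams of fat") and that the client keeps a log of snacks. The Snack record, the function names, and the fat values in the sample log are hypothetical and are not drawn from any published instrument.

```python
from dataclasses import dataclass


@dataclass
class Snack:
    """One logged snack (hypothetical record layout)."""
    name: str
    fat_grams: float


def meets_definition(snack: Snack, fat_limit_grams: float = 5.0) -> bool:
    """Apply the operational definition: more than `fat_limit_grams` of fat."""
    return snack.fat_grams > fat_limit_grams


def count_target_behavior(log: list[Snack], fat_limit_grams: float = 5.0) -> int:
    """Count logged snacks that meet the operational definition."""
    return sum(meets_definition(s, fat_limit_grams) for s in log)


# Illustrative one-day log; the fat values are invented for the example.
day_log = [Snack("apple", 0.3), Snack("potato chips", 10.0), Snack("candy bar", 12.0)]
print(count_target_behavior(day_log))  # 2 snacks exceed the 5-gram limit
```

Because the threshold is an explicit parameter, any two observers applying the same definition to the same log arrive at the same count, which is what the "stranger test" asks of a definition.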
Think About It 9.1 What behavior in your life would you like to
change? How could this behavior be operationally defined? Using this defini-
tion, what new behavioral goal could you set?
Guidelines for Conducting Behavioral Assessment
It should be noted that the professional counselor does not target personality traits
or psychopathology through behavioral assessment, because these things cannot
readily change through intervention. For example, a professional counselor can
change the frequency with which a child displays tantrums (behavior) but cannot change
autism (a disorder), which some people might think causes the tantrums. Thus,
through behavioral assessment, the professional counselor focuses on the function
of particular behaviors that are within the client's voluntary control rather than a
diagnosis.
Behaviors often stem from interactions between an individual and the individ-
ual's environment. Thus, instead of examining a behavior in isolation, the profes-
sional counselor must consider environmental variables affecting the behavior (e.g.,
place, people, time, stimulus). Antecedents and consequences (events preceding and
following a behavior, respectively) and the characteristics of behavior (e.g., function,
magnitude, frequency, rate, duration, latency) are often measured in behavioral as-
sessment. For example, a great deal of attention has been focused on school violence
in recent years. On April 20, 1999, Eric Harris and Dylan Klebold killed 12 students
and a teacher, wounded 23 other people, and then killed themselves at
Columbine High School in Littleton, Colorado. While it is clear that both students
were disturbed, it is important to understand the environmental variables and an-
tecedents leading to this tragedy. Harris kept a journal that helps us to understand
the environment's influence on his behavior. According to USA Today's website
(Killer's diary reveals plans, 2001), Harris's journal paints a picture of an isolated
teen who was angry about being rejected. In his journal, Harris wrote, "I hate you
people for leaving me out of so many fun things. . . . You people had my phone #,
and I asked and all, but no no no no no don't let the weird looking Eric kid come
along." Because one can now look at some of the ways that rejection and isolation
affected Harris's behavior, schools across the country have implemented both pre-
ventive (e.g., peer counseling) and response measures (e.g., school safety plans). If
we only look at Harris and Klebold as disturbed teens and ignore the environmen-
tal factors leading to the tragedy, we would be unable to prevent future crises of this
nature.
In conducting behavioral assessment, it is also important to know that every be-
havior has its own purpose or function. When behavioral assessment is used to iden-
tify a function, it is called functional behavioral assessment (FBA). In accordance
with the Individuals with Disabilities Education Act Amendments of 1997, FBAs
and behavior plans are specifically required in schools for children who have a spe-
cial education ruling and are subject to disciplinary action.
Applied behavior analysis researchers have identified four main variables that
may maintain or reinforce the performance of target behaviors: (a) attention, (b) tan-
gible, (c) escape, and (d) sensory stimulation (Alberto & Troutman, 2003; Iwata et
al., 1994). It should be noted that even if the topographies (i.e., what a behavior
looks like) of two behaviors are the same, the functions of the two behaviors might
be different. For example, when a child screams more after a teacher says, "Be quiet
and look at me," the function may be attention from a teacher. However, escape may
be the function if a child often screams when difficult academic tasks are given dur-
ing a class. Also, one behavior may have more than one function (e.g., the functions
of the child's screaming may be both teacher attention and escape from difficult
tasks). Thus, once a function is hypothesized in functional behavioral assessment, it
should be experimentally verified through functional analysis using a single-subject
research design. Functional analysis is an experimental manipulation of environmen-
tal variables (e.g., antecedents, consequences) to establish a functional relationship
between a behavior and environmental variables. Discussion of functional analysis
and single-subject design is beyond the scope of this chapter, so interested readers
are referred to Alberto and Troutman (2003) and Miltenberger (2004).
METHODS OF BEHAVIORAL ASSESSMENT
Direct Assessment
Behavioral assessment is divided into two categories: direct assessment and indirect
assessment. In direct assessment, the professional counselor assesses events occur-
ring here and now through direct observation and client self-monitoring. In indirect
assessment, the professional counselor assesses past events using behavioral inter-
views, and self-report and informant-report behavioral checklists and rating scales.
Through direct observation, a professional counselor observes a client's behavior in
a natural setting and records it using a recording sheet. For example, a professional
school counselor may observe a child to assess how many times the child leaves the
seat during a class or talks to friends during a physical education period or recess on
the playground. Behaviors are often recorded using the following four methods: (1)
narrative recording, (2) interval recording, (3) event recording, and (4) ratings
recording (Sattler, 2002). This discussion is limited to the two most prominent
methods: narrative and interval recording.
Narrative recording
In narrative recording (see Table 9.1), the professional counselor records what is ob-
served anecdotally. The professional counselor may observe not only a behavior, but
also antecedents and consequences. Such observation, called ABC narrative recording
(for antecedent, behavior, and consequence), is used to identify relationships be-
tween a behavior and environmental variables (Bijou, Peterson, & Ault, 1968). It
can be useful to add an additional category to narrative recording: function. While
it is important to know the antecedents, behaviors, and consequences, it is equally
important to determine the functions of a behavior.
Table 9.1 ABC narrative observation format
A (antecedent)   A wife asks her husband to help with the household chores.
B (behavior)     Husband pouts (i.e., speaks in short sentences, complains about the task,
                 moves slowly during the task).
C (consequence)  Wife tells husband, "Forget it. I'll just do the chores."
F (function)     Husband sought to escape task.
Interval recording
There are three primary methods of interval recording: (1) whole-interval recording;
(2) partial-interval recording; and (3) momentary time sampling. In each interval
recording method, the recording time is equally divided into intervals (e.g.,
10-second intervals), and an observer records if a behavior occurs during each
interval. Specifically, in whole-interval recording, an observer marks each interval on
a recording sheet whenever a behavior occurs throughout the interval, whereas in
partial-interval recording, an observer marks each interval whenever a behavior
occurs at least once anytime in the interval. In momentary time sampling, an observer
marks each interval whenever a behavior occurs at the beginning or end of the
interval. It should be noted that the occurrence of a behavior may be underestimated
in whole-interval recording, whereas it may be overestimated in partial-interval
recording.
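The difference among the three interval recording methods can be illustrated with a minimal sketch. It assumes the observer already has a timeline of behavior episodes as (begin, stop) times in seconds, that the session length is a multiple of the interval length, and that momentary time sampling checks the first instant of each interval (the text allows either the beginning or the end). The function names and the sample episodes are hypothetical.

```python
def seconds_of_behavior(start, end, episodes):
    """Seconds of behavior falling inside the interval [start, end)."""
    return sum(max(0, min(end, stop) - max(start, begin)) for begin, stop in episodes)


def score_intervals(episodes, session_seconds, interval_seconds, method):
    """Return one True/False mark per interval for the chosen recording method.

    `episodes` is a list of non-overlapping (begin, stop) times in seconds.
    """
    marks = []
    for start in range(0, session_seconds, interval_seconds):
        end = start + interval_seconds
        if method == "whole":
            # Behavior must fill the entire interval.
            marks.append(seconds_of_behavior(start, end, episodes) == interval_seconds)
        elif method == "partial":
            # Behavior must occur at any point in the interval.
            marks.append(seconds_of_behavior(start, end, episodes) > 0)
        elif method == "momentary":
            # Assumption: the observer checks at the first instant of the interval.
            marks.append(any(begin <= start < stop for begin, stop in episodes))
        else:
            raise ValueError(f"unknown method: {method}")
    return marks


# Hypothetical 60-second observation: behavior from 5 to 25 s and from 42 to 44 s.
episodes = [(5, 25), (42, 44)]
for method in ("whole", "partial", "momentary"):
    marks = score_intervals(episodes, 60, 10, method)
    print(method, sum(marks), "of", len(marks), "intervals marked")
```

In this invented example the behavior occupies 22 of 60 seconds (about 37% of the session). Whole-interval recording marks 1 of 6 intervals (about 17%), partial-interval recording marks 4 of 6 (about 67%), and momentary time sampling marks 2 of 6 (about 33%), which shows the under- and overestimation described above.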
Although direct observation yields clear descriptions of behavior, its
characteristics, and environmental variables, some cautions are necessary. First, an
observer may be biased. For example, if a professional counselor is attending to more
than one behavior simultaneously, the professional counselor may pay more atten-
tion to some of the behaviors, but may miss others. Furthermore, because of habit-
uation, the observer may unintentionally change the operational definition or crite-
rion of a behavior (e.g., criterion frequency or duration), a factor called observer drift.
To prevent observer drift, interobserver agreement should be checked (for each type
of interobserver agreement and its calculation, see Kazdin, 1982). Also, an observer
should have periodic trainings to recall the operational definition, criteria of a be-
havior, and observation procedures.
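As a rough illustration of an interobserver agreement check, the sketch below computes interval-by-interval percent agreement between two observers who scored the same session. This is only one commonly used index and is shown here as an assumption; the specific types of agreement and their calculation are those described in Kazdin (1982). The function name and the sample records are hypothetical.

```python
def interval_agreement(observer_a, observer_b):
    """Percent of intervals on which two observers made the same mark."""
    if len(observer_a) != len(observer_b):
        raise ValueError("both observers must score the same number of intervals")
    if not observer_a:
        raise ValueError("at least one interval is required")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100.0 * agreements / len(observer_a)


# Hypothetical marks for ten intervals (True = behavior recorded).
observer_a = [True, True, False, False, True, False, False, True, False, False]
observer_b = [True, False, False, False, True, False, True, True, False, False]
print(interval_agreement(observer_a, observer_b))  # observers differ on 2 of 10 intervals
```

A falling agreement percentage across sessions is one signal that an observer's working definition may have drifted and that retraining is due.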
Second, clients may change a behavior if they know they are being observed, a
factor called reactivity. For example, if children know they are being observed to de-
termine the frequency of talking without permission during a class, they may try to
remain quiet and follow the classroom rules. Clearly, in this case, an observer cannot
obtain data truly reflecting the behavior (i.e., talking without permission). An ob-
server may reduce reactivity by staying in the observation setting several times be-
fore recording observation data so that people become habituated to the observer.
With attention to the potential pitfalls associated with observation procedures, direct
observation is often able to clearly draw the whole picture of behavior in natural set-
tings. Table 9.2 provides an example of an interval recording observation with rele-
vant operational definitions.
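The percentage calculation that accompanies the interval recording sheet in Table 9.2 can be sketched in a few lines. The sketch assumes 60 observation intervals per session and three target behaviors (OT, OS, IV), so the total is divided by 180, as on the sheet. The interval counts below are invented for illustration, and the function name is hypothetical.

```python
def percent_of_intervals(marked, total_intervals=60):
    """Marked intervals as a percentage of all intervals observed."""
    return marked * 100 / total_intervals


# Hypothetical number of intervals in which each target behavior was marked.
counts = {"OT": 18, "OS": 6, "IV": 12}

per_behavior = {code: percent_of_intervals(n) for code, n in counts.items()}
total_disruptive = percent_of_intervals(sum(counts.values()), 60 * len(counts))

for code, pct in per_behavior.items():
    print(f"{code}: {pct:.1f}% of the intervals")
print(f"Total Disruptive Behavior: {total_disruptive:.1f}% of the intervals")
```

With these invented counts, off-task behavior appears in 30% of the intervals, out-of-seat in 10%, inappropriate vocalizations in 20%, and total disruptive behavior in 20% of the 180 behavior-intervals.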
Self-monitoring
Self-monitoring is a method by which clients can observe and record their own be-
havior. Self-monitoring is an effective way to monitor infrequent behaviors (e.g.,
binge eating, self-injury) and internalizing problems (e.g., negative thoughts, anxiety,
308 Chapter 9
Table 9.2 Sample interval recording sheet with relevant operational definitions
Sample interval recording sheet
Behavior 10 min. 20 min. 30 min. 40 min. 50 min 60 min
Antecedents
Targets
Consequences
Operational definitions for interval recording sheet
ANTECEDENTS
D: Demand — Instruction to complete educational work or an assignment given to complete ("Get to work," "Turn your books to
page . . . ," teacher hands out a worksheet).
C: Command — Behavioral instruction ("Sit down," "Be quiet," "Go to your desk," "Stop talking," "Look at me").
T: Transition — Moving from one location to another in the classroom or school, switching from one assignment to another
(walking from the classroom to the lunchroom, moving from a desk to the reading area, switching from a math assignment
to a spelling assignment).
TARGET BEHAVIORS
OT: Off-task — Student's eyes are not directed toward the teacher for more than 5 seconds during a lecture, instruction, or
assignment.
OS: Out-of-seat — Student's bottom breaks contact with the seat or floor for more than 5 seconds.
IV: Inappropriate vocalizations — Student talks to teacher or peers without permission, student argues with teacher or peers,
student makes noises (whistling, howling, humming, clicking sounds).
CONSEQUENCES
El A: Escape! avoidance — Student is allowed to refrain from working on or completing the assignment, teacher takes assignment
away, teacher does not make student comply with (follow through on or complete) a command.
Teacher Attention
//' Teacher positive attention — Smiles, praise statements, proximity following appropriate behavior, physical touch for appropriate
behavior (pat on the shoulder, "Good job").
IN: Teacher negative attention — Frowns, reprimands, redirections, interruptions, proximity following problem behavior, physical
touch fot problem behavior ("Stop it!" "How many times have I told you to . . . ," tap on shoulder for talking without
permission).
Peer Attention
/'/'. Peer positive attention — Smiles, praises, proximity, physical touch for appropriate behavior.
I'N: Peer negative attention — Frowns, put-downs ("You're so . . . "), name calling ("dummy, butthead"), proximity following
problem behavior, physical touch following problem behavior (pushing, hitting, kicking, touching).
Calculation of Performance of Behavior From Interval Recording Sheet
OT: + 60x100= % of the intervals
OS:
60 x 100= % of i he interval
IV: + 60x100= % of the intervals
loi.il Disruptive Behavior: + 180 x 100= % of the intervals
Indirect Assessment
Behavioral Assessment 309
fear), which are difficult for others to observe (Sattler, 2002). For example, a client
who has depression may record any negative thoughts (e.g., "Although I study hard,
I am not smart enough to pass this course") every 30 minutes for a certain number of
hours. There are two matters to consider for self-monitoring. First, training a client to
effectively monitor behavior is critical, because the client needs to identify a target be-
havior precisely and record it appropriately (e.g., every 1 minute). Second, to increase
accuracy, it is effective for a professional counselor to monitor a client's behavior si-
multaneously and subsequently compare data with the client's self-monitoring. Also,
periodic feedback regarding procedures and accuracy of self-monitoring may further
promote accuracy.
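The text does not say how the counselor's record and the client's self-monitoring record should be compared. One common choice, assumed here, is interval-by-interval percent agreement. The sketch below uses hypothetical data and a hypothetical function name; it is an illustration, not a prescribed procedure.

```python
def percent_agreement(client_record, counselor_record):
    """Interval-by-interval percent agreement between two records.

    Each record is a sequence of booleans, one per interval: True if the
    target behavior was recorded in that interval, False otherwise.
    """
    if len(client_record) != len(counselor_record):
        raise ValueError("records must cover the same intervals")
    if not client_record:
        raise ValueError("records must not be empty")
    agreements = sum(a == b for a, b in zip(client_record, counselor_record))
    return agreements * 100 / len(client_record)


# Hypothetical data: ten intervals recorded by the client and by the counselor.
client = [True, False, False, True, True, False, False, True, False, False]
counselor = [True, False, True, True, True, False, False, False, False, False]

agreement = percent_agreement(client, counselor)
print(f"Agreement: {agreement:.0f}%")
```

Low agreement would suggest that the target behavior needs a more precise definition or that the client needs further training before the self-monitoring data are relied on.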
The behavioral interview
The purpose of a clinical interview is to assess a client's global problems and related
history (e.g., family, medical, psychological, educational) for the purpose of arriving
at a diagnosis (Gresham, 1984). In contrast, the purpose of a behavioral interview
is to identify a target behavior; to analyze environmental variables affecting the be-
havior; and to plan, implement, and evaluate an intervention. Thus a behavioral in-
terview is a solution-focused interview that links assessment to intervention. A pro-
fessional counselor may interview not only a client, but also significant others (e.g.,
parent, caregiver, spouse, employer, teacher, peer) to obtain multidimensional infor-
mation about a client's problems from each individual's perspective. For example, a
wife may report that her husband appears distracted and depressed at home, but
peers may report that the man is upbeat and active at work. Further information
from the client's children reveals that the parents have been arguing more over the
last few months. While the root of the problem is not completely clear yet, it can be
determined that the man's behavior is limited to one setting. Thus a professional
counselor can now focus further assessment efforts around the marital relationship
and design more effective interventions because of the multidimensional informa-
tion derived from the interview.
Because of their brevity, self-report and informant-report behavioral checklists and
rating scales are commonly used methods of indirect assessment. In a self-report, a
client may either respond to written questions or directly answer a professional coun-
selor's questions regarding the nature of the client's concerns. However, in an inform-
ant report, significant others provide their perspective of the client's problems. For
example, using a self-report, a professional counselor may ask clients to rate the qual-
ity of their relationships with immediate family members. Through an informant re-
port, the professional counselor would ask significant others in the client's life to rate
the client's relationships with immediate family members. While these questions are
essentially the same, the results could be vastly different. Thus it is very useful to
compare the results of a self-report and an informant report.
As in a behavioral interview, eliciting useful information in a self-report or
informant report often depends on the professional counselor's skills of verbal
communication and strategic questions. For some clients, such as children or in-
dividuals with disabilities, the informant report plays an especially important role
in obtaining useful information on the client's problems. However, for reasons of
confidentiality, the client's consent is necessary before obtaining an informant re-
port. The professional counselor should be aware that responses on a self-report or
an informant report might not reflect actual problems precisely, because the re-
sponses represent human memories of past events. Intentionally or unintention-
ally, some clients or significant others may over- or underreport the severity of the
client's problems.
Behavioral checklists and rating scales offer a more standardized means of indirect
assessment and often have both self-report and informant-report versions available.
Many of the typical checklists and rating scales have a Likert scale format (i.e., rate
a behavior on a scale of 1 to 5) or some variation of this response style. For example,
to the statement "I did not sleep last night," there may be three response choices,
where 0 represents Not at All, 1 represents Somewhat True, and 2 represents Very
True. While direct observations and interviews provide reliable information, stan-
dardized rating scales can provide normative information allowing professional
counselors to compare results of an individual to the population for which the in-
strument was developed.
When using rating scales or checklists, professional counselors should be cau-
tious of halo effects (e.g., tendency to rate a high-performing student as well-behaved
regardless of actual behaviors observed), and central tendency error (e.g., tendency to
respond with moderate or centrist descriptions rather than toward the extremes of a
rating scale). For example, some people may respond more mildly or severely than
their actual level (e.g., they may choose a number between 2 and 4 on a 5-point
Likert scale). Clients may respond this way because they are embarrassed about cer-
tain symptoms, have ulterior motives for representing themselves in a more positive
light, do not really understand the questions being presented, do not have the self-
awareness to respond accurately, or view the extreme rating choices as very extreme.
Thus, as is the case with any aspect of the assessment process, it is important for the
professional counselor to provide clear instructions, adequate details about the pur-
pose of the assessment, and information about the instrument, and to answer any
questions the client or informant may have. Despite the potential weaknesses with
behavioral checklists and rating scales, they are easy, inexpensive, and not time con-
suming. Also, some have been shown to reliably and validly screen or identify spe-
cific areas of disorders. For example, the Child Behavior Checklist/6-18 (CBCL/6-18)
(Achenbach & Rescorla, 2001) assesses the behavioral problems and adaptive functioning
of children ages 6 to 18 years. The CBCL/6-18 has 118 specific problem
items (and an additional 20 competence items). Each item consists of a 3-point scale,
on which 2 represents Very True or Often True, 1 represents Somewhat or
Sometimes True, and 0 represents Not True (As Far As You Know). Normally a parent
or guardian can complete the CBCL/6-18 in approximately 15 minutes. Updated
information on the CBCL/6-18 and other Achenbach products is available on the
ASEBA website (www.aseba.org/products/forms.html).
While professional counselors are strongly encouraged to follow the best prac-
tices for assessment as outlined in this text, the fact remains that assessment can be
a time-consuming process. The reality is that professional counselors are often re-
stricted in the amount of time they can dedicate to assessment. Thus it is important
to have various methods of gaining reliable and valid information about a client's
presenting problems in a relatively short time. Self-report and informant-report be-
havioral checklists and rating scales are practical assessment tools to identify problem
behaviors and to obtain multidimensional information from clients and their signif-
icant others. When selected thoughtfully, respondents to checklists and rating scales
provide valuable, accurate, cost-effective, and time-effective insight into client be-
haviors from the naturalistic settings.
Think About It 9.2 Why would it be beneficial for a professional coun-
selor to use both direct and indirect assessment approaches when evaluating
a client?
BEHAVIORAL RATING SCALES AND INVENTORIES
USED IN COUNSELING
The lines between clinical, behavioral, and personality assessments are quite blurred,
and the overlap in functions is sometimes pronounced. While there are innumerable assessment
tools available for use by professional counselors, the tests and inventories
that follow in this chapter are among the most commonly used for indirect assessment
of behaviors. An overview of the format and psychometric properties of each instru-
ment is provided. Hopefully, these reviews will help in the selection process of other
instruments as well. It is important to note that only a few of the hundreds of avail-
able rating scales are reviewed below, but the skills in understanding and using tests
garnered from this text will help the reader evaluate, select, and use other instruments.
Conners' Rating Scales-Revised (CRS-R)
The Conners' Rating Scales — Revised (CRS-R) (Conners, 1997) is a multi-informant
inventory designed to assess psychopathology and problem behavior in children and
adolescents ages 3 to 17 years. It can be completed by parents and teachers, and can
also be self-reported by adolescents. The CRS-R is available in four primary formats
based on length and respondent: (1) a short form (27 items) of the Conners' Parent
Rating Scales — Revised (CPRS-R:S); (2) a short form (28 items) of the Conners'
Teacher Rating Scale — Revised (CTRS-R:S); (3) a long form of 80 items for parents
(CPRS-R:L); and (4) a long form of 59 items for teachers (CTRS-R:L). An adolescent
self-report form, the Conners-Wells Adolescent Self-Report Scale (CASS), is available in
long (CASS:L) and short (CASS:S) forms (Conners & Wells, 1997), and an adult
form, the Conners' Adult ADHD Rating Scales (CAARS) (Conners, Erhardt, &
Sparrow, 1999), is also available. Items measure such facets as Oppositional, Social
Problems, Cognitive Problems/Inattention, Psychosomatic, Hyperactivity, Symptom
Subscales, Anxious-Shy, ADHD Index, Perfectionism, and a Conners' Global Index
(Conners, 1997). In addition, the long forms provide two DSM-IV subscales
(Inattention, Hyperactive-Impulsive), scored as a straight symptom count or in comparison
to norms. Sample items from the CPRS-R:S include "Argues with adults,"
"Irritable," and "Deliberately does things that annoy other people." The CRS-R can
be completed using pencil and paper in 5 to 10 minutes for the short version and 10
to 20 minutes for the long version. This inventory can be completed by computer,
remotely, or over the telephone, and is available in English, Spanish, and French-
Canadian languages.
The normative sample for the CRS-R consisted of over 8,000 cases in a large
database compiled from over 200 collection sites throughout North America
(Conners, 1997). This inventory requires Level B instrument qualifications and is
written at the 10th-grade reading level for the parent and teacher forms, and at the
6th-grade level for the long-form adolescent self-report (CASS:L). Subscale internal
consistency coefficients are satisfactory, ranging from 0.73 to 0.94 for the CPRS-R:L;
0.86 to 0.94 for the CPRS-R:S; 0.77 to 0.96 for the CTRS-R:L; 0.88 to 0.95 for the
CTRS-R:S; and 0.75 to 0.92 for the CASS:L (Conners, 1997). Raw scores are converted
into T scores and percentiles. The various versions of the CRS-R are helpful
because they display AD/HD-type behaviors and track therapeutic progress
(Gianarris, Golden, & Greene, 2001; Townsend, Baylot, & Erford, 2006). Hand
scoring of the protocols is easy using pressure-sensitive carbonless paper, but com-
puter scoring and mail or fax scoring are also available. Clinicians need to be cautious
when using this inventory for African American clients, because this group was un-
derrepresented in the parent sample. It is an excellent screening device for AD/HD
and general childhood psychopathology.
Attention Deficit Disorders Evaluation Scale-Third Edition
(ADDES-3)
The Attention Deficit Disorders Evaluation Scale — Third Edition (ADDES-3)
(McCarney & Arthaud, 2004a, 2004b) was designed to assess symptoms of AD/HD
(inattentiveness, hyperactivity, impulsivity) in children and adolescents ages 4 to 18
years. It is available in two versions: a Home Version of 46 items for parent report
(ADDES-3-HV) and a School Version of 60 items for teacher report (ADDES-3-SV).
Each version consists of two subscales: Inattentive and Hyperactive-Impulsive. A
child's demonstration of a given behavior is rated on a 6-point scale: 0 — Not
Developmentally Appropriate for Age; 1 — Not Observed; 2 — One to Several Times
per Month; 3 — One to Several Times per Week; 4 — One to Several Times per Day;
5 — One to Several Times per Hour. Such a rating system allows for substantial speci-
ficity in determining the frequency of display of a given behavior (Demaray &
Elting, 2003). The ADDES-3 is a Level B test and generally requires 15 to 20 min-
utes to administer and score. Scoring can be accomplished by hand or computer.
Raw scores can be converted to scaled scores (M = 10; SD = 3) and percentile ranks.
Lower scaled scores or percentile ranks indicate higher levels of inattentiveness or hy-
peractivity of the client (Erford, 2006).
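To see how a scaled score on a metric with M = 10 and SD = 3 relates to a percentile rank, the following sketch converts scaled scores to z scores and then to percentile ranks. It assumes the scaled scores are normally distributed, which is an assumption made here for illustration only; in practice the percentile ranks are read from the norm tables in the test manual.

```python
from statistics import NormalDist

# Scaled-score metric stated in the text: mean 10, standard deviation 3.
SCALED_SCORES = NormalDist(mu=10, sigma=3)


def approximate_percentile_rank(scaled_score):
    """Percentile rank implied by a normal curve (an illustrative assumption)."""
    return SCALED_SCORES.cdf(scaled_score) * 100


for scaled_score in (4, 7, 10, 13):
    z = (scaled_score - SCALED_SCORES.mean) / SCALED_SCORES.stdev
    pr = approximate_percentile_rank(scaled_score)
    print(f"scaled score {scaled_score:>2}: z = {z:+.2f}, percentile rank about {pr:.0f}")
```

On this instrument the direction matters: a scaled score of 7 falls one standard deviation below the mean, and because lower scores indicate more inattentiveness or hyperactivity, it signals more concern than a score of 13.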
The standardization sample generally conformed to the 2000 U.S. Census pop-
ulation demographics. However, the School Version had a lower percentage of White
participants (62.42%) than the national sample (71.89%), and both the School
Version and the Home Version contained higher numbers of Black participants
(24.64% and 15.13%, respectively) versus the national sample (12.14%). The
ADDES-3-SV age category coefficient alphas for the Inattentive subscale ranged
from r = 0.89 to r = 0.98 (median = 0.98); the Hyperactive-Impulsive subscale
ranged from r = 0.89 to r = 0.99 (median = 0.98); and the overall quotients ranged
from r = 0.98 to r = 0.99 (median = 0.99). The ADDES-3-HV coefficient alphas for
the age categories of the Inattentive subscale ranged from r = 0.90 to r = 0.97 (me-
dian = 0.96); the Hyperactive-Impulsive subscale ranged from r = 0.95 to r = 0.97
(median = 0.96); and the overall quotient ranged from r = 0.96 to r = 0.98 (median
= 0.98). However, it is not stated whether coefficients were derived from raw scores
or standard scores. If raw scores served as the basis for reliability coefficients, the es-
timates would be inflated (Erford, 2006). Therefore, further analysis using standard
scores should be conducted.
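Coefficient alpha, the internal consistency statistic reported above, can be computed from item-level ratings with the standard Cronbach formula: alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below applies that formula to a small set of hypothetical ratings; the data are invented and are not ADDES-3 data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's coefficient alpha.

    ratings: list of respondents, each a list of k item ratings.
    """
    k = len(ratings[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    if any(len(row) != k for row in ratings):
        raise ValueError("every respondent must rate every item")
    item_columns = list(zip(*ratings))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical ratings: five children rated on four items.
ratings = [
    [1, 1, 2, 1],
    [3, 3, 3, 2],
    [2, 2, 2, 2],
    [5, 4, 5, 4],
    [4, 4, 3, 4],
]

alpha = cronbach_alpha(ratings)
print(f"coefficient alpha = {alpha:.3f}")
```

Alpha rises when items vary together across respondents, so anything that adds shared variance unrelated to the construct, such as pooling raw scores across ages when raw scores change with age, can push the coefficient upward.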
Criterion-related validity studies provided in the manual used the ADDES-2,
which contained very similar items in most regards. Bussing, Schuhmann, and Belin
(1998) found the ADDES-2 produced a significant number of false positives and
false negatives and that the results for girls were more accurate than those for boys.
Overall, the psychometric characteristics of the ADDES-3 appear adequate for
screening symptoms of AD/HD. Ancillary publications have been developed, in-
cluding The Parents' Guide to Attention Deficit Disorders — Second Edition (McCarney
& Baker, 1995) and the Attention Deficit Disorders Intervention Manual — Second
Edition (McCarney, 1994). Klecker (2001, p. 91) was quite critical of these supple-
ments, however, stating that the materials were "too fragmented to be either read-
able or helpful. The supplements would be more useful with age-specific scenarios
and practical examples."
Behavior Assessment System for Children (BASC)
The Behavior Assessment System for Children (BASC) (Reynolds & Kamphaus, 1992;
1998) was designed to aid in the identification and diagnosis of emotional and
behavior disorders in children and adolescents ages 2.5 to 18 years. It is a multi-
informant, multi-assessment battery composed of five components: (1) Teacher
Rating Scales (TRS); (2) Parent Rating Scales (PRS); (3) Self-Report of Personality
(SRP); (4) Structured Developmental History (SDH); and (5) Student Observation
System (SOS). Items on the TRS and PRS utilize a 4-point frequency rating rang-
ing from Never to Almost Always. These components yield 4 composite scores
(Internalizing Problems, Externalizing Problems, Adaptive Skills, and the
Behavioral Symptoms Index) as well as 10 scale scores (Aggression, Hyperactivity,
Anxiety, Depression, Somatization, Attention Problems, Atypicality, Withdrawal,
Adaptability, and Social Skills). For each component, administration and scoring
range from 10 to 30 minutes. The standardization samples generally conformed to
the U.S. population demographics.
The internal consistency coefficients for the TRS composites are generally high,
ranging from r = 0.88 to r = 0.95 for the younger preschool age group, and from
r = 0.90 to r = 0.96 for the older preschool age group. Somewhat lower were the co-
efficients of the scales for both the younger (r = 0.71-0.92) and the older preschool
age groups (r = 0.78-0.90). Test-retest studies, with a maximum of 2 months be-
tween administrations, yielded correlations ranging from r = 0.90 to r = 0.95 (com-
posites) and from r = 0.82 to r = 0.95 (scales) (Erford, 2006). Validity evidence for
the BASC is based on factor analysis of its theoretical model, correlations with sim-
ilar tests, and correlation matrices between the TRS and PRS. The BASC psychome-
tric characteristics are quite sound, and it appears a robust measure for screening
emotional and behavior symptoms (Witt & Jones, 1998), but Erford (2006) urged
the use of subscale results only for hypothesis generation and validation, not diagno-
sis, due to lower technical adequacy. Sandoval (1998) also indicated that the stan-
dardization sample overrepresented children from Catholic and university-affiliated
schools. Finally, Wilder and Sudweeks (2003) noted that the lack of specific psychometric
data on culturally diverse subpopulations calls for caution when
assessing and making decisions about culturally diverse youth.
Disruptive Behavior Rating Scale (DBRS)
The Disruptive Behavior Rating Scale (DBRS) (Erford, 1993) was designed to pro-
vide quick, meaningful information regarding disruptive behaviors displayed by chil-
dren ages 5-10 years. It assesses symptoms associated with distractibility, impulsive-
hyperactivity, oppositional behavior, and antisocial conduct. The DBRS can be used
as a preliminary screening tool, as part of a medical, psychological, or psychoeduca-
tional evaluation, to target specific behaviors, or as a pretest-posttest measure of
intervention effectiveness (Erford, 1993). It is available in two versions (teacher and
parent), and separate norms are provided for teachers, mothers, and fathers. To
eliminate cross-respondent confounds, each version of the DBRS contains 50 items
with identical wording. All items are answered based on a 4-point frequency scale:
0 — Rarely/Hardly Ever; 1 — Occasionally; 2 — Frequently; and 3 — Most of the
Time. The DBRS generally requires 5 to 10 minutes for administration and is easily
scored by hand or by computer (McKechnie, 2006). Raw scores are transformed into
T scores, percentile ranks, and three interpretive ranges: Abnormal (T > 66);
Borderline (60 < T < 66); and Normal (T < 60). The standardization sample under-
represented minorities, rural residents, and individuals whose parents had lower lev-
els of education (Erford, 1993).
Cronbach's alpha reliability coefficients for the DBRS subscales were well above
the minimum acceptable level (r > 0.80, discussed in Chapter 3) for the Distractible
(r = 0.92-0.95; median = 0.92); Oppositional (r = 0.86-0.96; median = 0.88); and
Impulsive-Hyperactivity (r = 0.88-0.96; median = 0.92) subscales. However, the
Antisocial Conduct subscale coefficients were substantially lower (r = 0.67-0.77;
median = 0.73), most likely because it contains only four heterogeneous items.
Similar results were found for 30-day test-retest studies. The DBRS's content, con-
struct, and criterion-related validity when compared to factors in other tests were
moderate to high (Erford, 1996, 1997a, 1998; McKechnie, 2006). Table 9.3 pro-
vides a sample of output from the DBRS computerized scoring and interpretation
system.
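The T score ranges and range-of-significance labels in Table 9.3 can be reproduced with a short sketch. Two points are assumptions rather than statements from the source: first, that each range is the T score plus or minus one SEM, which is inferred from the numbers in the table; second, that scores of exactly 60 through 66 fall in the Borderline range, since the published cutoffs do not say where the boundary values belong. The actual scoring program may work differently.

```python
def classify_t(t):
    """Map a T score to a DBRS interpretive range.

    Cutoffs follow the text (Abnormal: T > 66; Normal: T < 60). Placing
    scores of exactly 60 through 66 in Borderline is an assumption made here.
    """
    if t > 66:
        return "Abnormal"
    if t >= 60:
        return "Borderline"
    return "Normal"


def range_of_significance(t, sem):
    """Return (low, high, label) for a band of one SEM around the T score."""
    low, high = t - sem, t + sem
    low_label, high_label = classify_t(low), classify_t(high)
    if low_label == high_label:
        label = low_label
    else:
        label = f"{low_label} to {high_label}"
    return low, high, label


# T scores and SEMs copied from Table 9.3.
table_9_3 = {
    "Distractible": (67, 4),
    "Oppositional": (55, 6),
    "Impulsive-Hyperactive": (74, 5),
    "Antisocial conduct (Aux)": (55, 13),
}

report = {
    scale: range_of_significance(t, sem) for scale, (t, sem) in table_9_3.items()
}
for scale, (low, high, label) in report.items():
    print(f"{scale}: T range {low}-{high}, {label}")
```

The Antisocial conduct row shows why the SEM matters: with only four items the band is so wide that the same obtained score spans every interpretive range.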
Coping Inventory for Stressful Situations (CISS)
The Coping Inventory for Stressful Situations (CISS) is a 48-item self-report inven-
tory used to assess three major coping styles: (1) task-oriented, (2) emotion-
oriented, and (3) avoidance-oriented. Each coping style is assessed through 16
items. The CISS is based on the multidimensional interaction model of stress, anxiety,
and coping developed by Endler, one of the CISS authors. According to Endler (1997),
task-oriented coping contains efforts such as problem solving and situation chang-
ing, whereas emotion-oriented coping contains self-oriented responses such as emo-
tional reactions, self-preoccupation, and fantasizing. Avoidance-oriented coping
contains activities or cognitive changes to avoid stressful situations (for details of
the multidimensional interaction model and the three coping styles, see Endler,
1997). There are two versions of the CISS: an Adolescent version (ages 13-18) and
an Adult version (ages 18 and older). Paper-and-pencil record forms called
"QuikScore™" are available. A 21-item brief format for adults (CISS: Situation
Specific Coping [CISS:SSC]) is also available to assess coping style in situations in-
volving social evaluation and interpersonal conflicts (Multi-Health Systems, Inc.,
2003). Current cost information and online ordering information are available on
the website of Multi-Health Systems (2003).
Each item of the CISS is formatted on a Likert scale ranging from 1 (Not at All)
to 5 (Very Much). The CISS takes approximately 10 minutes to complete and has a
Level A qualification for administration and interpretation. An examiner scores the
CISS using a scoring grid and obtains a percentile rank and T score using a profile
sheet on the back side of the scoring grid. Provided scales are Task, Emotion, and
Avoidance. Avoidance consists of two subscales: Distraction and Social Diversion.
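The first scoring step, summing item ratings into raw scale totals, can be sketched as follows. With 16 items per scale and ratings from 1 to 5, each raw total falls between 16 and 80. The item-to-scale key below is a placeholder invented for this sketch and is NOT the published CISS key; converting raw totals to percentile ranks and T scores requires the norms on the profile sheet and is not attempted here.

```python
SCALES = ("Task", "Emotion", "Avoidance")
N_ITEMS = 48

# PLACEHOLDER key (hypothetical): items are dealt to the three scales in turn
# so that each scale receives 16 items. The real key is on the scoring grid.
PLACEHOLDER_KEY = {item: SCALES[(item - 1) % 3] for item in range(1, N_ITEMS + 1)}


def raw_scale_totals(responses, key=PLACEHOLDER_KEY):
    """Sum 1-to-5 ratings into one raw total per scale."""
    if set(responses) != set(key):
        raise ValueError("every item must be answered exactly once")
    totals = dict.fromkeys(SCALES, 0)
    for item, rating in responses.items():
        if rating not in range(1, 6):
            raise ValueError(f"item {item}: rating must be 1 to 5")
        totals[key[item]] += rating
    return totals


# Hypothetical respondent who chooses the midpoint (3) on every item.
responses = {item: 3 for item in range(1, N_ITEMS + 1)}
totals = raw_scale_totals(responses)
print(totals)
```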
Norms are provided for adults and adolescents (Tirre, 2003). For adults, sepa-
rate male and female norms are provided for general-population and psychiatric pa-
tients, respectively. For adolescents, separate norms are provided for individuals ages
13-15 years and 16-18 years. Separate college student norms are also available.
Endler (1997) found sufficient internal consistency and test-retest reliability for the
CISS. Endler (1997) also found the scores of the CISS to be valid. Through an examination
of construct validity, Endler discovered that some CISS scales were significantly
correlated with related measures, such as the Beck Depression Inventory (BDI)
and the Eysenck Personality Inventory (EPI).
Professional counselors interested in using the CISS are encouraged to explore
Endler's multidimensional interaction model prior to use. Endler (1997) insisted on
the necessity of examining not only the interaction between person and situation
Table 9.3 Computerized DBRS report for a 7-year-old boy named Billy
Respondent's name: Mrs. Jones, his teacher

Summary Statistics and Critical Analysis Tables

Scale                      Raw score   SEM   T score; Range   %ile Rank; Range   Range of significance
Distractible               21          4     67; 63-71        96; 91-98          Borderline to Abnormal
Oppositional               5           6     55; 49-61        69; 47-87          Normal to Borderline
Impulsive-Hyperactive      21          5     74; 69-79        99; 97-99.81       Abnormal
Antisocial conduct (Aux)   1           13    55; 42-68        69; 21-96          Normal to Abnormal

Critical Item Analysis

Scale                      Item   Statements
Distractible               8      Doesn't seem to remember what is said.
                           22     Has difficulty following simple instructions.
                           31     Does not finish activities undertaken.
Oppositional               None
Impulsive-Hyperactive      3      Calls out unexpectedly.
                           6      Fidgety.
                           10     Finds it hard to await turn in group situations.
                           13     Restless, squirmy.
                           21     Interrupts.
                           34     Has difficulty sitting still.
                           42     Finds it hard to play quietly.
Antisocial conduct (Aux)   None
Interpretation
The Disruptive Behavior Rating Scale — Teacher Version (DBRS-T) is a 50-item inventory of common childhood behaviors
associated with distractible, impulsive-hyperactive, oppositional, and antisocial behavior. Mrs. Jones's responses to the DBRS-T
indicate that Billy is observed to perform in the Borderline to Abnormal range of distractible behavior. Billy is more distractible
than approximately 96% of boys his age. Billy is having particular difficulty remembering what is said, following simple
instructions, and finishing activities undertaken.
Billy is observed as performing in the Abnormal range of impulsive-hyperactive behavior. Billy is more impulsive-hyperactive
than approximately 99% of boys his age. Billy displays a significant inclination toward calling out unexpectedly,
fidgeting, not awaiting turns in group activities, restless squirming, interrupting, difficulty in sitting still, and difficulty in playing
quietly.
Additionally, Billy performs in the Normal to Borderline range of oppositional behavior. Billy is more oppositional than
approximately 69% of boys his age. No critical items were determined for this factor.
Finally, Billy is rated to perform in the Normal to Abnormal range of antisocial conduct. Billy is more antisocial than
approximately 69% of boys his age. No critical items were determined for this factor.
A diagnosis of Attention-Deficit Hyperactivity Disorder (AD/HD) should be considered. Validation of these findings
through multiple methods of evaluation and multiple informants is recommended.
variables, but also the interaction within person variables (e.g., cognitive style, bio-
logical variables) and situation variables (e.g., stressful events, physical environ-
ments), given that "stress, anxiety, and coping all involve complex processes and all
interact with one another" (Endler, 1997, p. 149).
SUMMARY/CONCLUSION
Professional counselors should remember that the referral question should always
drive the assessment process. All too often, assessment reports are driven by a one-size-fits-all
approach. It is important to gather data from multiple methods and multiple
informants to evaluate how the identified individual differs from other individuals
in the population (nomothetic comparisons) and to identify specific targets for
remediation or therapy. Professional counselors should use a combination of behavioral
interviews, rating scales, inventories, and direct observations to obtain a comprehensive
picture of the client and the specific referral concerns. Doing so not only
provides appropriate services but constitutes best practices for ethical and legal obligations
of service provision in the area of assessment.
KEY TERMS
behavioral assessment
behavioral interview
direct assessment
functional behavioral assessment (FBA)
indirect assessment
interval recording
narrative recording
operational definition
self-monitoring
CHAPTER 10
Assessment of Intelligence
by Bradley T. Erford, Lauren Klein, and Kathleen McNinch
Intelligence is an important human characteristic with robust applications to the
areas of academic achievement, career development, and psychopathology. There
is no commonly accepted definition of intelligence, and numerous models have
been offered to explain and measure this construct. This chapter explores these mod-
els and reviews many of the individual and group-administered tests designed to
measure intelligence. In addition, important societal and educational issues and im-
plications are discussed.
WHAT IS INTELLIGENCE?
"She's really smart." "He's about as bright as a burned-out light bulb." "She should
aspire to raise her IQ to room temperature." "He's brilliant, simply brilliant!" At
some time, most people have overheard (or perhaps made) a judgment about their
own or someone else's probable level of intelligence. For more than a century, theo-
rists and test developers have attempted to define and operationalize "intelligence."
In 1921, 17 experts responded to an invitation by the editor of the Journal of
Educational Psychology to define and describe their perspectives on intelligence. In
1986, Sternberg and Detterman similarly consulted leading experts in the field. The
result in both cases: The experts revealed great diversity and little commonality in
their conceptions of what intelligence entails. Charles Spearman (1927, p. 14), a fa-
mous theoretician and researcher in the field of intelligence, pessimistically con-
cluded, "In truth, intelligence has become ... a word with so many meanings that
finally it has none."
Various definitions of intelligence emphasized at least one of the following com-
ponents (Sax, 1997): (1) origin — whether intelligence is inherited, learned, or both;
(2) structure — its traits, facets, or components; and (3) function — its purpose, usually
to aid in adjustment or survival. In a broad sense, intelligence is a human-contrived
construct used to explain one's (genetic and/or learned) abilities to reason through
and solve problems or dilemmas of importance to human adaptation. And as if
defining intelligence isn't challenging enough, measuring it is even harder! The pre-
mier challenge confronting researchers and test developers in the field of intellectual
assessment is to operationally define the construct of intelligence from often-diver-
gent theoretical perspectives. Therefore, nearly all tests of intelligence available for
use today measure some conception of cognitive capability, but each does so from a
somewhat different perspective.
The term intelligence testing is virtually synonymous with the terms cognitive
ability testing and mental ability testing. However, the term aptitude, while overlap-
ping in many ways with intelligence, is a concept that implies a more specialized use
of intellectual, perceptual, and motor abilities — usually with vocational or educa-
tional applications. The area of aptitude assessment will be covered in further detail
in Chapter 11. Intelligence testing is undertaken to estimate a client's ability to comprehend
and express verbal information; to solve problems through verbal or nonverbal
means (i.e., spatial, figural, visual); to learn and remember information (i.e.,
short-term, long-term); and to assess information processing efficiency. In short, in-
telligence is a useful and robust concept with widespread clinical applications.
While professional counselors may not frequently be the professional adminis-
tering a given intelligence test, it is essential that professional counselors understand
the nature of intelligence, the practical features of intelligence tests, and how these
tests are used for clinical and educational decision making and for treatment and re-
medial planning. For example, professional school counselors and other educational
personnel use intelligence tests to help determine a student's eligibility for special ed-
ucation services under the Individuals With Disabilities Education Improvement Act
(IDEIA), and often for educational accommodations under Section 504 of the U.S.
Rehabilitation Act of 1973. Mental health and community counselors use intelli-
gence test information to establish effective treatment plans and to advocate on be-
half of clients with special needs. Career and professional school counselors use in-
telligence test information to help students and clients with educational planning
and career choices. Intelligence test results are helpful decision-making tools appli-
cable across a wide gamut of life decisions.
Think About It 10.1 How could intelligence testing be beneficial in
working with students?
Unfortunately, there is no widespread consensus over the definition of intelligence.
Various researchers and test developers have conceived of very diverse theories of,
and perspectives on, intelligence. Indeed, one could support the assertion that each
intelligence test published and available today has a somewhat different theoretical
underpinning. The differences are frequently slight, at other times vast. But keep in
mind, all purport to measure this concept referred to as "intelligence."
NATURE AND THEORIES OF INTELLIGENCE
For more than a century, numerous researchers and test developers have attempted
to define the construct of intelligence. While there is great diversity in these concep-
tions of intelligence, typically intelligence tests measure, to a greater extent, verbal
abilities and, to a lesser extent, abstract visual reasoning and quantitative skills. There
is also general agreement that speed and efficiency of problem-solving capacities are
characteristic of individuals with higher levels of intelligence (Jensen, 1985).
Snyderman and Rothman (1986) surveyed 661 testing authorities, virtually all of
whom agreed that intelligence involves, at a minimum, capacity to acquire knowl-
edge, abstract reasoning, and general problem-solving capabilities. Some (e.g.,
Gardner, 1983) even integrate personality variables into their definition. What follows
is a brief exploration of some conceptualizations, theories, and models of intelligence
developed over the past century. Note how the construct of intelligence has
at times evolved from simpler to more complex explanations, while at other times
divergent pathways have led to new theoretical models and orientations.
Historical Conceptualizations of Intelligence
In the late 19th century, Sir Francis Galton and James McKeen Cattell believed in
the importance of sensory acuities and capabilities as indications of intellectual
prowess, because all information about the external world (and thus all potential
learning) entered through the senses. To their way of thinking, the more highly de-
veloped and attuned one's senses, the more intelligent one could become. While
plausible on its surface, such a perspective fails to account for thinking or reasoning
processes. In 1890, Cattell coined the term mental test, giving rise to the field of
study now known as intellectual assessment. Unfortunately, from early on, other re-
searchers (Wissler, 1901) demonstrated that the type of "intelligence" Cattell and
others were proposing had little relationship to academic performance, failing to ex-
plain why some students, particularly at the university level, do better or more poorly
than others. Interestingly, Wissler's results later were criticized for using a sample
with a restricted range of ability — a flaw that suppresses the magnitude of a correla-
tion coefficient — as is discussed on the companion website.
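The effect of range restriction on a correlation can be illustrated with a small simulation. The sketch below is not drawn from Wissler's data; the sample size, the assumed full-range correlation of 0.60, and the decision to keep only the top 10% of scorers are all hypothetical values chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_restriction(n=20000, true_r=0.6, keep_top=0.10, seed=1):
    """Compare r in a full sample with r among only the highest scorers.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    # Standardized "ability" scores for the full population.
    ability = [rng.gauss(0, 1) for _ in range(n)]
    # Build "grades" so that they correlate about true_r with ability.
    noise_weight = math.sqrt(1 - true_r ** 2)
    grades = [true_r * a + noise_weight * rng.gauss(0, 1) for a in ability]
    r_full = pearson_r(ability, grades)

    # Restrict the range: keep only the top scorers on ability.
    cutoff = sorted(ability)[int(n * (1 - keep_top))]
    kept = [(a, g) for a, g in zip(ability, grades) if a >= cutoff]
    r_restricted = pearson_r([a for a, _ in kept], [g for _, g in kept])
    return r_full, r_restricted


r_full, r_restricted = simulate_restriction()
print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

In a run of this sketch the correlation among the retained high scorers should come out noticeably smaller than the full-sample value, which is the pattern critics pointed to in a sample drawn only from university students.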
From the early reliance on sensory processing, definitions of intelligence evolved
with a heavier focus on internal thinking and reasoning processes. At the same time,
however, the concept of intelligence was also discussed primarily as a general, unidi-
mensional construct.
Alfred Binet
Alfred Binet defined intelligence as the "tendency to take and maintain a definite di-
rection; the capacity to make adaptations for the purpose of attaining a desired end"
(as cited in Terman, 1916a, p. 45). Binet and Henri (1895a, 1895b, 1895c) studied
facets of human intelligence that were far more complex and less easily measured
than the simple sensory functions observed by Galton, including tasks of reasoning,
comprehension, memory, judgment, and abstraction (Varon, 1936). Binet believed
distinct thinking abilities were integrated into a general ability that was called on
when solving problems. Thus, when one is solving a problem such as, "What should
you do if your boat begins to sink in the middle of a large lake?" Binet believed that
it was difficult to sort out the influence of, say, practical experience, memory, rea-
soning, and verbal facility in the construction of an acceptable answer. This prelim-
inary research led to the development of the first functional individual intelligence
test by Binet and Simon (1905).
David Wechsler
Wechsler (1955, p. 7) once wrote:
Intelligence, operationally defined, is the aggregate or global capacity of the indi-
vidual to act purposively, to think rationally, and to deal effectively with his en-
vironment. It is aggregate or global because it is composed of elements or abili-
ties which, though not entirely independent, are qualitatively differentiable. . . .
The only way we can evaluate it quantitatively is by the measurement of the var-
ious aspects of these abilities.
In 1939, Wechsler developed a test to measure the intelligence of individual
adults. His test was composed of a collection of subtests adapted from the Army
Alpha and Beta tests from World War I. His verbal subtests were modeled after items
from the Army Alpha, and his performance subtests were modeled after items from the
Army Beta. Combining the scores from the verbal and performance subtests yielded
a full-scale intelligence estimate that Wechsler believed to be a good representation of g.
However, the development of the original test and its various revised editions was
driven more by clinical practice and implications than by theoretical considerations.
Wechsler clearly acknowledged a general factor (g) composed of multiple com-
ponents, and his intelligence tests, which will be discussed later in this chapter, have
become the most commonly used in history. However, it is important to remember
that while Wechsler stressed the essential role of cognitive capabilities in intellectual
capabilities, he also recognized that a comprehensive understanding of intelligence
involved noncognitive capacities, including "capabilities more of the nature of con-
native, affective, or personality traits . . . such as drive, persistence, and goal aware-
ness . . . [and] ... an individual's potential to perceive and respond to social, moral,
and aesthetic values" (Wechsler, 1975, p. 136).
Piaget's developmental model
Swiss developmental psychologist Jean Piaget has made important theoretical contri-
butions to the understanding of childhood intelligence (1954, 1971). Piaget believed
that the function of intelligence was to help humans to adapt to the environment.
As individuals become more intelligent, they progress through more advanced levels
of symbolic representation. Eventually, physical trial and error is replaced by mental
trial and error. To Piaget, learning was a consequence of an individual's interacting
with the environment and encountering dilemmas that required mastery through a
reorganization of thought. These organized structures were called schemata. Infants
are born with some schemata (e.g., sucking, grasping) and learn about the environment
by coordinating these schemata to take in new information (Cohen &
Swerdlik, 1999). Thus infants may grasp objects and place them into their mouths
to more fully appreciate the object. Eventually, schemata of greater and greater com-
plexity develop, departing from sole reliance on the physical realm and leading to
cognitive transformations. As the individual interacts with the environment, existing
schemata are constantly being refined, and new schemata are formed.
Piaget proposed two methods by which humans organize their cognitive struc-
tures and adapt to new contexts. Assimilation is the process by which individuals
make sense of new information in terms of a structure or process that already exists.
For example, small children generally know what a dog is and know a dog when they
encounter it or see a picture of it. Every time they see a dog and recognize it as a dog,
they are assimilating this information — making sense of it in terms of an existing
structure. New information is related to old structures. Accommodation is the
process by which individuals make sense of new information by changing the exist-
ing structure or process, thus creating an adapted schema. For example, eventually
children recognize that there are different types of dogs (e.g., golden retrievers,
beagles, miniature poodles) and restructure the previously existing category (i.e., they're
all dogs) into new, more meaningful categories designated by using diverse dog breed
names. Thus new information is reorganized into new ways of thinking.
Piaget was also instrumental in developing the now common assumption that
there are qualitative differences in the way children think at various ages. His theory
proposes four stages of development. The sensorimotor stage occurs during the first
two years of life, and cognitions of the infant and toddler are basically limited to
the sensory processes in the immediate environment (i.e., touch, taste, smell, sight,
hearing). The preoperational stage generally occurs between ages 2 and 7 years and
evolves from the child's emerging ability to reason symbolically (i.e., to use words to
symbolize objects). The concrete operational stage involves the beginnings of logical,
systematic thinking and generally develops between ages 7 and 12 years. Concepts
such as conservation and reversibility of operations are important, but problem solving
is still dominated by direct, immediate experiences. Piaget's highest level of
reasoning was called the formal operational stage. Generally emerging around age 12
years, this period is marked by problem-solving strategies that rely on increasing levels
of abstract, systematic, hypothetical thinking. Individuals can evaluate their own
thought processes (metacognition) and more easily see how several variables relate
and interact and how they can be used to learn and predict. Piaget's theory has been
very influential in education, shaping curricular activities, materials, and programs.
Think About It 10.2 How would being aware of Piaget's stages of development
be useful when working with children?
Figure 10.1 Spearman's two-factor theory of intelligence
(Diagram labels: general intelligence factor (g); verbal ability (s₁); quantitative ability (s₂); mechanical skills (s₃); visual-spatial reasoning (s₄).)
Spearman's general-factor theory (g)
A British statistician and psychologist named Charles Spearman, the innovator of a
useful statistical technique now known as factor analysis, proposed a theory of intel-
ligence that is referred to as both a "two-factor theory" and a "general factor theory"
of intelligence. His theory proposed that a general factor (g) stands at the center of
one's cognitive capacity, and that (perhaps numerous) specific factors (s₁, s₂, s₃, . . . sₙ)
are related to the general factor and help explain nuances and specialized characteristics
observed in individuals. Spearman noticed that all measures of intelligence
were positively correlated with academic performance, leading him to think that a
common construct (the general ability factor [g]) underlay these measures and cre-
ated the positive associations. Figure 10.1 provides a pictorial representation of
Spearman's theory.
Spearman also noticed that as he began to aggregate (i.e., add together) scores
obtained on the simple sensory tasks and the reasoning and comprehension tasks
commonly associated with intelligence at that time, the measures correlated in the
0.30s with academic performance (Francher, 1985), substantially enhancing the pre-
dictive usefulness of these tasks. Spearman became convinced that all measures of in-
telligence were simply facets related to the general intelligence factor (g). Thus two
tests measuring different facets of intelligence would overlap to some extent, depending
on the strength of their relationship to g. He reasoned that if all intelligence tests
measured only general mental ability, the correlations between these tests would
approach r = 1.00. However, because these correlations were significantly less than
r = 1.00, he assumed that a diverse set of specific factor elements (s₁, s₂, etc.) were what
prevented the perfect correlations. "Spearman referred to g as the total mental energy
available to a person while the s factors were the engines through which this energy
was applied" (Janda, 1998, p. 209). Some cognitive tasks required more general
ability (g) than others, but all cognitive tasks required at least some. Spearman's
two-factor theory of intelligence was an important advance but was far from universally
accepted.
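Spearman's reasoning about imperfect correlations can be made concrete with a short simulation. The sketch below assumes a simple linear reading of the two-factor theory, in which each standardized test score is a weighted sum of g and one test-specific factor; the weights (called loadings here), the sample size, and the function names are hypothetical and are not taken from Spearman's own work. Under that assumption, the expected correlation between two tests is the product of their g loadings, so it reaches 1.00 only when both tests measure nothing but g.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_two_tests(loading_a, loading_b, n=50000, seed=7):
    """Correlation between two tests that share g but have separate s factors.

    loading_a and loading_b are hypothetical g loadings between 0 and 1.
    """
    rng = random.Random(seed)
    test_a, test_b = [], []
    for _ in range(n):
        g = rng.gauss(0, 1)    # general factor, shared by both tests
        s_a = rng.gauss(0, 1)  # specific factor for test A only
        s_b = rng.gauss(0, 1)  # specific factor for test B only
        # Each score has unit variance: loading**2 from g, the rest from s.
        test_a.append(loading_a * g + math.sqrt(1 - loading_a ** 2) * s_a)
        test_b.append(loading_b * g + math.sqrt(1 - loading_b ** 2) * s_b)
    return pearson_r(test_a, test_b)


r_ab = simulate_two_tests(0.8, 0.7)
print(f"simulated r = {r_ab:.2f}, product of loadings = {0.8 * 0.7:.2f}")
```

Raising either loading toward 1.0 pushes the simulated correlation toward 1.00, while larger specific-factor contributions pull it down, mirroring the argument that the s factors are what prevent perfect correlations.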
Multiple-Factor Models
Multiple-factor theories propose that one's intellectual makeup is composed of
many components that are more or less independent of each other. For example,
while most people have normal or average verbal and visual-spatial reasoning abili-
ties, others may be weak in both areas, and still others may be strong in both areas.
Notice that, so far, this is in keeping with Spearman's general-factor theory. However,
many people are normal or strong in verbal reasoning, but weak in visual-spatial rea-
soning, and vice versa. The intellectual structure of these individuals is not explained
by a single, general factor but is better explained by a theory that suggests that these
two factors are independent and should vary according to individual cognitive
strengths and weaknesses. Of course, the more factors that are included in the the-
ory, the more complex the scenarios can become.
Thurstone's Primary Mental Abilities
An American psychologist, Louis L. Thurstone, from the University of Chicago, pro-
posed that a collection of mostly independent primary abilities underlay intelligence,
rather than the global general factor and multitude of specific factors proposed by
Spearman. Interestingly, one of the things we know today about factor analysis that
wasn't widely known 75 years ago is that the number of factors derived is in large
part due to the number and diversity of the input (i.e., items, subtests, tests). Using
the statistical technique multiple-factor analysis, Thurstone analyzed responses of
more than 200 college students to 56 ability tests and derived 13 mental factors. He
eventually settled on seven primary mental abilities, described in Table 10.1. It is
important to understand that even Thurstone admitted that these factors were not
totally independent, and that any given intelligence test could measure one, several,
or even all of these dimensions. In fact, Thurstone developed the Primary Mental
Abilities intelligence test in 1938 to do just that. Unfortunately, Thurstone's own test
showed that several of the factors were highly correlated (e.g., the Verbal and
Reasoning factors correlated nearly r = 0.60), calling into question the independence
of these components of intelligence. Of course, critics of multiple-factor models were
quick to explain this observation by using Spearman's general-factor model. Perhaps
the most damaging contradiction of Thurstone's model is the inclusion of a total-scale
score for the Primary Mental Abilities test, an admission, although perhaps inadvertent,
that a general global factor has some interpretable meaning or predictive usefulness.

Table 10.1 Thurstone's seven primary mental abilities
Ability | Description
Verbal Comprehension (V) | Assesses understanding and expression of ideas using language. (V) is measured by tasks involving vocabulary, analogies, and reading comprehension.
Number (N) | Assesses ability to solve numeric problems using basic math processes. (N) is measured by tasks involving rapid, accurate computation of simple math problems, story problems, and math calculation.
Word Fluency (W) | Assesses fluency of speech and writing. (W) is measured by tasks such as anagrams and word naming (e.g., words ending in -ing).
Spatial (S) | Assesses ability to visualize patterns and rotate objects in space. (S) is measured by tasks involving three-dimensional visualization, matrices, and block designs.
Reasoning (R) | Assesses inductive thinking and problem solving. (R) is measured by tasks involving logic, discerning a rule of operation or pattern, and number sequence patterns.
Memory (M) | Assesses rote memorization of information. (M) is measured by tasks involving recall of sentences, letters, digits, words, etc.
Perceptual Speed (P) | Assesses ability to quickly note and discriminate visual details. (P) is measured by tasks involving identification of similarities and differences in pictures or geometric objects.

Table 10.2 Factors of the Horn-Cattell model
Designation | Name | Description
Gf | Fluid intelligence | Nonverbal reasoning, novel circumstances
Gc | Crystallized intelligence | General knowledge, verbal comprehension and reasoning
Gq | Quantitative ability | Understanding and problem solving using mathematical concepts and symbols
Gv | Visual processing | Receiving and making decisions using visual and spatial stimuli
Ga | Auditory processing | Receiving and making decisions using auditory stimuli
Gs | Processing speed | Ability to maintain attention and make quick, accurate decisions
Gsm | Short-term memory | Ability to maintain and use information over a short time period (seconds to minutes)
Glr | Long-term retrieval | Ability to encode and store information for retrieval and use over a long time period (hours to years)
Horn-Cattell Gc/Gf model
Raymond Cattell (1943, 1963, 1971, 1979) proposed that intellectual abilities could
be divided into two broad categories or second-order factors. Fluid abilities (Gf) were
primarily inherited, perceptual capabilities thought to be mostly free of potential so-
ciocultural bias. Tests measuring visualization, nonverbal, and spatial reasoning capa-
bilities are direct assessments of fluid ability. Crystallized abilities (Gc) were primarily
learned, acquired knowledge and skills that were socioculturally laden and heavily in-
fluenced by formal and informal educational experiences. Tests measuring vocabulary,
general information, verbal abstract reasoning, and social comprehension directly as-
sess crystallized ability. Importantly, Cattell proposed that fluid and crystallized abil-
ities are significantly correlated, especially among those who share a common cultural
and educational background. Thus no pretense of factor independence was offered.
In 1966, Cattell and John Horn became the major proponents of this model,
and the model was expanded by Horn and his colleagues in subsequent years to add
on additional factors derived through rational and factor analytic studies of multiple
test batteries. Currently, the Horn-Cattell model espouses eight components (see
Table 10.2), many of which have more or less provided the theoretical underpinnings
of the Stanford-Binet Intelligence Scales, now in its fifth edition (SB-5) (Roid,
2003), and to a greater extent, the Woodcock-Johnson Tests of Cognitive Abilities,
Third Edition (WJ-III COG) (Woodcock, Mather, & McGrew, 2001).
Figure 10.2 Guilford's structure-of-intellect model
(Diagram labels: Contents, Products, Operations.)
Guilford's Structure-of-Intellect Model
Guilford (1967, 1988; Guilford & Hoepfner, 1971) also used factor analysis to dis-
cern a model of intellect but arrived at quite different conclusions than Spearman or
Vernon about the existence of g, and he rejected Thurstone's argument of the existence
of a number of independent primary mental abilities. Instead, Guilford proposed
a theory in which 3 dimensions gave rise to approximately 180 unique specific
factors (see Figure 10.2), as expressed within a 6 × 5 × 6 boxlike matrix. The first
dimension, mental operations, indicates what an individual does and includes 6
components: cognition, memory recording, memory retention, divergent production,
convergent production, and evaluation. The second dimension, contents, indicates
the materials upon which the individual performs various operations and includes 5
components: visual, auditory, symbolic, semantic, and behavioral. The final dimension,
products, indicates the format into which individuals store and process information
and includes 6 facets: units, classes, relations, systems, transformations, and
implications. Each of the resulting 180 cells may contain a specific factor or a
combination of specific factors, but each factor can be described in terms of its 3
components. Guilford's model has had little impact on the standardized measurement of
intelligence, but nonetheless is a helpful model for understanding intelligence,
particularly as applied to education.
Figure 10.3 Hierarchical ability model proposed by Vernon and Humphreys
(Diagram labels: general intelligence (g); second-order factors Verbal:Educational (v:ed) and Practical (k:m); major facets/factors Verbal, Quantitative, Mechanical, and Spatial; specific facets/factors; Humphreys's modification for nonassigned specific facets/factors.)
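The size of Guilford's 6 × 5 × 6 matrix can be verified by enumerating every combination of one operation, one content, and one product. The sketch below uses the component names listed in the text; the variable names and the tuple representation of a cell are illustrative choices, not part of Guilford's model.

```python
from itertools import product

# Component names as listed in the text for each of the three dimensions.
OPERATIONS = (
    "cognition",
    "memory recording",
    "memory retention",
    "divergent production",
    "convergent production",
    "evaluation",
)
CONTENTS = ("visual", "auditory", "symbolic", "semantic", "behavioral")
PRODUCTS = (
    "units",
    "classes",
    "relations",
    "systems",
    "transformations",
    "implications",
)

# Each cell is one (operation, content, product) combination.
cells = list(product(OPERATIONS, CONTENTS, PRODUCTS))

print(len(cells))  # number of cells in the matrix
print(cells[0])    # the first cell, as an (operation, content, product) tuple
```

Describing a factor by its three coordinates, for example ("evaluation", "semantic", "relations"), corresponds to the statement that each factor can be described in terms of its 3 components.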
Hierarchical Models
Vernon (1960, 1965) suggested a model of intelligence that in some ways is a com-
promise between the divergent theories proposed by Spearman and Thurstone.
Vernon agreed that g underlay all facets of intelligence but noticed that the correlations
among certain clusters of various types of intelligence tests or subtests were too high
to conclude that g was the only factor accounting for the relationship. He proposed
that two second-order factors comprised g, namely Verbal:Educational (v:ed) and
Practical (k:m) aptitudes. From these second-order factors, various skill areas branch
off, which may be broken down into even lower-level facets (see Figure 10.3). For
example, the Verbal:Educational factor may be assessed using tests measuring verbal
comprehension and
quantitative skill. Verbal comprehension skills may be further delineated and assessed
by tests measuring vocabulary development, social comprehension, general informa-
tion, and verbal abstract reasoning. These latter tests are more similar to the s factors
proposed by Spearman or the individual cells proposed by Guilford.
Other hierarchical models, such as the one proposed by Humphreys (1962,
1970), argued for more flexibility in accounting for or assigning specific factors to
higher-level factors. For example, it can be argued that in testing one's ability to solve
analogies, it is helpful to use spatial, verbal, and numerical cues, each of which is rep-
resented by specific factors. While the practical and theoretical applications of hier-
archical models have allowed them to grow in popularity (Anastasi & Urbina, 1997),
a primary limitation remains the lack of empirical validation of the model (Sax,
1997).
Sternberg's Triarchic Theory: An Information Processing Approach
Sternberg (1988), using an information processing perspective, described a triarchic
model, so named because it was composed of three aspects (subtheoretical compo-
nents) of intelligence: componential (the person's internal world), experiential (the
person's external world and adaptation to novelty), and contextual (the person's exter-
nal world and environmental adaptation or creation). This theory arose from
Sternberg's (1986, p. 33) belief that intelligence involved "purposive adaptation to,
shaping of, and selection of real-world environments relevant to one's life." Sternberg
stated that available tests of intelligence failed to measure the complex processes pro-
posed in his theory. Sternberg's primary criticism of currently available intelligence
tests is that they measure primarily memory and analytical reasoning skills that are
useful in predicting school performance, predominantly because they are contextu-
alized to school and learning problems, are short, and have a single correct answer.
He believes these tests have little usefulness in predicting "real-world" performances
people encounter in the world of work, what some call practical intelligence.
In the componential subtheory, Sternberg identified three facets as being critical
to the efficiency with which individuals process information. Metacomponents allow
people to plan purposeful activities, self-monitor the implementation of these plans,
and self-evaluate the effectiveness of the implementation. These are higher-level cog-
nitive processes, sometimes called executive functioning, that help explain why some
very bright and talented people accomplish a lot and others accomplish very little.
According to Sternberg, the very intelligent person focuses on important tasks and
issues — what some refer to as the "big picture" — plans them out, and accomplishes
them. Less intelligent people focus on issues and situations that are less important —
what some call the "little picture." Performance components allow individuals to
process diverse information with varying degrees of efficiency by using mental skills
such as information retrieval, encoding, or comparing. Knowledge acquisition in-
volves an individual's capacity to select information relevant to a given problem con-
text and then to compare and combine it with other relevant information, leading
to insights, connections, and, eventually, new learning. Obviously, the more efficient
one is at making relevant connections and gaining necessary insights, the greater
one's capacity for learning (i.e., intelligence).
The experiential subtheory views intelligence as an interplay of experience and in-
formation processing. Thus, experienced individuals often appear more intelligent but
only because they have encountered a problem in the past and recall how to resolve it
appropriately. According to Sternberg, novel situations present a level playing field to
determine adaptability and problem solving, because such circumstances favor those
who process information more quickly and efficiently. In this way, Sternberg valued
"automaticity," the ability to quickly learn information, processes, and procedures,
thus freeing up the resources necessary for adaptation to novel situations.
Finally, Sternberg's contextual subtheory involves adaptability in the external
world, the context for practical, pragmatic decision making that allows humans to
shape, adapt, and select environments in which to thrive. For example, we have all
known individuals who did not do well in school but had a knack for adapting to
new situations (contexts) and who did quite well for themselves. These individuals
read and adapt to the environmental context.
In 1994, Sternberg refined his theory by altering his terminology to include the
terms memory-analytic, synthetic-creative, and practical-contextual abilities. Sternberg
viewed memory-analytic functions as commonplace in education and science today,
where people construct defined and delimited problems with predictable and "cor-
rect" solutions. Synthetic-creative problems are those that are not entrenched in
common assumptions, such as when an illogical assumption is given and the exam-
inee is required to follow the assumption to its inevitable conclusion. Such out-of-
the-box thinking requires flexible cognitive and reasoning processes that are difficult
to teach, but which are nonetheless critical to creative problem solving.
Practical-contextual abilities, also termed tacit knowledge or practical intelligence, were defined
as "action-oriented knowledge, acquired without direct help from others, that allows
individuals to achieve goals they personally value" (Sternberg, Wagner, Williams, &
Horvath, 1995, p. 916). Practical-contextual tasks help explain why some individu-
als who score low on traditional tests of intelligence are able to solve sometimes com-
plex everyday situations with more ease than their "more intelligent" counterparts.
As an application of Sternberg's theory, Table 10.3 contains the types of items de-
rived from a triarchic model.
Fundamental to Sternberg's theory is that intelligence is not set; it is malleable
and continually developing. Moreover, the display of an individual's intelligence can
vary from one context to another; that is, people may be absolutely brilliant when in
their "element" (e.g., the board room or chemistry lab), but substantially less so when
not (e.g., the kitchen or nursery).
Gardner's Multiple Intelligences
Howard Gardner (1983, 1993) rejected the existence of g and identified eight dis-
tinct intelligences that aid in an individual's adaptation to the environment. He de-
fined intelligence as the ability "to resolve general problems or difficulties as they are
encountered" (Gardner, 1983, p. 60) and identified the following eight intelligences:
(1) verbal-linguistic, (2) logical-mathematical, (3) spatial, (4) musical, (5) bodily-
kinesthetic, (6) interpersonal, (7) intrapersonal, and (8) naturalist (see Table 10.4).
Gardner criticized current tests of intelligence for being primarily measures of verbal,
spatial, and logical reasoning while ignoring other abilities that are, in some ways, so
Assessment of Intelligence 331
Table 10.3 Item types derived from Sternberg's triarchic model
Item type
Description
Componential: Verbal
Componential: Quantitative
Componential: Figural
Assesses a student's verbal ability when learning from relevant contexts, such as when a
word is used in the context of a sentence and a student is asked to infer the word's meaning
from context.
Assesses numerically based inductive reasoning abilities by extrapolating from sequences of
numbers. For example: When given the following sequence of numbers: 2, 4, 8, 16, ? :
the student would choose 32 from a list of possible answers.
Assesses inductive reasoning abilities through figure classifications and analogies. For
example:
O
B.
(b)
(c)
Coping With Novelty: Verbal
Coping With Novelty:
Quantitative
Assesses the ability to think in relatively novel ways using hypothetical thinking or novel
verbal analogies requiring counterfactual reasoning. For example: Assume snowflakes are
made of sand. Which solution is now correct, given the assumption? Water is to drop as
snow is to: (a) storm, (b) beach, (c) grain, (d) ice.
Assesses quantitative coping with novelty skills by using number matrix items, but with an
element of novelty. Usually, items involve symbols used in place of certain numbers and
require the examinee to make a number substitution. For example:
12
Coping with novelty: Figural
(a) 14, (b) 4, (c) 17, (d) 8.
Assesses a student's ability to complete a pictorial series in a "newly mapped domain," (not
the domain in which the student has constructed or inferred the rule). For example,
A. A
□
B.
□
continued
332 Chapter 10
Table 10.3 continued
Item type
Description
(a)
( c, D
(d)
Automatization: Verbal
Automatization: Quantitative
Automatization: Figural
Assesses rapid decisions of a verbal nature. For example, are the following letters from the
"same" category (both vowels, both consonants) or "different" categories (vowel or
consonant): "b, n" (same); "e, m" (different); "u, o" (same); "g, i" (different).
Assesses rapid decisions of a quantitative nature. For example, are the following numbers
from the "same" category (both odd, both even) or "different" categories (odd or even): "2,
4" (same); "9, 6" (different); "7, 3" (same); "8, 5" (different).
Assesses rapid decisions of a figural nature. For example, do the following figures have the
"same" or "different" numbers of sides?
c.
Practical: Verbal
Practical: Quantitative
Practical: Figural
Assesses practical, everyday problem-solving abilities requiring verbal inferential reasoning.
For example: The sign at Bill's Market reads, "The lowest meat prices in town." If the ad is
for real, which of the following is most likely true?
(a) Bill's Market charges more than Sam's.
(b) No other market charges less than Bill's.
(c) Bill is a successful businessman.
(d) Bill's is the busiest market in town.
Assesses practical, everyday problem-solving abilities requiring quantitative reasoning. For
example: Given a recipe for making two dozen cookies and an inventory of ingredients
tin rent ly in the house, the examinee may be asked, "How many dozen cookies could be
baked without having to go to the store for more supplies?"
Assesses practical, everyday problem-solving abilities requiring figural reasoning. For
example: A student may be shown a town map and be asked to chart the shortest route
from one place in the town to another.
much more important in adapting to the environment and solving real-world prob-
lems. For example, intelligence tests rarely identify outstanding musical, athletic, or
intrinsic motivation potential. Gardner's relatively independent intelligences were
Assessment of Intelligence 333
Table 10.4 Howard Gardner's multiple intelligences
Intelligence: Description
Linguistic: The ability to use language to express ideas and understand others. Linguistic
intelligence is displayed by lawyers, teachers, orators, writers, and linguists.
Logical-Mathematical: The ability to understand underlying causal systems, inductive and
deductive logic, scientific reasoning, numerical reasoning, and numerical operations.
Logical-mathematical intelligence is displayed by mathematicians, logicians, scientists,
and engineers.
Spatial: The ability to understand, visualize, and manipulate mental images, graphic
representations, or objects in space. Spatial intelligence is displayed by sculptors, painters,
surgeons, architects, and navigators.
Musical: The ability to think musically and rhythmically by hearing, remembering, and
manipulating patterns. Musical intelligence is displayed by musicians of any kind.
Bodily-Kinesthetic: The ability to use one's body to solve complex motor problems through
awareness and control of motor functions. Bodily-kinesthetic intelligence is displayed by
athletes, dancers, actors, and seamstresses.
Interpersonal: The ability to understand and work with other people, read their verbal and
nonverbal communication, be sensitive to the feelings of others, and solve problems of an
interpersonal nature. Interpersonal intelligence is displayed by professional counselors,
salespeople, managers, politicians, and just about anyone else who has to deal with people
problems.
Intrapersonal: The ability to understand oneself; what one can do, can't do, self-motivations,
propensities, and aversions. Intrapersonal intelligence involves metacognition,
self-awareness, and abstract thinking. It relies on self-awareness and is important in
virtually any endeavor.
Naturalist: The ability to discriminate among and classify objects. Naturalist intelligence is
displayed by farmers, botanists, hunters, and chefs.
identified through a process that involved several criteria, including occurrence
across cultures, the effects of localized brain damage, and the distinct history of
exceptional ability.
While Gardner does not dispute the importance of genetics, he clearly points
out that intelligence stems from an interaction between heredity and environment.
For example, consider a case in which two children of equal musical talent are born
into two separate families. The first family values musical talent and expends great
time and effort to cultivate Johnny's burgeoning skills. The second family not only
doesn't value musical talent, but actively punishes its expression whenever possible,
frequently telling the child, "Stop playing with violins and cellos, Jimmy. You'll have
no need of them in your career as a professional counselor!" Certainly, the odds of
developing substantial musical intelligence are in Johnny's favor. Gardner's theory is
thought-provoking and has received much attention in classrooms and schools
around the United States. Unfortunately, there are numerous problems when trying
to measure several of the intelligences, and the empirical support behind the theory
is less than robust.
Some Final Thoughts on the (Practical) Nature of Intelligence
Richard Herrnstein and Charles Murray (1994), in their very controversial book The
Bell Curve: Intelligence and Class Structure in American Life, categorized the theories
proposed by Spearman, Binet, and their contemporaries as classicist. The common
thread to classicist models was the adherence to a unifying factor, g, at the center of
intellectual being. A second broad category proposed by Herrnstein and Murray
comprised the revisionist models. Revisionist theories proposed that there was indeed
a unifying factor, g, at the center of cognitive structure, but g was composed of several
secondary factors (e.g., verbal reasoning, nonverbal reasoning, working memory, processing
speed), each of which contributes to one's total cognitive makeup. Furthermore, re-
visionist models assert that individual clients can have strengths and weaknesses in
each of these processing categories. In the end, it is the combination of these
strengths, weaknesses, or normal capacities that makes up one's total cognitive function
(g). However, various patterns or combinations of cognitive skills, while perhaps
resulting in the same estimate of overall intelligence (g), may lead to very different
results in terms of how problems are solved. For example, when required to write an
extensive report about some social phenomenon, a client with excellent verbal rea-
soning skills and poor nonverbal reasoning skills may be able to excel at the task,
while a client with the identical overall IQ (g), but with low verbal reasoning skills
and outstanding nonverbal reasoning skills, may struggle mightily. Some well-known
revisionist models include those of Vernon, Horn-Cattell, and Guilford. To this day,
psychometricians and statisticians continue to debate whether intelligence can be
meaningfully represented by a single global score (the classicist position) or is global
with multidimensional refinements (the revisionist position).
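The point about identical overall scores masking different profiles can be illustrated with a minimal sketch. The scores below are hypothetical (not drawn from any real client or test), and the overall score is approximated as a simple unweighted mean of the index scores, which is an assumption made only for illustration; published tests derive composite scores from norm tables rather than simple averages. The 10-point cutoff for labeling a strength or weakness is likewise an arbitrary illustrative value.

```python
from statistics import mean

# Hypothetical index scores on a mean-100, SD-15 scale (illustrative only).
client_a = {"verbal": 125, "nonverbal": 85,
            "working_memory": 100, "processing_speed": 100}
client_b = {"verbal": 85, "nonverbal": 125,
            "working_memory": 100, "processing_speed": 100}


def composite(profile):
    """Simplified stand-in for an overall score: unweighted mean of the indexes."""
    return mean(profile.values())


def describe(profile, cutoff=10):
    """Label each index relative to the client's own composite.

    The cutoff is an arbitrary illustrative value, not a clinical standard.
    """
    overall = composite(profile)
    labels = {}
    for name, score in profile.items():
        if score - overall >= cutoff:
            labels[name] = "strength"
        elif overall - score >= cutoff:
            labels[name] = "weakness"
        else:
            labels[name] = "typical"
    return labels


if __name__ == "__main__":
    for label, profile in (("Client A", client_a), ("Client B", client_b)):
        print(label, "composite:", composite(profile), describe(profile))
```

Under these assumptions both clients receive the same composite, yet Client A shows a verbal strength and a nonverbal weakness while Client B shows the reverse, which is the pattern the report-writing example describes.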
Herrnstein and Murray (1994) referred to a third movement within the field of
intellectual assessment as the radicals, and pointed to Gardner's theory of multiple
intelligences as a prime example. Gardner rejects the existence of g and lauds the
independence of the intelligences he has identified. While the theory is appealing in its
own right and widely used in education, little empirical support exists for the
independent nature of the identified intelligences.
While development, classification, and description of intelligence tests is cer-