This application performs a basic Principal Components Analysis with a Varimax Rotation on data provided by the user.
The data may be supplied as numerical scores or alphabetic scores which are the responses of several individuals (referred to as persons on this page) to several questions (referred to as items on this page).
Alphabetic scores will be converted to numerical scores as described below.
Options are offered for handling missing responses, also described below.
The application performs the following:
Calculation of the eigenvalues and their percentage of the total.
A parallel Monte Carlo simulation.
Parallel analysis of the eigenvalue means, standard deviations and percentiles.
Joint display of Scree plots for both the entered data and the simulated data.
Presentation of the covariance matrix
Presentation of the correlation matrix
Presentation of the eigenvectors
Calculation of the loading factors with corresponding eigenvalues and their proportions and cumulative percentages
Calculation of the Varimax rotated extracted loading factors with corresponding rotated eigenvalues and their proportions and cumulative percentages
This application illustrates the use of methods in the PCA class.
INSTALLING AND RUNNING THE APPLICATION PCA_Analysis
The Java Development Kit Platform 6 must be installed on your computer or network.
This application creates an instance of, and calls methods from, the PCA class facilitating an easily performed basic Principal Component Analysis. The PCA class is part of the Michael Thomas Flanagan Library. The Michael Thomas Flanagan Library file, flanagan.jar, must be downloaded and installed in the appropriate directory (see Michael Thomas Flanagan Library Main Page).
Download the source file PCA_Analysis.java into an appropriate folder.
Compile PCA_Analysis, e.g. on a PC with a Microsoft Windows XP operating system:
Open up the Command Prompt Window
Change to the directory in which you have stored PCA_Analysis.java, e.g. type cd c:\PCA_Analyses where PCA_Analyses is the name of that folder on the C drive.
Compile, i.e. type javac PCA_Analysis.java followed by a return
PREPARING THE DATA FILE
Prepare the input data file. The data file may be stored in any directory. It is not necessary to store it in the same directory as PCA_Analysis but such storage may be convenient.
The data file must be a text file in one of the two following formats:
Format one: scores entered as item responses by an individual person, entered as a row
data title
number of items
number of persons
item names (one word each), as a row, e.g. item1 item2 . . . itemn
response of person 1 to item 1 response of person 1 to item 2 . . . response of person 1 to the nth item (all on one line)
response of person 2 to item 1 response of person 2 to item 2 . . . response of person 2 to the nth item (all on one line) . . . .
response of person m to item 1 response of person m to item 2 . . . response of person m to the nth item (all on one line)
where there are n items and m persons.
The item names must be single words. Each response may be a floating point number, an integer number, a single word or a single letter. The item names and responses must be separated from any preceding and/or any following number or word by a single space or several spaces, a comma, a tab, a semicolon, colon or end of line. See Response Representation (below) for a detailed description of allowed response representations. See Missing Response (below under Response Representation) for a detailed description of how a missing response may be represented. All responses for an individual person must be on the same line.
or
Format two: scores entered as responses to an individual item by the persons responding, entered as a row
data title
number of items
number of persons
item names (one word each), as a row, e.g. item1 item2 . . . itemn
response to item 1 by person 1 response to item 1 by person 2 . . . response to item 1 by the mth person (all on one line)
response to item 2 by person 1 response to item 2 by person 2 . . . response to item 2 by the mth person (all on one line) . . . .
response to item n by person 1 response to item n by person 2 . . . response to item n by the mth person (all on one line)
where there are n items and m persons.
The item names must be single words. Each response may be a floating point number, an integer number, a single word or a single letter. The item names and responses must be separated from any preceding and/or any following number or word by a single space or several spaces, a comma, a tab, a semicolon, colon or end of line. See Response Representation (below) for a detailed description of allowed response representations. See Missing Response (below under Response Representation) for a detailed description of how a missing response may be represented. All responses for an item must be on the same line.
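The separator rule above can be sketched in Java as a single regular-expression split. This is only an illustrative sketch, not the library's own parser: the class and method names (LineTokenizer, tokens) are hypothetical, and it assumes that missing responses are written as words rather than as blank spaces, so that runs of separators can be merged.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class LineTokenizer {

    // Splits one line of the data file on the separators listed above:
    // one or more spaces, tabs, commas, semicolons or colons.
    // Assumption: missing responses are marked by a word (e.g. "abs"),
    // not by a blank space, so adjacent separators are merged.
    static String[] tokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("[\\s,;:]+");
    }

    public static void main(String[] args) {
        // A row of item names and a row of mixed-type responses.
        System.out.println(Arrays.toString(tokens("item1 item2, item3;item4")));
        System.out.println(Arrays.toString(tokens("34\t2,true : B")));
    }
}
```

The same method can be applied to every line of either format; only the interpretation of a row (one person or one item) differs.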
Response Representation
floating point numbers, e.g. 2.34, 5.8, -8.91, 0.635 . . . [Use the E format for very large and very small numbers, e.g. 1.56E+17 for 1.56x10^17, -7.854E-09 for -7.854x10^-9]
single letters, e.g. A, B, C, D . . . , a, b, c, d, . . .
true or false, True or False, TRUE or FALSE
yes or no, Yes or No, YES or NO
The response input methods are case insensitive. Response types may be mixed within a data file but should be of the same type within an individual item. See Example Programs for examples of mixed type data files.
Non-numerical representations of responses are converted to numerical values as follows:
NO, No, and no —> -1.0
N and n, if part of a YN dichotomous pair [N Y, n y, N y or n Y] —> -1.0 otherwise —> 14.0
YES, Yes, or yes —> +1.0
Y and y, if part of a YN dichotomous pair [N Y, n y, N y or n Y] —> +1.0 otherwise —> 21.0
FALSE, False, and false —> -1.0
TRUE, True, and true —> +1.0
A and a —> 1.0, B and b —> 2.0, C and c —> 3.0, . . . etc.
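The conversion rules above might be sketched in Java as follows. This is not the library's own code: the class ResponseConverter, the method toNumeric and the dichotomousYN flag are hypothetical names, and mapping a single letter to its position in the alphabet is an assumption extrapolated from the A, B, C rule.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class ResponseConverter {

    // Converts one response token to a number.
    // dichotomousYN should be true when the item's responses are only the
    // letters Y and N, so that they are scored +1.0 and -1.0.
    static double toNumeric(String token, boolean dichotomousYN) {
        // The input is case insensitive, so compare in lower case.
        String t = token.trim().toLowerCase(Locale.ROOT);
        if (t.equals("yes") || t.equals("true")) return 1.0;
        if (t.equals("no") || t.equals("false")) return -1.0;
        if (t.length() == 1 && t.charAt(0) >= 'a' && t.charAt(0) <= 'z') {
            char c = t.charAt(0);
            if (dichotomousYN && c == 'y') return 1.0;
            if (dichotomousYN && c == 'n') return -1.0;
            // Assumption: a letter maps to its alphabet position (a = 1, b = 2, ...).
            return c - 'a' + 1;
        }
        // Anything else is read as a number, including E format;
        // a NumberFormatException is thrown if it is not a number.
        return Double.parseDouble(t);
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("YES", false));
        System.out.println(toNumeric("c", false));
        System.out.println(toNumeric("N", true));
        System.out.println(toNumeric("1.56E+17", false));
    }
}
```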
Missing Responses
A missing response may be represented by any word or letter that is not listed above as a valid response, preferably a word, e.g. abs or missing. If a missing response is represented by a word, any of the separators used in the data file (space, comma, tab, semicolon, colon or end of line) may be used. If a missing response is represented by a space, that space MUST be preceded and followed by a comma, a tab, a semicolon, a colon or an end of line, i.e. in this case a space cannot also be used as a separator.
See box three, box four and box five (below) for the options on dealing with a missing response in the analysis.
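As a rough illustration of this rule, the Java sketch below flags a token as missing when it is blank or is neither a number nor one of the valid representations listed earlier. The class and method names (MissingResponseCheck, isMissing) are hypothetical and the logic is an assumption about how such a check could work, not the library's implementation. Splitting with a limit of -1 keeps the blank fields of a comma-separated line.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class MissingResponseCheck {

    // Returns true if the token cannot be read as a valid response.
    static boolean isMissing(String token) {
        String t = token.trim().toLowerCase(Locale.ROOT);
        if (t.isEmpty()) return true; // a space between two separators
        if (t.equals("yes") || t.equals("no")
                || t.equals("true") || t.equals("false")) return false;
        // Any single letter a to z is a valid response.
        if (t.length() == 1 && t.charAt(0) >= 'a' && t.charAt(0) <= 'z') return false;
        try {
            Double.parseDouble(t);
            return false; // a valid number
        } catch (NumberFormatException e) {
            return true;  // an unrecognised word, e.g. abs or missing
        }
    }

    public static void main(String[] args) {
        // Comma-separated row in which the third response is a blank space.
        String[] fields = "34,2, ,B".split(",", -1);
        for (String f : fields) {
            System.out.println("[" + f + "] missing: " + isMissing(f));
        }
    }
}
```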
Example data file: PCA_DataOne.txt, using spaces as separators.
Example data file: PCA_DataTwo.txt, using spaces as separators with missing responses.
Example data file: PCA_DataThree.txt, using commas as separators and spaces for missing responses.
The data files are described in detail in Example Programs.
RUNNING PCA_Analysis
Run PCA_Analysis, e.g. on a PC with a Microsoft operating system:
Open up the Command Prompt Window
Change to the directory in which you have stored PCA_Analysis.java, e.g. type cd c:\PCA_Analyses where PCA_Analyses is the name of that folder on the C drive.
Run, i.e. type java PCA_Analysis followed by a return
A series of information or dialogue boxes will then appear sequentially. All you need to do is respond to each box in turn. Pressing the ‘enter’ key will close the box and select the default option, i.e. the button with the bold outline or the value or text in the text box.
Box one: Information box
The first box is an information message identifying the Program that you have initiated. Click on the OK button when you have read the message.
Box two: Identifying data format
The second box is a dialogue box asking whether the data in the input file is organised as
scores entered as item responses by an individual person, entered as a row (format one above)
or
scores entered as responses to an individual item by the persons responding, entered as a row (format two above)
Click on the appropriate button
Box three: Missing responses: replacement option
This dialogue box requests you to select an option for dealing with missing responses. The options are:
1. the missing response is replaced by zero
2. the missing response is replaced by the mean of that person's responses
3. the missing response is replaced by the mean of the responses to that item. This is the default option
4. the missing response is replaced by the overall mean
5. the missing response is replaced by a user supplied score for each missing response. A value will be requested, via a dialogue box, each time a missing response is encountered as the data is processed
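The default option (option 3, replacement by the item mean) can be sketched in Java as below. This is a minimal sketch under stated assumptions, not the library's code: the class name ItemMeanReplacement is hypothetical, scores are held as one row per person, a missing response is marked by NaN, and an item with no responses at all falls back to zero.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class ItemMeanReplacement {

    // scores[i][j] is the response of person i to item j; NaN marks a missing response.
    static double[][] replaceWithItemMean(double[][] scores) {
        int nPersons = scores.length;
        int nItems = scores[0].length;
        double[][] filled = new double[nPersons][nItems];
        for (int j = 0; j < nItems; j++) {
            // Mean of the responses actually given to item j.
            double sum = 0.0;
            int count = 0;
            for (int i = 0; i < nPersons; i++) {
                if (!Double.isNaN(scores[i][j])) {
                    sum += scores[i][j];
                    count++;
                }
            }
            // Assumption: use zero if nobody responded to this item.
            double mean = count > 0 ? sum / count : 0.0;
            for (int i = 0; i < nPersons; i++) {
                filled[i][j] = Double.isNaN(scores[i][j]) ? mean : scores[i][j];
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        double[][] scores = {
            {1.0, 4.0},
            {Double.NaN, 6.0},
            {3.0, Double.NaN}
        };
        System.out.println(Arrays.deepToString(replaceWithItemMean(scores)));
    }
}
```

Options 2 and 4 differ only in which responses are averaged: the same person's row or the whole data set.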
Box four: Missing responses: person deletion options
This input box requests you enter the person deletion percentage (pdpc), i.e. the percentage of missing responses in an individual person's responses that is tolerated. If that person has a greater percentage of missing responses that person will be deleted from the analysis, e.g.
A value of 0.0 will lead to a person being deleted on missing a single response.
A value of 50.0 will lead to a person being deleted on missing more than 50% of the responses.
A value of 100.0 will ensure that a person is only deleted if that person fails to make any responses.
See also box three and box five
Box five: Missing responses: item deletion options
This input box requests you enter the item deletion percentage (idpc), i.e. the percentage of missing responses to an individual item that is tolerated. If that item has a greater percentage of missing responses that item will be deleted from the analysis, e.g.
A value of 0.0 will lead to an item being deleted on one person missing a response to that item.
A value of 50.0 will lead to an item being deleted on more than 50% of individual persons failing to respond to that item.
A value of 100.0 will ensure that an item is only deleted if no persons respond to that item.
See also box three and box four
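The person and item deletion rules of box four and box five can be sketched together in Java. This is an illustration only: the class DeletionFilter and its methods are hypothetical, a missing response is marked by NaN, and the two tests are applied independently to the original data, which is an assumption about the order of processing. A person or item with no responses at all is always dropped, matching the 100.0 examples above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class DeletionFilter {

    // Percentage of missing (NaN) values in an array of responses.
    static double missingPercentage(double[] responses) {
        int missing = 0;
        for (double r : responses) {
            if (Double.isNaN(r)) missing++;
        }
        return 100.0 * missing / responses.length;
    }

    // keep[i] is false if person i exceeds the tolerated percentage pdpc.
    static boolean[] personsToKeep(double[][] scores, double pdpc) {
        boolean[] keep = new boolean[scores.length];
        for (int i = 0; i < scores.length; i++) {
            double pct = missingPercentage(scores[i]);
            keep[i] = pct <= pdpc && pct < 100.0;
        }
        return keep;
    }

    // keep[j] is false if item j exceeds the tolerated percentage idpc.
    static boolean[] itemsToKeep(double[][] scores, double idpc) {
        int nItems = scores[0].length;
        boolean[] keep = new boolean[nItems];
        for (int j = 0; j < nItems; j++) {
            double[] column = new double[scores.length];
            for (int i = 0; i < scores.length; i++) {
                column[i] = scores[i][j];
            }
            double pct = missingPercentage(column);
            keep[j] = pct <= idpc && pct < 100.0;
        }
        return keep;
    }

    public static void main(String[] args) {
        double x = Double.NaN; // marks a missing response
        double[][] scores = {
            {1, 2, 3, 4},
            {x, 2, 3, 4},
            {x, x, x, 4},
            {x, x, x, x}
        };
        System.out.println(Arrays.toString(personsToKeep(scores, 50.0)));
        System.out.println(Arrays.toString(itemsToKeep(scores, 50.0)));
    }
}
```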
Box six: Selection of the input data file
This file selection window allows you to select the data file you wish to analyse.
This window opens displaying the contents of the current directory, i.e. the directory in which you have stored PCA_Analysis.java, but you can use this window to browse any directory on your computer if you have not stored your data files in the current directory.
Box seven: Selection of the output file type
This dialogue box requests you to select the type of output file that you require. The options are:
Text File (.txt)
Excel Readable File (.xls)
This file can be read by Microsoft Excel as if it were an Excel file. Excel will nonetheless ask you to confirm that you do wish Excel to read this file.
The output file contains the following PCA analysis results:
Title
Name of input file if data read from a text file
Time and date of program execution
Eigenvalues
Ordered eigenvalues
Eigenvalues as a percentage of the total
Cumulative percentage of the eigenvalues
Components with eigenvalues greater than or equal to unity
Components with eigenvalues greater than parallel analysis mean
Components with eigenvalues greater than parallel analysis percentile
Parallel analysis eigenvalue means
Parallel analysis eigenvalue standard deviations
Parallel analysis eigenvalue percentiles
Covariance matrix
Correlation matrix
Partial correlation matrix
Eigenvectors
Loading factors with corresponding eigenvalues and their proportions and cumulative percentages
Rotated extracted loading factors with corresponding rotated eigenvalues and their proportions and cumulative percentages
The value of the overall Kaiser-Meyer-Olkin [KMO] statistic
The values of the individual item KMOs
The value of the Bartlett Sphericity Test Chi-Square
The value of the Bartlett Sphericity Test probability
The value of the Bartlett Sphericity Test degrees of freedom
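The eigenvalue entries in this list are simple functions of the eigenvalues themselves. The Java sketch below shows one way they could be derived; it is not the library's code, the class EigenvalueSummary and its methods are hypothetical, and the eigenvalues in main are made-up illustrative values, not output from the application.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class EigenvalueSummary {

    // Returns a copy of the eigenvalues sorted from largest to smallest.
    static double[] orderedDescending(double[] eigenvalues) {
        double[] sorted = eigenvalues.clone();
        Arrays.sort(sorted);
        for (int i = 0, j = sorted.length - 1; i < j; i++, j--) {
            double tmp = sorted[i];
            sorted[i] = sorted[j];
            sorted[j] = tmp;
        }
        return sorted;
    }

    // Each eigenvalue as a percentage of the sum of all eigenvalues.
    static double[] percentages(double[] eigenvalues) {
        double total = 0.0;
        for (double e : eigenvalues) total += e;
        double[] pct = new double[eigenvalues.length];
        for (int i = 0; i < pct.length; i++) {
            pct[i] = 100.0 * eigenvalues[i] / total;
        }
        return pct;
    }

    // Running total of the values, e.g. cumulative percentages.
    static double[] cumulative(double[] values) {
        double[] cum = new double[values.length];
        double running = 0.0;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            cum[i] = running;
        }
        return cum;
    }

    // Number of components with an eigenvalue greater than or equal to unity.
    static int countAtLeastUnity(double[] eigenvalues) {
        int n = 0;
        for (double e : eigenvalues) {
            if (e >= 1.0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        double[] ordered = orderedDescending(new double[]{0.6, 2.0, 0.4, 1.0});
        double[] pct = percentages(ordered);
        System.out.println(Arrays.toString(ordered));
        System.out.println(Arrays.toString(pct));
        System.out.println(Arrays.toString(cumulative(pct)));
        System.out.println(countAtLeastUnity(ordered));
    }
}
```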
Box eight: Request for the output file name
This input box requests you to enter the name of the output file. The default name is the name of the input file with Analysis added as a suffix, e.g. an input file named PCA_DataOne.txt gives a default name for the output file as PCA_DataOneAnalysis.txt.
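The naming rule can be illustrated with a short Java sketch. The class and method names are hypothetical, and taking the extension from the output type chosen in box seven is an assumption.

```java
// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class OutputFileName {

    // Inserts "Analysis" between the input file's stem and the extension.
    // extension is e.g. ".txt" or ".xls" (assumed to follow the output type).
    static String defaultOutputName(String inputName, String extension) {
        int dot = inputName.lastIndexOf('.');
        String stem = dot > 0 ? inputName.substring(0, dot) : inputName;
        return stem + "Analysis" + extension;
    }

    public static void main(String[] args) {
        System.out.println(defaultOutputName("PCA_DataOne.txt", ".txt"));
    }
}
```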
The program may take several seconds, after the output file name has been entered, before the next graph and dialogue box are displayed.
Box nine: Scree Plot and Varimax Rotation
This box is accompanied by a Scree plot graph. The Scree plot graph displays:
Data eigenvalues against component number
Parallel analysis eigenvalue means, against component number, with standard deviation error bars
Parallel analysis eigenvalue percentiles
The dialogue box lists the ordered eigenvalues with the corresponding Monte Carlo simulation means and percentiles in parenthesis.
The dialogue box requests the number of factors you wish to extract for a varimax rotation procedure. The default value is based on a comparison of the data eigenvalues with the Monte Carlo simulation.
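One common way to make such a comparison is to retain the leading components whose data eigenvalue exceeds the corresponding simulated reference value (mean or percentile). The Java sketch below shows that rule; whether the application uses exactly this rule, and which reference it uses, is not stated here, so treat it as an assumption. The class name, method name and the numbers in main are hypothetical.

```java
// Hypothetical helper, not part of the Michael Thomas Flanagan Library.
class ParallelAnalysisRule {

    // Counts the leading components whose data eigenvalue is greater than
    // the simulated reference value, stopping at the first that is not.
    // Both arrays must be ordered by component number.
    static int componentsToRetain(double[] dataEigenvalues, double[] reference) {
        int n = 0;
        while (n < dataEigenvalues.length && dataEigenvalues[n] > reference[n]) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        double[] data = {3.1, 1.6, 0.9, 0.4};
        double[] percentiles = {1.5, 1.2, 1.0, 0.8};
        System.out.println(componentsToRetain(data, percentiles));
    }
}
```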
Box ten: Information box and closure request
This dialogue box gives the name of the output file.
It also asks if you want to close the Scree plot graph.
If you choose to leave the Scree plot displayed you need to end the program later by clicking on the close icon (white cross on red background in the top right hand corner) on the plot or, if using a Microsoft operating system, by typing Control-C in the command prompt window.
Clicking on the NO button ends the program.
The output files are created in the directory in which you compiled PCA_Analysis unless you included an alternative path in a supplied output file name.
EXAMPLE PROGRAMS
No Missing Responses Example Program Data File
The example data file has the following lines:
a title [PCA Example Data One]
responses to 7 items [7]
responses from 23 individuals [23]
a row of the item names, simply called item1 ...., in this example
23 rows of the responses of each individual person to the 7 items
The responses to item1 are within an integer range 30 to 45 inclusive
The responses to item2 are within an integer range 1 to 5 inclusive
The responses to item3 are true or false
The responses to item4 are A, B, C, D or E
The responses to item5 are either 1 or 2
The responses to item6 are either yes or no
The responses to item7 are within a floating point range -2.6 to 8.3 inclusive
Example Program Output File
The output file, produced on running the PCA_Analysis application with the above input data, PCA_DataOne.txt, may be accessed through PCA_DataOneAnalysis.txt
With Missing Responses
The data file PCA_DataTwo.txt contains missing data indicated by the word abs or the word missing.
The output file, produced on running the PCA_Analysis application with the input data, PCA_DataTwo.txt, and with:
the missing response replacement option: replace a missing data point by the item mean
Cohen, L., Manion, L. & Morrison, K, A. (2008), Research Methods in Education, 6th Edition, Routledge, London & New York, Chapter twenty five, Multidimensional measurement and factor analysis, pp 559-585.
Harman, H. H. (1976), Modern Factor Analysis, 3rd Edition Revised, The University of Chicago Press, Chicago & London.
See also PCA class, the class underpinning this application, for a more detailed description of the methods called by this application.
CLASSES IN THIS LIBRARY USED BY THIS APPLICATION
This application uses the following classes in this library: