Michael Thomas Flanagan's Java Scientific Library

PCA_Analysis:     An Application Performing a Principal Component Analysis

     

Last update: 4 December 2010


This application performs a basic Principal Components Analysis with a Varimax Rotation on data provided by the user.

The data may be supplied as numerical scores or alphabetic scores which are the responses of several individuals (referred to as persons on this page) to several questions (referred to as items on this page).
Alphabetic scores will be converted to numerical scores as described below.
Options are offered for handling missing responses, also described below.

The application performs the following:
This application illustrates the use of methods in the PCA class.

INSTALLING AND RUNNING THE APPLICATION PCA_Analysis

This page contains details of:

INSTALLING PCA_Analysis

The Java Development Kit Platform 6 must be installed on your computer or network.
This application creates an instance of, and calls methods from, the PCA class facilitating an easily performed basic Principal Component Analysis. The PCA class is part of the Michael Thomas Flanagan Library. The Michael Thomas Flanagan Library file, flanagan.jar, must be downloaded and installed in the appropriate directory (see Michael Thomas Flanagan Library Main Page).

Download the source file PCA_Analysis.java into an appropriate folder.
Compile PCA_Analysis, e.g. on a PC with a Microsoft Windows XP operating system:

PREPARING THE DATA FILE

Prepare the input data file. The data file may be stored in any directory. It is not necessary to store it in the same directory as PCA_Analysis but such storage may be convenient.
The data file must be a text file in one of the two following formats:

Format one: scores entered as item responses by an individual person, entered as a row

     data title
     number of items
     number of persons
     item names (one word each), as a row, e.g.    item1      item2  . . .   itemn
     response of person 1 to item 1      response of person 1 to item 2  . . .   response of person 1 to the nth item (all on one line)
     response of person 2 to item 1      response of person 2 to item 2  . . .   response of person 2 to the nth item (all on one line)
     . . . .
     response of person m to item 1      response of person m to item 2  . . .   response of person m to the nth item (all on one line)

where there are n items and m persons.

The item names must be single words. Each response may be a floating point number, an integer number, a single word or a single letter. The item names and responses must be separated from any preceding and/or any following number or word by a single space or several spaces, a comma, a tab, a semicolon, colon or end of line. See Response Representation (below) for a detailed description of allowed response representations. See Missing Response (below under Response Representation) for a detailed description of how a missing response may be represented. All responses for an individual person must be on the same line.
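
The format-one layout described above can be illustrated with a minimal sketch that reads such a file into item names and a persons-by-items array of response strings. This is not the reader used by the PCA class: the class name FormatOneReader and its methods are hypothetical, and the sketch assumes that missing responses, if any, are represented by a word and never by a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of flanagan.jar: parses format-one text.
class FormatOneReader {
    String title;
    String[] itemNames;
    String[][] responses; // responses[person][item], kept as raw strings

    // Separators listed on this page: spaces, comma, tab, semicolon, colon.
    // Assumption: a missing response is a word (e.g. abs), never a blank.
    static String[] split(String line) {
        return line.trim().split("[ ,\\t;:]+");
    }

    static FormatOneReader parse(List<String> lines) {
        // Ignore blank lines so that only content lines are counted.
        List<String> content = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) content.add(line);
        }
        FormatOneReader data = new FormatOneReader();
        data.title = content.get(0).trim();
        int nItems = Integer.parseInt(content.get(1).trim());
        int nPersons = Integer.parseInt(content.get(2).trim());
        data.itemNames = split(content.get(3));
        if (data.itemNames.length != nItems) {
            throw new IllegalArgumentException("expected " + nItems + " item names");
        }
        data.responses = new String[nPersons][];
        for (int p = 0; p < nPersons; p++) {
            String[] row = split(content.get(4 + p));
            if (row.length != nItems) {
                throw new IllegalArgumentException("person " + (p + 1) + " has " + row.length + " responses");
            }
            data.responses[p] = row;
        }
        return data;
    }

    public static void main(String[] args) {
        // Illustrative data only, not taken from the example files.
        List<String> lines = Arrays.asList(
            "Tiny example", "3", "2",
            "item1 item2 item3",
            "31, 4, true",
            "44; 2; abs");
        FormatOneReader d = parse(lines);
        System.out.println(d.title + ": " + d.responses.length + " persons, "
            + d.itemNames.length + " items");
    }
}
```

Checking each row length against the stated number of items catches the most common preparation error, a response that has been dropped from or added to a line.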

or

Format two: scores entered as responses to an individual item by the persons responding, entered as a row

     data title
     number of items
     number of persons
     item names (one word each), as a row, e.g.    item1      item2  . . .   itemn
     response to item 1 by person 1      response to item 1 by person 2  . . .   response to item 1 by the mth person (all on one line)
     response to item 2 by person 1      response to item 2 by person 2  . . .   response to item 2 by the mth person (all on one line)
     . . . .
     response to item n by person 1      response to item n by person 2  . . .   response to item n by the mth person (all on one line)

where there are n items and m persons.

The item names must be single words. Each response may be a floating point number, an integer number, a single word or a single letter. The item names and responses must be separated from any preceding and/or any following number or word by a single space or several spaces, a comma, a tab, a semicolon, colon or end of line. See Response Representation (below) for a detailed description of allowed response representations. See Missing Response (below under Response Representation) for a detailed description of how a missing response may be represented. All responses for an item must be on the same line.
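
The two formats hold the same data with rows and columns exchanged. As a minimal sketch of that relationship, the hypothetical helper below (not part of the library) converts responses stored one item per row, as in format two, into responses stored one person per row, as in format one.

```java
// Hypothetical helper, not part of flanagan.jar.
class FormatTranspose {
    // byItem[i][p] is the response to item i by person p (format two).
    // The result holds byPerson[p][i], the same response (format one).
    static String[][] itemsToPersons(String[][] byItem) {
        int nItems = byItem.length;
        int nPersons = byItem[0].length;
        String[][] byPerson = new String[nPersons][nItems];
        for (int i = 0; i < nItems; i++) {
            if (byItem[i].length != nPersons) {
                throw new IllegalArgumentException("item " + (i + 1) + " has the wrong number of responses");
            }
            for (int p = 0; p < nPersons; p++) {
                byPerson[p][i] = byItem[i][p];
            }
        }
        return byPerson;
    }

    public static void main(String[] args) {
        // Illustrative data: 2 items answered by 3 persons.
        String[][] byItem = {
            {"31", "44", "38"},
            {"yes", "no", "yes"}
        };
        String[][] byPerson = itemsToPersons(byItem);
        for (String[] row : byPerson) {
            System.out.println(String.join(" ", row));
        }
    }
}
```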

Example data files may be found on Example Programs

RESPONSE REPRESENTATION

Responses
Responses may be entered as:
The response input methods are case insensitive. Response types may be mixed within a data file but should be of the same type within an individual item. See Example Programs for examples of mixed type data files.
Non-numerical representations of responses are converted to numerical values as follows:

Missing Responses
A missing response may be represented by any word or letter, preferably a word, e.g. abs or missing, that is not listed above as a valid response. If a missing response is represented by a word, e.g. abs or missing, any of the separators used to separate the responses in the data file, i.e. a space, a comma, a tab, a semicolon, a colon or an end of line, may be used. If a missing response is represented by a space, that space MUST be preceded and followed by a comma, a tab, a semicolon, a colon or an end of line, i.e. in this case a space cannot also be used as a separator.
See box three, box four and box five (below) for the options on dealing with a missing response in the analysis.
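
The rule for a missing response represented by a space can be illustrated with a minimal sketch. The class BlankMissingSplitter and the marker word it substitutes are hypothetical and are not taken from the library. Because a space cannot act as a separator in this case, the sketch splits only on a comma, tab, semicolon or colon and treats any field that is empty or contains only spaces as missing.

```java
// Hypothetical helper, not part of flanagan.jar.
class BlankMissingSplitter {
    // Marker substituted for a blank field (an arbitrary choice for this sketch).
    static final String MISSING = "missing";

    static String[] split(String line) {
        // A limit of -1 keeps trailing empty fields, so a blank
        // immediately before the end of the line is not lost.
        String[] fields = line.split("[,\\t;:]", -1);
        for (int i = 0; i < fields.length; i++) {
            String field = fields[i].trim();
            fields[i] = field.isEmpty() ? MISSING : field;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Illustrative line: the third response is missing.
        String[] fields = split("3,4, ,5");
        System.out.println(String.join("|", fields)); // prints 3|4|missing|5
    }
}
```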

Example data file: PCA_DataOne.txt, using spaces as separators.
Example data file: PCA_DataTwo.txt, using spaces as separators with missing responses.
Example data file: PCA_DataThree.txt, using commas as separators and spaces for missing responses.
The data files are described in detail in Example Programs.

RUNNING PCA_Analysis

Run PCA_Analysis, e.g. on a PC with a Microsoft operating system:
A series of information or dialogue boxes will then appear sequentially. All you need to do is respond to each box in turn. Pressing the ‘enter’ key will close the box selecting the default option, i.e. the button with the bold outline or the value or text in the text box.

Box one: Information box
The first box is an information message identifying the program that you have started. Click on the OK button when you have read the message.

Box two: Identifying data format
The second box is a dialogue box asking whether the data in the input file is organised as
scores entered as item responses by an individual person, entered as a row (format one above)
or
scores entered as responses to an individual item by the persons responding, entered as a row (format two above)
Click on the appropriate button

Box three: Missing responses: replacement option
This dialogue box requests you to select an option for dealing with missing responses. The options are:
Click on the appropriate button
See also box four and box five

Box four: Missing responses: person deletion options
This input box requests that you enter the person deletion percentage (pdpc), i.e. the percentage of missing responses in an individual person's responses that is tolerated. If that person has a greater percentage of missing responses, that person will be deleted from the analysis, e.g.
A value of 0.0 will lead to a person being deleted on missing a single response.
A value of 50.0 will lead to a person being deleted on missing more than 50% of the responses.
A value of 100.0 will ensure that a person is only deleted if that person fails to make any responses.
See also box three and box five

Box five: Missing responses: item deletion options
This input box requests that you enter the item deletion percentage (idpc), i.e. the percentage of missing responses to an individual item that is tolerated. If that item has a greater percentage of missing responses, that item will be deleted from the analysis, e.g.
A value of 0.0 will lead to an item being deleted on one person missing a response to that item.
A value of 50.0 will lead to an item being deleted on more than 50% of individual persons failing to respond to that item.
A value of 100.0 will ensure that an item is only deleted if no persons respond to that item.
See also box three and box four
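
The deletion rules of box four and box five can be sketched as follows. This is a minimal illustration rather than the library's implementation, and the class DeletionRules and its methods are hypothetical names. The sketch deletes a person or an item when its percentage of missing responses is greater than the tolerated percentage. It also deletes when every response is missing, which is an assumption made here so that a value of 100.0 behaves as described above.

```java
// Hypothetical helper, not part of flanagan.jar.
class DeletionRules {
    // Percentage of entries flagged as missing.
    static double missingPercent(boolean[] missing) {
        int count = 0;
        for (boolean m : missing) {
            if (m) count++;
        }
        return 100.0 * count / missing.length;
    }

    // True when the person or item should be deleted from the analysis.
    static boolean isDeleted(boolean[] missing, double toleratedPercent) {
        boolean allMissing = true;
        for (boolean m : missing) {
            if (!m) allMissing = false;
        }
        // Assumption: nothing to analyse if every response is missing.
        if (allMissing) return true;
        return missingPercent(missing) > toleratedPercent;
    }

    public static void main(String[] args) {
        // missing[person][item]: true marks a missing response (illustrative data).
        boolean[][] missing = {
            {false, false, false, false},
            {false, true,  false, false},
            {true,  true,  true,  false}
        };
        double pdpc = 50.0;
        for (int p = 0; p < missing.length; p++) {
            System.out.println("person " + (p + 1) + ": "
                + missingPercent(missing[p]) + "% missing, deleted = "
                + isDeleted(missing[p], pdpc));
        }
        // An item is checked in the same way, using its column of the matrix.
        double idpc = 50.0;
        for (int i = 0; i < missing[0].length; i++) {
            boolean[] column = new boolean[missing.length];
            for (int p = 0; p < missing.length; p++) {
                column[p] = missing[p][i];
            }
            System.out.println("item " + (i + 1) + ": deleted = " + isDeleted(column, idpc));
        }
    }
}
```

Note that a percentage exactly equal to the tolerated value is kept: with 50.0, a person missing exactly half of the responses is not deleted.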

Box six: Selection of the input data file
This file selection window allows you to select the data file you wish to analyse. This window opens displaying the contents of the current directory, i.e. the directory in which you have stored PCA_Analysis.java, but you can use this window to browse any directory on your computer if you have not stored your data files in the current directory.

Box seven: Selection of the output file type
This dialogue box requests you to select the type of output file that you require. The options are:
The output file contains the following PCA analysis results:

See box eight for the output file names.
See Example Programs for an example of an output file.

Box eight: Request for the output file name
This input box requests you to enter the name of the output file. The default name is the name of the input file with Analysis added as a suffix, e.g. an input file named PCA_DataOne.txt gives a default name for the output file as PCA_DataOneAnalysis.txt.
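
The default naming rule can be illustrated with a short sketch. The class OutputNames is a hypothetical name, not part of the library. The sketch assumes that the suffix is inserted before the file extension and that the extension is kept unchanged, as in the example above; the extension actually used may depend on the output file type chosen in box seven.

```java
// Hypothetical helper, not part of flanagan.jar.
class OutputNames {
    // inputName is assumed to be a file name without a directory path.
    static String defaultOutputName(String inputName) {
        int dot = inputName.lastIndexOf('.');
        if (dot <= 0) {
            // No extension found: append the suffix to the whole name.
            return inputName + "Analysis";
        }
        return inputName.substring(0, dot) + "Analysis" + inputName.substring(dot);
    }

    public static void main(String[] args) {
        System.out.println(defaultOutputName("PCA_DataOne.txt")); // prints PCA_DataOneAnalysis.txt
    }
}
```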

The program may take several seconds, after the output file name has been entered, before the next graph and dialogue box are displayed.

Box nine: Scree Plot and Varimax Rotation
This box is accompanied by a Scree plot graph. The Scree plot graph displays
The dialogue box lists the ordered eigenvalues with the corresponding Monte Carlo simulation means and percentiles in parentheses.
The dialogue box requests the number of factors you wish to extract for a varimax rotation procedure. The default value is based on a comparison of the data eigenvalues with the Monte Carlo simulation.
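
This page does not state exactly how the comparison produces the default value. One common rule, often called parallel analysis, counts the leading data eigenvalues that are greater than the corresponding Monte Carlo percentile. The sketch below implements that rule as an assumption; it is not confirmed to be the rule used by the PCA class, the class FactorCountSketch is a hypothetical name, and the numbers are invented for illustration.

```java
// Hypothetical helper, not part of flanagan.jar.
class FactorCountSketch {
    // eigenvalues: data eigenvalues in descending order.
    // simulated:   Monte Carlo percentile for each ordered position.
    static int suggestedFactors(double[] eigenvalues, double[] simulated) {
        int n = 0;
        // Stop at the first eigenvalue that does not exceed its simulated counterpart.
        while (n < eigenvalues.length && eigenvalues[n] > simulated[n]) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Invented values, not output of the application.
        double[] eigenvalues = {3.1, 1.6, 0.9, 0.4};
        double[] simulated   = {1.8, 1.4, 1.1, 0.9};
        System.out.println(suggestedFactors(eigenvalues, simulated)); // prints 2
    }
}
```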

Box ten: Information box and closure request
This dialogue box gives the name of the output file.
It also asks if you want to close the Scree plot graph. If you choose to leave the Scree Plot displayed you need to end the program later by clicking on the close icon (white cross on red background in the top right hand corner) on the plot, or if using a Microsoft operating system, typing Control C in the command prompt window.

Clicking on the NO button ends the program. The output files are created in the directory in which you compiled PCA_Analysis unless you included an alternative path in a supplied output file name.

EXAMPLE PROGRAMS

No Missing Responses
Example Program Data File
The example data file has the following lines:

    a title [PCA Example Data One]
    responses to 7 items [7]
    responses from 23 individuals [23]
    a row of the item names, simply called item1 ...., in this example
    23 rows of the responses of each individual person to the 7 items
        The responses to item1 are within an integer range 30 to 45 inclusive
        The responses to item2 are within an integer range 1 to 5 inclusive
        The responses to item3 are true or false
        The responses to item4 are A, B, C, D or E
        The responses to item5 are either 1 or 2
        The responses to item6 are either yes or no
        The responses to item7 are within a floating point range -2.6 to 8.3 inclusive

This data file may be accessed through PCA_DataOne.txt.

Example Program Output File
The output file, produced on running the PCA_Analysis application with the above input data, PCA_DataOne.txt, may be accessed through PCA_DataOneAnalysis.txt

With Missing Responses
The data file PCA_DataTwo.txt contains missing data indicated by the word abs or the word missing.

The output file, produced on running the PCA_Analysis application with the input data, PCA_DataTwo.txt, and with: may be accessed through PCA_DataTwoAnalysis.txt




BIBLIOGRAPHY

Cohen, L., Manion, L. & Morrison, K, A. (2008), Research Methods in Education, 6th Edition, Routledge, London & New York, Chapter twenty five, Multidimensional measurement and factor analysis, pp 559-585.

Harman, H. H. (1976), Modern Factor Analysis, 3rd Edition Revised, The University of Chicago Press, Chicago & London.

See also PCA class, the class underpinning this application, for a more detailed description of the methods called by this application.



CLASSES IN THIS LIBRARY USED BY THIS APPLICATION

This application uses the following classes in this library:




This page was prepared by Dr Michael Thomas Flanagan