Implementing a statistical data type in IRIS Explorer

Patrick Craig

NAG Ltd, Wilkinson House,
Jordan Hill Road, Oxford OX2 8DR, UK

Abstract

This paper is aimed at IRIS Explorer users who want to create their own data types. It is intended to be read in conjunction with the information given in the Creating User-defined Data Types chapter of the IRIS Explorer Module Writer's Guide. A new data type for handling statistical data is specified and the procedure for implementing and using the new type is described.

1. Introduction

IRIS Explorer is a powerful scientific visualisation system that is currently aimed at computational physicists, chemists and engineers [1]. The IRIS Explorer data types are therefore designed to hold the data structures used by these workers. However, IRIS Explorer was never intended to be a closed system: as well as creating new modules that use the existing IRIS Explorer types, users can create their own data types to handle unsupported data structures. The work described in this paper is part of an ongoing project to integrate the functionality of the Genstat statistical package into IRIS Explorer. Genstat is a very general statistics program that includes facilities for data management and manipulation, statistical analysis and graphical display [2]. In section 2 the new data type is described and specified as an IRIS Explorer type definition file. Section 3 describes how the type definition file is processed to produce the files required to use the type. In section 4 the automatically generated Application Programming Interface (API) for the type is described and example C code using the API functions is provided.

2. Specification of the type

A data type was required that could hold the basic data structures that are used in the Genstat statistical package. These are variables that consist of an identifier (name) and a one-dimensional array of values. There are three types of variable that differ in the way the values are interpreted. A typical data set is made up of a number of variables of one or more types and each observation in the data set is represented by the values of each variable at a given position in the variable arrays. A data set can therefore be thought of as a variable by observation two-dimensional matrix. In section 2.1 the three types of Genstat variable are described and section 2.2 gives an example of how a data set could be stored in these variable types. The IRIS Explorer type definition file for the data type is described in section 2.3.

2.1 The values stored by the three variable types

variate

The values of a variate are integer or floating point numbers. variates are normally used to store quantitative data.

text

text values are strings that are used as observation identifiers.

factor

factors are used to group points into subsets of the total data set. The values of a factor are therefore restricted to a limited set of possible levels. Each level has an identifier or label.

2.2 An example data set

River           Length  Continent

Nile            6695    Africa
Amazon          6570    S.America
Mississippi     6020    N.America
Yangtze         5471    Asia
Ob              5410    Asia
'Huang He'      4840    Asia
Zaire           4630    Africa
Amur            4415    Asia
Lena            4269    Asia
Mackenzie       4240    N.America
Niger           4183    Africa
Mekong          4180    Asia
Yenisey         4090    Asia
Murray          3717    Oceania
Volga           3688    Europe

This data set shows the 15 longest rivers in the world. The three columns represent three data structures. Each column is headed by its identifier, and each subsequent row represents one observation in the data set, in this case a river. The first column gives the name of the river and could be stored in a text structure called River. The second column would be stored in a variate called Length. The third column is an example of a factor called Continent with six levels.

The river data set could therefore be stored in the following three variables:

Variable 1

Type = Text

Identifier = River

Values = Nile, Amazon, Mississippi, Yangtze, Ob, 'Huang He', 
                Zaire, Amur, Lena, Mackenzie, Niger, 
                Mekong, Yenisey, Murray, Volga



Variable 2

Type = Variate

Identifier = Length

Values = 6695, 6570, 6020, 5471, 5410, 4840, 
                4630, 4415, 4269, 4240, 4183, 
                4180, 4090, 3717, 3688



Variable 3

Type = Factor

Identifier = Continent

Values = 0,1,2,3,3,3,0,3,3,2,0,3,3,4,5

Labels = Africa, S.America, N.America, Asia, Oceania, Europe
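The factor encoding shown for Continent can be sketched in plain C. This is only an illustration and is not Genstat or IRIS Explorer code: the function name encode_factor and the MAX_LEVELS limit are assumptions made for the example. Levels are numbered in order of first appearance, which reproduces the Values and Labels listed above.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEVELS 16  /* hypothetical upper bound used by this sketch */

/*
 * Encode n observation strings as factor codes. Levels are numbered in
 * order of first appearance. labels[] receives pointers into obs[]
 * (the strings are not copied). Returns the number of levels found,
 * or -1 if more than MAX_LEVELS distinct strings occur.
 */
long encode_factor(const char *obs[], long n,
                   long codes[], const char *labels[MAX_LEVELS])
{
    long levels = 0;
    long i, k;

    assert(obs != NULL && codes != NULL && labels != NULL);

    for (i = 0; i < n; i++) {
        /* Search the levels seen so far for this label */
        for (k = 0; k < levels; k++) {
            if (strcmp(obs[i], labels[k]) == 0) break;
        }
        if (k == levels) {                 /* first appearance: new level */
            if (levels == MAX_LEVELS) return -1;
            labels[levels++] = obs[i];
        }
        codes[i] = k;
    }
    return levels;
}
```

Applied to the Continent column, a call such as encode_factor(continents, 15, codes, labels) would be expected to return 6, with codes holding the Values of Variable 3.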

Each of the variable types could be defined as an individual IRIS Explorer data type. However, many of the modules that would use the new data type can accept data in two or all three of the above forms. For this reason, and to reduce the number of connections between modules, it was decided to create a single data type that can hold all three data structures. The first step in creating a new data type in IRIS Explorer is to create a data type definition file that describes the type in a format that IRIS Explorer can understand. The type definition file for gnBase (Genstat basic type) is shown below.

2.3 The type definition file, gnBase.t

#include <cx/DataCtlr.h>
#include <cx/Typedefs.t>

typedef enum {
    gn_Variate,
    gn_Factor, 
    gn_Text
} gnPrimType;

shared typedef struct {
    long len                            "Length";
    string identifier                   "Identifier";
    gnPrimType gnType                   "Type";
    switch (gnType) {
        case gn_Variate :
            double values[len]          "Values";
        case gn_Text :
            string values[len]          "Values";
        case gn_Factor :
            long values[len]            "Values";
            long levels                 "Levels";
            string labels[levels]       "Labels";
    } d;
} gnData;

shared root typedef struct {
    long nVar                           "Num variables";
    gnData data[nVar]                   "Data array"; 
} gnBase;

The gnBase structure is declared as a shared root structure with two elements, nVar, the number of variables, and data, the variable array. It is declared as a root structure so that it can be used as an input and output port data type in IRIS Explorer. The shared attribute of gnBase means that the data structure will be shared between modules and allocation and deallocation of the memory used for the structure will be controlled in IRIS Explorer by reference counting. The gnBase variable array is an array of gnData which is declared above it.
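The idea of reference counting can be illustrated with a small stand-alone sketch. It does not reproduce the IRIS Explorer implementation: the type RefObj and the functions ref_new, ref_inc and ref_dec are hypothetical, and the convention that the creator holds the first reference is an assumption of this example only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted structure. The real
   bookkeeping is done by IRIS Explorer and is not reproduced here. */
typedef struct {
    int  refs;   /* number of current holders */
    long nVar;   /* payload, echoing the nVar element of gnBase */
} RefObj;

/* Create an object; in this sketch the creator holds one reference. */
RefObj *ref_new(long nVar)
{
    RefObj *obj = malloc(sizeof *obj);
    if (obj == NULL) return NULL;
    obj->refs = 1;
    obj->nVar = nVar;
    return obj;
}

/* A further holder (for example a downstream module) takes a reference. */
void ref_inc(RefObj *obj)
{
    assert(obj != NULL);
    obj->refs++;
}

/* Drop one reference. The memory is freed when the last holder lets go.
   Returns 1 if the object was freed, 0 otherwise. */
int ref_dec(RefObj *obj)
{
    assert(obj != NULL && obj->refs > 0);
    if (--obj->refs == 0) {
        free(obj);
        return 1;
    }
    return 0;
}
```

The point of the scheme is that no single holder needs to know whether another holder still uses the data: each simply releases its own reference.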

The gnData structure stores a single variable. Its elements are len, the number of values, identifier, the variable identifier, gnType, the variable type, and values, the one-dimensional array of values. The switch construct is used to set the type of the values array depending on the variable type. The factor type has two additional elements, namely levels, the number of levels, and labels, the labels for the levels. The gnData structure is also shared because it will be shared between modules, but is not a root type because it was decided to only pass the complete gnBase structure between modules.
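A switch on a type element corresponds to a tagged union in C: the tag must be inspected before the matching member is read. The following sketch uses stand-in names (VarKind, Var, var_format) that are assumptions of this example and are not the generated IRIS Explorer declarations. It formats one value of a variable according to its type and decodes factor values through the labels.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in types for this sketch; they echo gnData but are not the
   declarations generated by IRIS Explorer. */
typedef enum { VAR_VARIATE, VAR_FACTOR, VAR_TEXT } VarKind;

typedef struct {
    long    len;
    VarKind kind;                              /* discriminant (tag) */
    union {
        struct { const double *values; } variate;
        struct { const char *const *values; } text;
        struct {
            const long *values;
            long levels;
            const char *const *labels;
        } factor;
    } d;
} Var;

/* Format value j of v into buf. Returns 0 on success, or -1 for a bad
   index or a factor code outside the range 0..levels-1. */
int var_format(const Var *v, long j, char *buf, size_t size)
{
    long code;

    assert(v != NULL && buf != NULL);
    if (j < 0 || j >= v->len) return -1;

    switch (v->kind) {
    case VAR_VARIATE:
        snprintf(buf, size, "%g", v->d.variate.values[j]);
        return 0;
    case VAR_TEXT:
        snprintf(buf, size, "%s", v->d.text.values[j]);
        return 0;
    case VAR_FACTOR:
        code = v->d.factor.values[j];
        if (code < 0 || code >= v->d.factor.levels) return -1;
        snprintf(buf, size, "%s", v->d.factor.labels[code]);
        return 0;
    }
    return -1;   /* unknown tag */
}
```

Reading a member that does not match the tag is undefined in practice, which is why generated accessors usually check the tag first.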

3. Implementing the type

In this section, the process by which a new type is implemented on a UNIX operating system is described. This process has been simplified for the Windows NT operating system [3].

The type definition file is translated into the files required to use the new type by creating a text file called TYPES containing the single word gnBase in the same directory as gnBase.t and executing the IRIS Explorer makefile creation utility, cxmkmf. This creates the Makefile and executing the make command creates the files listed in section 3.1. To make the new type available to IRIS Explorer, the type has to be installed as described in section 3.2.

3.1 Files generated from gnBase.t

gnBase.3             Unformatted man page
gnBase.api.c         C API code
gnBase.api.h         C API header
gnBase.api.inc       FORTRAN API include
gnBase.api.o         C API object file
gnBase.fapi.c        FORTRAN API code
gnBase.fapi.h        FORTRAN API header file
gnBase.fapi.i        FORTRAN API wrapper include
gnBase.fapi.i_f.c    FORTRAN API wrapper code
gnBase.fapi.o        FORTRAN API wrapper object file
gnBase.global.c      Meta type definition
gnBase.global.o      Meta type definition object
gnBase.h             Type header file
gnBase.inc           FORTRAN include
gnBase.meta.c        Meta type code
gnBase.meta.o        Meta type object file
gnBase.out           Program for generating gnBase.type
gnBase.type          Binary version of gnBase.meta.o
gnBase.z             Compressed formatted man page
libgnBase.a          Library containing object files

The C equivalent of gnBase.t, gnBase.h

#ifndef __GNBASE_H_
#define __GNBASE_H_

/*
* Translated by cxtyper Tue Dec 3 17:13:31 1996
*/

#include <cx/DataCtlr.h>

typedef enum {
    gn_Variate,
    gn_Factor,
    gn_Text
} gnPrimType;

typedef struct gnData {
    cxDataCtlr          ctlr;
    long                len;
    char                *identifier;
    gnPrimType          gnType;
    union {
        struct {
            double      *values;
        } gn_Variate;
        struct {
            char        **values;
        } gn_Text;
        struct {
            long        *values;
            long        levels;
            char        **labels;
        } gn_Factor;
    } d;
} gnData;

typedef struct gnBase {
    cxDataCtlr          ctlr;
    long                nVar;
    gnData              **data;
} gnBase;

#endif

The cxDataCtlr elements of gnData and gnBase are used by IRIS Explorer for reference counting. The automatically generated API functions provide sufficient access to the data structures to make direct manipulation of structure elements by the programmer unnecessary.

3.2 Installation

Before installing a user-defined type, the EXPLORERUSERHOME environment variable should be set to a directory in the user's file space. The make install command copies the files that are required to use gnBase to the relevant destination directories as shown below. If a directory does not exist it is created. If the files are created in $EXPLORERUSERHOME/types, the installation process will delete the .type file and gnBase will not be accessible in IRIS Explorer. The type is therefore normally built in a subdirectory of $EXPLORERUSERHOME/types before being installed.

$EXPLORERUSERHOME/types/
gnBase.type

$EXPLORERUSERHOME/lib/
libgnBase.a

$EXPLORERUSERHOME/include/cx/
gnBase.api.h
gnBase.api.inc
gnBase.h
gnBase.inc
gnBase.t

$EXPLORERUSERHOME/man/man3/
gnBase.man3

4. Using gnBase

In this section, the automatically generated Application Programmer's Interface (API) to gnBase is described (section 4.1) and examples of its use are provided in the form of user function files for modules that use the type (section 4.2).

4.1 The gnBase type API functions

Because the generation of the API functions is a general-purpose automated process, some of the generated functions duplicate others. For example, the gnBaseDataarrayLen function returns the length of the gnBase data array, i.e. the nVar element of gnBase, but there is also a function called gnBaseNumvariablesGet which returns the value of the same element.

4.1.1 gnBase functions

gnBase* gnBaseAlloc(signed long Numvalues);
Return a pointer to a new gnBase structure of given size
gnBase* gnBaseDup( gnBase *src );
Return a pointer to a new duplicate gnBase structure containing duplicate data
gnBase* gnBaseCopy( gnBase *src );
Return a pointer to a new duplicate gnBase structure containing no data
signed long gnBaseNumvariablesGet(gnBase *src, cxErrorCode *ec );
Return the number of variables in src
long gnBaseDataarrayLen( gnBase *src, cxErrorCode *ec );
Same as gnBaseNumvariablesGet
void gnBaseWrite( FILE *fd, int mode, gnBase *src );
Write gnBase structure to a file in either ascii (mode == 0) or binary (mode != 0) format
gnBase* gnBaseRead( FILE *fd );
Return a pointer to a gnBase structure read from a file
void gnBaseDataarrayRem( gnBase *src, gnData** *val, cxErrorCode *ec );
Delete gnData array of src
void gnBaseNumvariablesSet(gnBase *src,signed long val, cxErrorCode *ec );
Set the number of variables in src to val
gnData** gnBaseDataarrayGet( gnBase *src, cxErrorCode *ec );
Return a pointer to the data array of src
void gnBaseDataarraySet( gnBase *src, gnData** val, cxErrorCode *ec );
Set val to be the data array of src

4.1.2 gnData functions

gnData* gnDataAlloc( signed long Length, gnPrimType Type, signed long Levels);
Return a pointer to a new gnData structure with given properties
gnData* gnDataDup( gnData *src );
Return a pointer to a new duplicate gnData structure
gnData* gnDataCopy( gnData *src );
Same as gnDataDup, because gnData does not contain any reference counted structures
void gnDataWrite( FILE *fd, int mode, gnData *src );
Write gnData structure to a file in either ascii (mode == 0) or binary (mode != 0) format
gnData* gnDataRead( FILE *fd );
Return a pointer to a gnData structure read from a file
void gnDataIdentifierRem( gnData *src, cxstring * *val, cxErrorCode *ec );
Delete src identifier
void gnDataValuesRem( gnData *src, void *val, cxErrorCode *ec );
Delete src values
signed long gnDataLengthGet( gnData *src,cxErrorCode *ec );
Return the number of values in src
void gnDataLengthSet( gnData *src, signed long val, cxErrorCode *ec );
Set the length property of src to val
cxstring * gnDataIdentifierGet( gnData *src, cxErrorCode *ec );
Return a pointer to src identifier
void gnDataIdentifierSet( gnData *src, cxstring * val, cxErrorCode *ec );
Set the identifier of src to val
gnPrimType gnDataTypeGet( gnData *src, cxErrorCode *ec );
Return the type of src
void gnDataTypeSet( gnData *src, gnPrimType val, cxErrorCode *ec );
Set the type property of src to val
void* gnDataValuesGet( gnData *src, cxErrorCode *ec );
Return a pointer to the values of src
void gnDataValuesSet( gnData *src, void *member, cxErrorCode *ec );
Set the values of src to be member
long gnDataValuesLen( gnData *src, cxErrorCode *ec );
Same as gnDataLengthGet
cxPrimType gnDataValuesType( gnData *src, cxErrorCode *ec );
Return the IRIS Explorer primary type of src
void* gnDataValuesAlloc( gnData *src );
Return a pointer to a new values array for src

4.1.3 gnData factor functions

signed long gnDataLevelsGet( gnData *src, cxErrorCode *ec );
Return the number of levels of src
void gnDataLevelsSet( gnData *src, signed long val, cxErrorCode *ec );
Set the number of levels of src to be val
cxstring ** gnDataLabelsGet( gnData *src, cxErrorCode *ec );
Return a pointer to the labels of src
void gnDataLabelsSet( gnData *src, cxstring ** val, cxErrorCode *ec );
Set the labels of src to be val
long gnDataLabelsLen( gnData *src, cxErrorCode *ec );
Return the number of labels in src (Same as gnDataLevelsGet)
void gnDataLabelsRem( gnData *src, cxstring ** *val, cxErrorCode *ec );
Delete src labels

The last group of API functions provides access to the elements of the gnData structure that are only relevant when the structure type is gn_Factor. The automatically generated API code for these functions performs a check to ensure that the passed structure is of type gn_Factor before accessing the structure elements. If it is of the wrong type an error is generated. For example, gnDataLevelsGet contains the following code.

signed long gnDataLevelsGet( 
    gnData *src
    ,cxErrorCode *ec )
{
    if (!src) {
        *ec = cx_err_error;
        return (signed long) 0;
    }

    if (src->gnType != gn_Factor) {
        *ec = cx_err_error;
        return (signed long) 0;
    }

    *ec = cx_err_none;
    return src->d.gn_Factor.levels;
}
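Because the function returns 0 both for an error and for a factor that has no levels, a caller would normally test the error code before trusting the returned value. The sketch below shows one way to do this. To compile without the IRIS Explorer headers it declares simplified stand-ins for cxErrorCode, gnPrimType, gnData and gnDataLevelsGet, which are assumptions of this example; the helper levels_or_default is hypothetical.

```c
#include <assert.h>

/* Stand-ins so the sketch compiles without IRIS Explorer headers. Only
   names that appear in the listing above are reused; the layouts and
   enum values are simplified assumptions. */
typedef enum { cx_err_none, cx_err_error } cxErrorCode;
typedef enum { gn_Variate, gn_Factor, gn_Text } gnPrimType;
typedef struct { gnPrimType gnType; long levels; } gnData;

/* Condensed stand-in with the same checks as the generated function. */
static long gnDataLevelsGet(gnData *src, cxErrorCode *ec)
{
    if (!src || src->gnType != gn_Factor) {
        *ec = cx_err_error;
        return 0;
    }
    *ec = cx_err_none;
    return src->levels;
}

/* Hypothetical caller: return the number of levels, or fallback when
   the variable is missing or is not a factor. */
long levels_or_default(gnData *var, long fallback)
{
    cxErrorCode ec;
    long levels = gnDataLevelsGet(var, &ec);

    /* On error the getter returns 0, so the value alone is ambiguous */
    assert(ec == cx_err_none || levels == 0);

    return (ec == cx_err_none) ? levels : fallback;
}
```

Passing a fallback such as -1 lets the calling module distinguish "not a factor" from a genuine level count.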

4.2 Example modules using gnBase type

4.2.1 Read ascii file

This module reads in Variate data from an ascii file and outputs it in a gnBase structure. The module has a single parameter input port connected to a file browser and a single gnBase output. The format of the ascii file is

Number of variables
Number of values for first variable
First Variable identifier
First variable values
Number of values for second variable
Second Variable identifier
Second variable values

etc

Example data file for the Read ascii file module

3
7
Day
0 1 2 3 4 5 6
7
Temperature
10.2 12.7 15.9 13.6 14.4 11.6 12.3
7
Windspeed
25.2 20.6 20.8 22.8 15.3 14.8 15.7
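The file format can be exercised independently of IRIS Explorer. The following sketch reads it into plain C structures and rejects malformed input. The type PlainVariate and the functions read_variates and free_variates are hypothetical names introduced for this example; they are not part of the module or of the gnBase API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define ID_MAX 50  /* identifier buffer size, including the terminator */

/* Hypothetical plain-C stand-in for one variate. */
typedef struct {
    char    id[ID_MAX];
    long    len;
    double *values;
} PlainVariate;

/* Free an array of nvar variables; values pointers may be NULL. */
void free_variates(PlainVariate *vars, long nvar)
{
    long i;
    if (vars == NULL) return;
    for (i = 0; i < nvar; i++) free(vars[i].values);
    free(vars);
}

/* Read a whole file in the format described above. Returns the array
   and stores the count in *nvar, or returns NULL on malformed input
   or allocation failure. */
PlainVariate *read_variates(FILE *in, long *nvar)
{
    PlainVariate *vars;
    long n, i, j;

    assert(in != NULL && nvar != NULL);

    if (fscanf(in, "%ld", &n) != 1 || n <= 0) return NULL;
    vars = malloc((size_t)n * sizeof *vars);
    if (vars == NULL) return NULL;
    for (i = 0; i < n; i++) {          /* make cleanup safe at any point */
        vars[i].len = 0;
        vars[i].values = NULL;
    }

    for (i = 0; i < n; i++) {
        /* Length, then identifier (%49s leaves room for the terminator
           in an ID_MAX-byte buffer) */
        if (fscanf(in, "%ld", &vars[i].len) != 1 || vars[i].len < 0 ||
            fscanf(in, "%49s", vars[i].id) != 1) goto fail;

        vars[i].values = malloc((size_t)vars[i].len * sizeof(double));
        if (vars[i].values == NULL && vars[i].len > 0) goto fail;

        for (j = 0; j < vars[i].len; j++) {
            if (fscanf(in, "%lf", &vars[i].values[j]) != 1) goto fail;
        }
    }
    *nvar = n;
    return vars;

fail:
    free_variates(vars, n);
    return NULL;
}
```

Checking every fscanf return value means that a truncated file is reported as an error rather than producing variables that are only partly filled.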

User function file for the Read ascii file module

#include <cx/cxParameter.api.h>
#include <cx/cxLattice.api.h>
#include <cx/gnBase.api.h>
#include <cx/DataAccess.h>
#include <cx/DataOps.h>
#include <stdio.h>
#include <string.h>

#define MAX 50 /* Size of identifier buffer, including terminator */

/* Release a partly built structure, close the file and alert the user */

void Fail (gnBase *gnb, FILE *in, char *message)
{
    if (gnb) cxDataRefDec(gnb);
    if (in) fclose(in);
    cxModAlert (message);
}

void ReadAscii (char *filename, gnBase **DataOut)
{
    FILE *in;
    long i, j, var, len;
    double val;
    gnData **Array;
    cxErrorCode err;
    char Buffer[MAX];
    char *id;

/* Attempt to open file, return if file cannot be opened */

    if (filename == NULL || *filename == '\0') return;
    in = fopen(filename, "r");
    if (in == NULL) return;

/* Read number of variables and allocate new gnBase structure */

    if (fscanf (in, "%ld", &var) != 1 || var < 1) {
        Fail(NULL, in, "Unable to read number of variables");
        return;
    }
    *DataOut = gnBaseAlloc(var);
    if (*DataOut == NULL) {
        Fail(NULL, in, "Unable to allocate memory");
        return;
    }

/* Get pointer to gnData array */

    Array = gnBaseDataarrayGet(*DataOut, &err);

/* Variable loop */

    for (i = 0; i < var; i++) {

/* Read length of this variate and allocate new gnData structure.
   The third argument is the number of levels, unused for a variate */

        if (fscanf (in, "%ld", &len) != 1 || len < 0) {
            Fail(*DataOut, in, "Unable to read variable length");
            return;
        }
        Array[i] = gnDataAlloc(len, gn_Variate, 0);
        if (Array[i] == NULL) {
            Fail(*DataOut, in, "Unable to allocate memory");
            return;
        }

/* Read identifier (at most MAX-1 characters) and store a copy,
   including the terminating null, in the gnData structure */

        if (fscanf (in, "%49s", Buffer) != 1) {
            Fail(*DataOut, in, "Unable to read identifier");
            return;
        }
        id = (char *) cxDataMalloc (strlen(Buffer) + 1);
        if (id == NULL) {
            Fail(*DataOut, in, "Unable to allocate memory");
            return;
        }
        strcpy (id, Buffer);
        gnDataIdentifierSet (Array[i], id, &err);

/* Read and store values */

        for (j = 0; j < len; j++) {
            if (fscanf (in, "%lf", &val) != 1) {
                Fail(*DataOut, in, "Unable to read value");
                return;
            }
            ((double *)gnDataValuesGet(Array[i], &err))[j] = val;
        }
    }
    fclose (in);
}

4.2.2 Print gnBase

This module prints out the contents of a gnBase structure. It has a single gnBase input port.

User function file for Print gnBase module

#include <cx/cxParameter.api.h>
#include <cx/cxLattice.api.h>
#include <cx/gnBase.api.h>
#include <cx/DataAccess.h>
#include <cx/DataOps.h>
#include <stdio.h>
#include <string.h>

void PrintAscii (gnBase *DataIn)
{
    #define FWIDTH 15 /* Field width of printed output */
    long i, j, var;
    gnData **Array;
    cxErrorCode err;
    gnPrimType type;
    long maxlen;

/* Get number of variables and gnData array pointer */

    var = gnBaseNumvariablesGet(DataIn, &err); 
    Array = gnBaseDataarrayGet(DataIn, &err);

/* Write variable identifiers and store maximum
   variable length */

    maxlen = 0;

    for (i = 0; i < var; i++) {
        printf ("%*s", FWIDTH, gnDataIdentifierGet
                    (Array[i], &err));
        if (gnDataLengthGet(Array[i], &err) > maxlen)
            maxlen = gnDataLengthGet(Array[i], &err);
    }
    printf ("\n");

/* Write values depending on type */

    for (j = 0; j < maxlen; j++) {
        for (i = 0; i < var; i++) {
            type = gnDataTypeGet(Array[i], &err);
            if (j < gnDataLengthGet(Array[i], &err)) {
                switch (type) {
                    case gn_Variate:
                        printf ("%*g", FWIDTH, ((double *)
                            gnDataValuesGet(Array[i], &err))[j]);
                        break;
                    case gn_Factor:
                        printf ("%*s", FWIDTH, ((char **)
                            gnDataLabelsGet(Array[i], &err))
                            [((long *)gnDataValuesGet(Array[i],
                            &err))[j]]);
                        break;
                    case gn_Text:
                        printf ("%*s", FWIDTH, ((char **)
                            gnDataValuesGet(Array[i], &err))[j]);
                        break;
                }
            }
            else {
                printf ("%*s", FWIDTH, "");
            }
        }
        printf ("\n");
    }
}

If the input to this module came from a Read ascii file module that had read in the example file in section 4.2.1, the printed output would be

            Day    Temperature      Windspeed
              0           10.2           25.2
              1           12.7           20.6
              2           15.9           20.8
              3           13.6           22.8
              4           14.4           15.3
              5           11.6           14.8
              6           12.3           15.7

4.2.3 Filter module

This module is an example of a gnBase filter that restricts the variate values to lie between a minimum and a maximum set by the user. The usual way to create a filter module in the Module Builder [3][4] is to pass the parts of the structure that will not be affected by the filter directly from the input to the output port in the connections window and simply connect the parts of the structure to be changed to the function arguments. In this case just the type and values would need to be passed to the function arguments. However, gnBase differs from other IRIS Explorer types in that it contains a double pointer to a reference counted structure (gnData). The Module Builder is not currently able to create module data wrapper code for such a structure. Instead of casting the pointer as (gnData **), it attempts to cast it to (gnData), which fails. In effect, this means that the complete gnBase structure must be passed to the function arguments.

The module has gnBase input and output ports and two parameter input ports, min and max, that are connected to sliders or dials.

User function file for Filter module

#include <cx/cxParameter.api.h>
#include <cx/cxLattice.api.h>
#include <cx/gnBase.api.h>
#include <cx/DataAccess.h>
#include <cx/DataOps.h>
#include <stdio.h>
#include <string.h>

void Filter (gnBase *DataIn, gnBase **DataOut, double min, double max)
{
    long i, j, var;
    double *val;
    gnData **Array;
    cxErrorCode err;
    gnPrimType type;

/* Create duplicate of input gnBase structure */

    *DataOut = gnBaseDup(DataIn); 
    if (*DataOut == NULL) return;

/* Get number of variables and gnData array pointer */

    var = gnBaseNumvariablesGet(DataIn, &err);
    Array = gnBaseDataarrayGet(*DataOut, &err);

/* Variable loop */

    for (i = 0; i < var; i++) {

/* If this variable is a variate, restrict values */

        type = gnDataTypeGet(Array[i], &err);
        if (type == gn_Variate) {
            for (j = 0; j < gnDataLengthGet(Array[i], &err); j++) {
                val = &(((double *)gnDataValuesGet(Array[i],
                                &err))[j]);
                if (*val < min) {
                    *val = min;
                }
                if (*val > max) {
                    *val = max;
                }
            }
        }
    }
}
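The restriction step at the heart of the filter can be isolated and tested without IRIS Explorer. The following sketch is illustrative only: the functions clamp and clamp_array are hypothetical helpers and are not part of the module above.

```c
#include <assert.h>
#include <stddef.h>

/* Restrict one value to the interval [lo, hi]; assumes lo <= hi. */
double clamp(double x, double lo, double hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Restrict an array in place. Returns the number of values changed. */
size_t clamp_array(double *values, size_t n, double lo, double hi)
{
    size_t i, changed = 0;

    assert(values != NULL || n == 0);
    assert(lo <= hi);

    for (i = 0; i < n; i++) {
        double c = clamp(values[i], lo, hi);
        if (c != values[i]) {
            values[i] = c;
            changed++;
        }
    }
    return changed;
}
```

The sketch treats min greater than max as a programming error. In the module itself the two limits come from independent widgets, so that case can occur, and the two tests in the Filter function then set every variate value to max.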

5. Conclusion

It has been demonstrated that a new data type can be successfully incorporated into IRIS Explorer. The new data type was taken from an application that was previously completely unrelated to IRIS Explorer. Due to the flexibility of IRIS Explorer typing, the type could be specified to exactly match the required data structure. The automatically generated API functions provide the programmer with a means to manipulate all parts of the data structure, without having to know about the underlying type definition. Examples of how the API functions could be used within modules were provided.

The inability of the module builder to interpret a double pointer to a shared structure within another shared structure meant that module data wrapper code could only be generated by the module builder when the complete data structure was passed between ports and function arguments. This means that when writing filter modules, the programmer has to copy the parts of the data structure that remain unchanged within the user function, rather than leaving this to the module data wrapper.

References

1. IRIS Explorer User's Guide (1995). The Numerical Algorithms Group Ltd.

2. Genstat 5 Release 3 Reference Manual (1993). Genstat 5 Committee of the Statistics Department, Rothamsted Experimental Station. Oxford University Press.

3. IRIS Explorer Module Writer's Guide (NT) (1997). The Numerical Algorithms Group Ltd.

4. IRIS Explorer Module Writer's Guide (1997). The Numerical Algorithms Group Ltd.