A White Paper by David Baka
At a fundamental level a program is
made up of operators and operands. Operands are areas in memory that hold data.
Operators are the instructions that modify the data.
Consider the following example in
pseudo-assembly:
Add 5 7
Here the operator is add, the operands
are 5 and 7. In more sophisticated programs data can be stored in many complex
ways such as arrays, tables, lists, linked lists, queues, dequeues, records,
collections, vectors, b trees, hash tables, sparse arrays, and few others. In
creating an initial design determining which data structure can become a
surmountable task because each has it’s own pros and cons.
First of all one has to consider how
the data will be operated on. The data is influenced by code which may change
it's value or scope depending on what code is applied to manipulate the data.
For example SQL that contains a join will change the scope of data that can be
viewed in the resulting record set.
To help determine the data design of
more complex programs a mathematical model called a Nadair can be used. The
Nadair has two main parts, the data structure and the influences on the data
structure. Here is a simple example:
T1 is a table
containing rows and columns, T2 is a table
containing rows and columns.
J represents the join between the two
tables. K is used to denote a list of Influences. the Symbol >=> means
“Acts On” denoting that the data structure follows will be acted on by the
proceeding influences,.
After the Symbol >=> is a complex set of symbols which represents the Baka Matrix. The number at the top left corner denotes the number of layers of the matrix. A matrix can represent any data structure and can be as abstract or defined as needed. In this case the matrix has one layer so the data structure appears to be two dimensional. It contains two elements both in brackets. The brackets denote that these elements are a Matrix as well but the structure is defined or will be defined somewhere else. In this case the two elements are tables. However if we wanted to use arrays instead of tables for our structure the model would not have to change only the meaning of the symbols would change. However the code that we would write to implement this would be considerably different.
What if we wanted our model to represent multiple join between several sets of two tables that we might use to create a view, what would our model look like? Look at the formula below.
For
the Influences J was replaced with J1 - Jn meaning that we could have 1 though n number of joins
involved. In the matrix then
number 1 was replaced by N meaning
there could be several layers of matrix depending on what N is equal to.
In
database management systems tables are have key fields which are used to relate
tables in a hierarchy of tables. For example you may have a table for
Departments and a table for Employee
Each
department has many employees. There is a one to many relationship between
departments and employees. The department table has the key field Department
Number and the Employee Table has the key field Employee Number. Each employee record will have a field which
contains a department number. This relates the record in the Department table
with the same department number to those records in the employee table. The
department record is said to be the parent record of the child employee record.
In the Nadair model this is denoted by putting the Department element
(represented by D) first followed by a semi-colon. The Employee element
(represented by E) is placed after it as in the following diagram.
A
common element in a data structure is called cardinality[1]
and is represented by 2 pipe symbols on both sides of the element. This data
element contains the number of elements in a structure. For example if we have an array S the number of elements in S would be represented as ||S|| .
This
is very useful when we have a data structure such as an array where we are
adding values to the elements of the array and want to keep track of where the
last element was written to. For example let say we are designing a program
that will load records into a database. For this database, only 3,000 records
can be loaded at a time. We need to load 50,000 records so there will have to
be some mechanism by which we only load batches of 3,000 records at a time.
To
do this we will assign a batch number
to each load. The records are
currently on a legacy system and must be formatted to fit the data definitions
of the new database. A new set of tables will be created to hold the formatted
data before sending it to the new database. During this formatting Batch
Numbers will be assigned to the records and when the 3,000th record is assigned
the batch number will be incremented by 1.
These
batch numbers will have to be stored in a structure so that when the records
are loaded the batch numbers can be recalled to load each batch in Sequence and
can keep track of the batches that are rolled back.
To
do this we might have an array with 2 X 20 elements running from 0 to 1 and 0
to 19. We can think of this array as two rows of numbers with row 0 being the
top row and row 1 being the bottom row as in the following diagram:
The first column
will hold the number of elements that are filled with batch numbers in element
0, 0 ( the top left most element). The bottom element of the first column will
hold a number representing the state of the process, 0 is for creating batch
numbers, 1 is for loading batches in the database. As the new batch numbers are
assigned the element 0,0 will hold the index (the numbers on top) of the last
element (last column) assigned a batch number.
The diagram below shows what happens after batch number 1000 was assigned.
Now
Element 0,0 contains 1 because the last column to be assigned a batch number is
column 1.
So
row 0 could be represented like this: ||B||,
B where B represents columns 1 through 19.
In a Baka Matrix the first row of our array would look like this:
In this case we use |[B]| instead of
||B|| because B is now defined as it’s own matrix [B].
It does not matter that both elements
|[B]| and [B} are going to be in the same array because we are not defining
what data structure we are using in the Baka Matrix. We are only defining what
elements make up the structure and how they are related. It’s an abstract model
not an exact definition. We could have the same functionality by using a
pointer and linked lists if we are using a language that supports them such as Pascal or C.
In the bottom row have a status indicator for the entire structure that represents whether the structure is being assigned batch numbers or the batch numbers are being read to load records for each batch. This type of element which desolates a state for the structure is symbolized by adding a sharp sign (#) after the symbol used to represent that status. In this case we will use S#
|[B]| and S# are elements that describe
the structure and are used by influences to determine what actions should be
taken. These elements are usually at
the beginning or “Head” of the structure and are referred to collectively as
the Header. To further denote the header information
the header element are usually placed
at the top of the Baka Matrix and may be separated from the rest of the
elements by a line. The following example shows this arrangement.
In this example L* represents the status for each column for Columns 1 - 19 which are stored on the bottom row of the array. In this case the Symbol * is used to denote that the element is a status indicator because it only indicates status for the column not the entire structure. L* will hold the status for loaded, rolled back or not loaded. The actual value stored will be a number that is used to represent these states.
Let’s continue with our example of
assigning record batches to an array for tracking batch numbers. If another
batch number is assigned then the data will look like this:
Element 0,0 has been changed to 2
because the last element to be assigned a batch number has the index number 2.
If my array is called batcharray then I can indicated the last batch number by
batcharray[0, batcharray[0,0] ][2]
because batcharray[0,0] holds the index of the last batch number assigned. This
is also important in my program because if I read past this column I will no
longer be accessing a valid batch number. Batcharray[0,0] is used while loading
batches to tell the program when to stop. It could be said that Batcharray[0,0]
“Points” to the last column of the array.
This is demonstrated in the diagram
below.
Technically Batcharray[0,0] is not
really a pointer is more like a handle. A pointer is variable type that
contains the memory address of another variable or structure. Batcharray[0,0]
contains a number or an index which hold the memory address of the array
elements. This is a method know as indirect addressing.
What changes will have to be made to
our Baka Matrix to indicate if the structure is loading or assigning
batches? The answer is None! The
assigning of values belong to functions not data structures. In order to
complete the design there will need to be two influences. One will assign batch
numbers and one will load records. These influence will have a lot of smaller
steps to them such as changing the status of indicators and setting the index
of the array in batcharray[0,0].
These details can be documented later.
They could even be documented as a flow chart, pseudo code, UML or whatever
methodology you prefer.
The
following diagram shows the Nadair after adding the Influences.
Here AB represents Assigning Batches and LB represents Loading Batches. To complete the design for this level of definition we will add layers because there are a several tables that must be loaded each with there own set of batch numbers. We will also show the definitions of the elements by listing them underneath the Nadair. The following diagram illustrates this.
AB:
Assign Batch (See Flowchart 1)
LB:
Load Batch (See Flowchart 2)
-
N:
The number of table to load
B:
The list of batch numbers
S:
Status of Loading or Assigning Batches
L:
Status of loaded, not loaded or rolled back
As
you can see this is a high level design The element descriptions are listed
below the Nadair. First the elements of the Influences and then separated by a
dash are the elements of the Baka Matrix.
The
N in the upper left hand corner stands for the number of tables that may be
loaded. But what if N = 0? Would that be an invalid state? What if we wanted to
specify in our design that there will always be at least one table that will be
used at any time the program is ran?
A new symbol called the existential[3], which is in the diagram below to represent the existence of at least one case.
In
this case it means “There is at least one x such that x has property R”. This
can be applied to a Nadair as in the following diagram.
Now there is at least one N where the
Nadair exists according to this updated design.
Up until now we have looked at the
concept of our program as what data structure we will use and then designed our
Nadair from it. Although this is useful for explaining the concepts behind the
Nadair, in practice we should design the Nadair first then use it to figure out
from the Nadair what data structure and functions best suit the purpose.
Let’s start off on a new idea for a
program. This program will mimick SQL by using a Graphical User Interface where
you select fields and commands from drop down lists on a form. The program will
be able to do everything SQL can do but does not require the user to write any
SQL code.
First let’s list all the things we can
do with SQL.
1) We can select a list of fields which
will display all the values for those fields for all records selected
2) We can limit our selection by specifying that for some field only certain values are desired
3) we can count, get the sum of, get highest number, get lowest number, etc for a numeric field
4) We can sort records, ordering them by field values in ascending or descending order.
5) We can insert records into a table
6) we can create a table from a list of values
7) we can update fields in a record
8) We can join tables together
There are many other functions as well but this is enough to keep us busy for awhile.