Managing data
Network Databases
There are two main types of network databases.
- Centralised: All the data in the database is stored in a single physical location i.e. on one server, and the data can be accessed from multiple locations simultaneously. Advantages include reduced data redundancy, and therefore lower probability of anomalies, and more manageable security restrictions. Disadvantages are the potential for the system to run slowly if many users are connected and the reduced data redundancy, because if the central computer is damaged nobody is able to access the database.
- Distributed: The database is stored in a number of physical locations connected with a network. Distributing data normally increases the speed of the data and data can be replicated as a failsafe (or to increase speed). Obviously anomalies are more likely with a distributed database; they are also more costly, and it is difficult to handle multiple operations performing on the database at once.
Distributing Data
if using a distributed database, data can be distributed in any of a number of ways. These include duplicating all of the data, storing complete tables at different sites, or 'partitioning' tables by breaking them up.
There are two main methods for partitioning.
- Horizontal: Data is distributed across sites based on the primary keys, i.e. the data is split at some row.
- Vertical: Data is distributed based on columns, with the primary key being replicated so that the data can be joined together i.e. the data is split at some column.
Data Warehousing
A data warehouse is a repository for information that collects information from a number of sources and transforms them into a data model that can be queried and analysed. A data warehouse should make data easily accessible, present information consistently, and be secure.
The following are characteristics of a data warehouse.
- Subject Oriented: Data is organised based on significant subject areas.
- Time Variant: Data is accurate for the present moment whenever it is recorded.
- Non-Volatile: Data cannot be changed once inserted.
- Integrated: Data from a number of sources is integrated into one source.
Data Mart
A data mart is a subset of the data warehouse containing information about one specific business unit. Benefits include the fact that a data mart is cheaper than a full warehouse and it improves performance for its subject area.
Data marts are distinct from data warehouses in that they usually hold less detailed information and only contain data regarding a specific subject area.
Data Mining
Data mining is the process of analysing large amounts of data to determine hidden patterns and trends. Data warehouses are optimised to be analysed in this manner.
Legal/Ethical issues
There are privacy concerns associated with data warehousing; for example, what types of data can be used and analysed by the company, and who owns the data.
There is also the issue of the legality of using data in making decisions, such as gender or ethnicity.
Data Dictionaries
A data dictionary is a table describing all the data in a database, containing relevant information about each attribute in the tables of a database.
Data dictionaries usually have headers similar to the following:
Element Name | Data Type | Size / Format | Default | Description | Constraints |
---|
(Each of the fields should be fairly self-explanatory as to what they do.)
One data dictionary should be created for each table.
Data Integrity
There are three main types of data integrity.
- Domain integrity: This specifies the values that are valid for a column. Includes restrictions defined by constraints, or being a primary key.
- Referential integrity: This refers to the relationships between tables, essentially ensuring that every foreign key refers to a valid primary key.
- Entity integrity: This refers to the records within a table, ensuring that every table has a primary key unique to each record and not null.
Data Anomalies
There are three types of data anomalies that can occur. (Data that is properly normalised will never have anomalies.)
- Update anomaly: Refers to an update having to be done in multiple locations in a table, which is redundant. For example, if a person's name was stored in multiple locations and their name was updated, all of the instances of their name would have to be updated. This can cause inconsistencies if the update does not go to completion.
- Insertion anomaly: Refers to an insertion being impossible because other data is absent. For example, suppose a table stored people's names with reference to their offices in one table. If a new employee was hired, there would be no way of storing their information in the database because they have not been assigned an office yet. (If they were assigned an office immediately after being hired this would not be an anomaly.)
- Deletion anomaly: Refers to a deletion causing the unintended loss of other data; similar to an insertion anomaly but reversed. For example, take the example of people's names referring to their offices as above. If a person left the company the information about their office would be lost with the information about them. (If their office was demolished immediately after leaving the company this would not be an anomaly.)
Relations
A relation is a set of tuples with the same attributes, also known as a table or an entity.
Relations have the following properties:
- Uniqueness: No attributes can be repeated. For example, the following is not a relation: Person(PersonID, Phone1, Phone2)</span> because the Phone attribute is repeated.
- Atomicity: Each cell must contain only a single value. Similar to uniqueness but within a cell. For example, the Phone attribute can only contain one phone number.
Relation definitions
A relation definition looks like this:
Relation(PrimaryKey, Attribute)
The primary key must always be underlined. Foreign keys do not have to have the table name specified (each table should have a differently named primary key, for this reason) but should have the word "FK" after them.
Normalisation
Normalisation is the process of eliminating anomalies and redundancies in data, improving its integrity and efficiency. The process is designed to improve the design of a database.
To start normalising, data must be organised as a relation, defined above.
There are three normal forms (four including the unnormalised form):
- 1st Normal Form: There are no repeating elements or groups of elements.
- 2nd Normal Form: There are no partial dependencies on a concatenated key.
- 3rd Normal Form: There are no dependencies on non-key attributes.
There is a method for normalisation, described below.
- Identify the attributes for each subsection
- Move the attributes to smaller relations
- Identify / create a PK for each relation
- Identify / create a FK for the relation you moved attributes from
- Check all of the non-key fields are fully 'functionally dependent' on the PK
'Functionally dependent' means that the primary key is necessary to find the attribute. For example, if a person's salary is determined by their role, their salary is not functionally dependent, and the table would need to be split up.
In exams you will only ever be asked to normalise to 3NF (3rd Normal Form), and doing practice normalisation will assist you greatly in doing this. Over time it should become second nature; essentially you wish to just split the tables up until there are no many-to-many relationships between tables (see ER Diagrams), and so that no anomalies are possible.
ER Diagrams
ER (Entity Relationship diagrams) are diagrams that graphically represent the relationships between tables in a database.
ER Diagrams have a number of symbols. A rectangle is an entity (or table), which is titled with the name of the table. A diamond is a relationship, labelled with the 'type' of relationship (usually 'has' is sufficient). Attributes are ovals.
To represent a relationship, draw a line between the two tables. Draw a diamond in the center to represent the relationship, and name it with something appropriately describing the relationship, normally in a one word verb. Then draw the cardinality on.
To represent the attributes of a table, draw an oval for each of the attributes, and connect it to the rectangle representing the table with a line. Similarly to a relation definition, primary keys should be underlined and foreign keys should have the word "FK" after them.
The cardinality of a relationship refers to the number of entities in a relationship. For example, suppose one person can own multiple toys. The cardinality of the Person:Toy
relationship would be 1:M
, to represent One Person can own Many Toys. There are four possible cardinalities: 1:1
, 1:M
, M:1
, and M:M
. (1:M
and M:1
could be considered equivalent depending on perspective.)
Cardinalities are drawn on the relationship lines next to the tables. For example, with the example given above, a small 1
would be drawn next to the Person
table and a small M
next to the Toy
table. (Ensure that the cardinalities are drawn close to the line so that there is no confusion about which relationships they refer to.)
Resolving many-to-many relationships
A many-to-many relationship must be resolved as it undermines ether the uniqueness or atomicity of a relation. If a many-to-many relationship exists, a foreign key field must be either repeated or contain multiple values in one of the tables.
To resolve a many-to-many relationship, a 'bridging table' must be created. This should contain a foreign key to one of the tables, a foreign key to the other, and any information that makes sense to be in the bridging table. If the combination of foreign keys of the two tables is unique, they can be used as a composite primary key (this is represented by underlining both of them in the ER diagram or in the relation definition); otherwise, another primary key must be created.
For example, a purchase of a toy from an online store might be represented as such:
Purchase(PurchaseID, UserID FK, ToyID FK, Date)
Since date should be a part of the bridging table, it is listed as an attribute (and obviously removed from whatever table it was in before).
Design considerations
A number of design considerations must be taken into account when making a user interface for a database system.
- Readability: The user interface must be readable. This includes factors such as contrast, font size, font family etc. (Sans serif fonts are considered to be better on the web than serif fonts.)
- Navigation: The user interface should be easily navigable.
- Logical order: Forms and the interface in general should be logically ordered. For example, typically, first name will come before last name.
- Inclusivity: The user interface must be usable for any user of the system, and must not discriminate on any conditions. For example, if colourblind users will be using the interface, the system must be designed to be readable for colourblind users.