
Database and Database Management Systems

Basic Concepts and Definition (Source: Modern Database Management 6th Edition by Hoffer, Prescott & McFadden)

[Figure: Organizing Data in a Traditional File Environment. A master file (data elements A to Z) is split into derivative files for Accounting, Finance, Sales and Marketing, and Manufacturing. Each group of users works through its own application program: Program 1 (A, B, C, D), Program 2 (A, B, D, E), Program 3 (A, B, E, G), Program 4 (A, E, F, G).]

The traditional approach to file processing encourages each functional area to develop specialized applications. Each application requires its own data file, which is likely to be a subset of the master file. These subsets lead to data redundancy, processing inflexibility, and wasted data storage.

[Figure: Creating a report using traditional file processing. Separate files (Personnel File, Customer Master File) are read in three steps: select sales personnel data, sort into sequence, print report.]

Creating a report using traditional file processing. In this example, three separate files have been created and maintained by their respective divisions or departments. In order to create a simple report consisting of a list of sales personnel by annual sales and principal customers, the three files had to be read, and an intermediate file had to be created. This required writing several programs. The table in the figure shows the information selected

[Figure: The Contemporary Database Environment. The Personnel, Payroll, and Benefits departments use the Personnel, Payroll, and Benefits application programs, which work through the database management system against an integrated Human Resources database. The database holds Employees (Name, Address, Social Security Number, Position, Marital Status), Payroll (Basic Pay, Overtime Pay, Gross Pay, Income Tax, Other Deductions, Net Pay), and Benefits (Life Insurance, Pension Plan, Health Care Plan, Retirement Benefits).]

The Contemporary Database Environment. A single Human Resources database serves multiple applications and also allows a corporation easily to draw together all of the information on various applications. The database management system acts as the interface between the application

Database. A collection of logically related data organized in such a way that allows access, retrieval, and use of that data. Or, a collection of data organized to serve many applications efficiently by centralizing the data and minimizing redundant data. Organized because data are structured so as to be easily stored, manipulated, and retrieved by users. Related because the data describe a domain of interest to a group of users, and the users can use the data to answer questions concerning that domain.

May be of any size and complexity. May contain either data or information (or both).

Database Management. The creation and maintenance of stored information in a computer system. The creation and maintenance of databases. Involves the monitoring, administration, and maintenance of the databases and database groups in an enterprise.

Database Management Systems (DBMS). Special software to create and maintain a computerized database and enable individual business applications to extract the data they need without having to create separate files or data definitions in their computer programs. They permit an organization to centralize data, manage them efficiently, and provide access to the stored data by application programs. A DBMS allows data to be added, updated, and deleted in the database; sorts and retrieves data from the database; and creates forms and reports from the data in the database.

Data. Facts, text, graphics, images, sound, and video segments that have meaning in the users’ environment. Or, the collection of unprocessed items, which can include text, numbers, images, audio, and video.

Data Element. A field in a record.

Information. Data that have been processed in such a way as to increase the knowledge of the person who uses the data. It is organized, meaningful, and useful.

Metadata. Data that describe the properties or characteristics of other data. Some of these properties include data definitions, data structures, and rules or constraints.
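To make the idea of metadata concrete, here is a minimal sketch using Python's built-in sqlite3 module. The employee table and its column names are hypothetical, chosen only for illustration; SQLite is just one convenient DBMS to demonstrate with. The query returns data about the table's columns (names, types, constraints) rather than the stored data itself.

```python
import sqlite3

# Hypothetical table for illustration; names are not from the source text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " basic_salary REAL)"
)

# PRAGMA table_info returns one metadata row per column:
# (column id, name, declared type, not-null flag, default value, primary-key position)
metadata = conn.execute("PRAGMA table_info(employee)").fetchall()
for cid, name, col_type, notnull, default, pk in metadata:
    print(f"{name}: type={col_type}, not_null={bool(notnull)}, primary_key={bool(pk)}")
```

The table holds no rows at all, yet the DBMS can still describe it. That description is the metadata.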

Traditional File Processing Systems

In the beginning of computer-based data processing, there were no databases. Room-size computers (considerably less powerful than today’s personal computer) were used almost exclusively for scientific and engineering calculations. Gradually computers were introduced into the business world. To be useful for business applications, they had to be able to store, manipulate, and retrieve large files of data. Computer file processing systems were developed for this purpose. Although these systems have evolved over time, their basic structure and purpose have changed little over several decades.

As business applications became more complex, it became evident that traditional file processing systems had a number of shortcomings and limitations. As a result, these systems have been replaced by database processing systems in most critical business applications today. Nevertheless, you should have at least some familiarity with file processing systems for the following reasons:

1. File processing systems are still widely used today, especially for backing up database systems.

2. Understanding the problems and limitations inherent in file processing systems can help us avoid these same problems when designing database systems.

Disadvantages of File Processing Systems

Program-Data Dependence. The tight relationship between the data stored in files and the specific programs required to update and maintain those files. Every computer program has to describe the location and nature of the data with which it works. These data declarations can be longer than the substantive part of the program. In a traditional file environment, any change in data requires a change in all of the programs that access the data. Changes in tax rates or ZIP code length, for instance, require changes in programs. Such programming changes may cost millions of dollars to implement in every program that requires the revised data.

Duplication of Data (Data Redundancy). The presence of duplicate data in multiple data files. Data redundancy occurs when different divisions, functional areas, and groups in an organization independently collect the same customer information. Because it is collected and maintained in so many different places, the same data item may have different meanings in different parts of the organization. Simple data items like the fiscal year, employee identification, and product code can take on different meanings as programmers and analysts work in isolation on different applications.

Lack of Flexibility. A traditional file system can deliver routine scheduled reports after extensive programming efforts, but it cannot deliver ad hoc reports (reports produced only when a situation makes them necessary) or respond to unanticipated information requirements in a timely fashion. The information required by ad hoc requests is “somewhere in the system” but is too expensive to retrieve. Several programmers would have to work for weeks to put together the required data items in a new file. Users, in particular senior management, begin to wonder at this point why they have computers at all.

Poor Security. Because there is little control or management of data, access to and dissemination of information are virtually out of control. What limits on access exist tend to be the result of habit and tradition, as well as of the sheer difficulty of finding information.

Limited Data Sharing. The lack of control over access to data in this confused environment does not make it easy for people to obtain information. Because pieces of information in different files and different parts of the organization cannot be related to one another, it is virtually impossible for information to be shared or accessed in a timely manner.

Inconsistent Data. When the same data are stored in multiple locations, inconsistencies are inevitable. Inconsistencies in stored data are one of the most common sources of errors in computer applications. They lead to inconsistent documents and reports and undermine the confidence of users in the integrity of the information systems.

Poor Enforcement of Standards. Data standards (data names, formats, and access restrictions) are difficult to make known and enforce in a traditional file processing environment, mainly because the responsibility for system design and operation has been decentralized.

Lengthy Development Times. Each new application requires that the developer essentially start from scratch by designing new file formats and descriptions and then writing the file access logic for each new program. The lengthy development times required are inconsistent with today’s fast-paced business environment, in which time to market (or time to production for an information system) is a key business success factor.

Excessive Program Maintenance. Descriptions of files, records, and data items are embedded within the individual application programs. Any modification to a data file (such as a change of data name, data format, or method of access) requires that the program also be modified.
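Program-data dependence can be illustrated with a small sketch. The fixed-width record layout below (10-character name, 5-character ZIP code, 8-character balance) is an assumption made up for this example, as are the function and field names. The point is that the program hard-codes where each field lives, so a change in ZIP code length breaks it.

```python
# Hypothetical fixed-width layout: name (10 chars), ZIP code (5 chars), balance (8 chars).
# Every program that reads this file has to repeat these offsets.
def parse_record(line):
    return {
        "name": line[0:10].strip(),
        "zip": line[10:15],
        "balance": float(line[15:23]),
    }

# A record written in the layout the program expects.
old_record = f"{'Ana Cruz':<10}{'12345':<5}{150.75:>8.2f}"
print(parse_record(old_record))

# The file format changes: ZIP codes grow from 5 to 9 characters.
new_record = f"{'Ana Cruz':<10}{'123456789':<9}{150.75:>8.2f}"
try:
    parse_record(new_record)
except ValueError as err:
    # The old offsets now cut through the middle of the ZIP code and the balance.
    print("program must be rewritten:", err)
```

Every other program that reads the same file would need the same correction, which is the maintenance cost described above.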

[Figure: Elements of a DBMS. Application programs (Program 1, Program 2, Program 3) work through the DBMS, which includes the data definition language and the data manipulation language, to reach the physical database.]

Elements of a DBMS (Database Management System). In an ideal database environment, application programs work through a DBMS to obtain data from the database. This diagram illustrates a DBMS with an active data dictionary that not only records definitions of the contents of the database but also allows changes in data size and format to be automatically utilized by the application programs.

Components of a DBMS

Data Definition (Description) Language. The component of a DBMS that defines each data element as it appears in the database. It is used to create and destroy databases and database objects. These commands are used primarily by database administrators during the setup and removal phases of a database project. The basic DDL commands are CREATE, ALTER, and DROP; USE, which selects the database to work in, is often listed with them.
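As a rough illustration of data definition commands, the sketch below runs CREATE, ALTER, and DROP against an in-memory SQLite database through Python's sqlite3 module. The part table and its columns are hypothetical. SQLite has no USE command (each connection is already bound to one database file), so that command is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_names(connection):
    # sqlite_master is SQLite's own catalog of the objects defined in the database.
    rows = connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]

# CREATE defines a new database object.
conn.execute("CREATE TABLE part (part_number INTEGER PRIMARY KEY, description TEXT)")
# ALTER changes the definition of an existing object.
conn.execute("ALTER TABLE part ADD COLUMN unit_price REAL")
columns = [row[1] for row in conn.execute("PRAGMA table_info(part)")]
tables_after_create = table_names(conn)

# DROP removes the object and its definition.
conn.execute("DROP TABLE part")
tables_after_drop = table_names(conn)

print(columns, tables_after_create, tables_after_drop)
```

None of these statements touches a single row of data; they only change what the database is able to hold.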

Data Manipulation Language. A language associated with a DBMS that end users and programmers use to manipulate data in the database. It is a family of syntax elements, similar to a computer programming language, used for inserting (INSERT command), deleting (DELETE command), and updating (UPDATE command) data in a database. Performing read-only queries (SELECT command) of data is sometimes also considered a component of DML. Structured Query Language (SQL) is the standard data manipulation language for relational DBMSs.

Data Dictionary. An automated or manual tool for storing and organizing information about the data maintained in a database. It stores definitions of data elements and data characteristics such as usage, physical representation, ownership (who in the organization is responsible for maintaining the data), authorization, and security. Many data dictionaries can produce lists and reports of data utilization, groupings, program locations, and so on. A data element represents a field. Besides listing the standard name (BasicSalary), the dictionary lists the names that reference this element in specific systems and identifies the individuals, business functions, programs, and reports that use this data element.
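The four data manipulation commands named above (INSERT, UPDATE, DELETE, SELECT) can be tried out with a short sketch. It uses Python's sqlite3 module and a hypothetical employee table; the names and salary figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, basic_salary REAL)"
)

# INSERT adds rows.
conn.executemany(
    "INSERT INTO employee (emp_id, name, basic_salary) VALUES (?, ?, ?)",
    [(1, "Ana", 30000.0), (2, "Ben", 28000.0), (3, "Cy", 26000.0)],
)
# UPDATE changes values in existing rows.
conn.execute("UPDATE employee SET basic_salary = 29000.0 WHERE emp_id = 2")
# DELETE removes rows.
conn.execute("DELETE FROM employee WHERE emp_id = 3")
# SELECT reads rows without changing them.
remaining = conn.execute(
    "SELECT emp_id, name, basic_salary FROM employee ORDER BY emp_id"
).fetchall()
print(remaining)
```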

Logical and Physical Views of Data

Perhaps the greatest difference between a DBMS and traditional file organization is that the DBMS separates the logical and physical views of the data, relieving the programmer or end user from the task of understanding where and how the data are actually stored.

The database concept distinguishes between logical and physical views of data. The logical view presents data as they would be perceived by end users or business specialists, whereas the physical view shows how data are actually organized and structured on physical storage media.

The logical description of the entire database, listing all of the data items and the relationship among them, is termed the schema. The specific set of data from the database that is required by each application program is termed the subschema.

The logical view is used to make the user’s life easier. In a database there is only one physical view, but there can be multiple ways a user can view each item, using whatever logic best suits them. The logical view is unique to each user; each user can decide what they want to view and how they want to view items in a database. Not only can each individual choose the logical view they would like, but they can also set that view up as the default. It also allows people to understand the layout of how the database works, such as the architecture. This helps users understand the functions and workflows.

Why would a person want to use the logical view? Each individual in a business usually has a specific task that differs from the next user’s, but all employees are working off the same database. Users do not need to view all areas of the database, only the areas that pertain to their job. This eliminates the extra information that the specific user will not need. The user can have a tree-like structure or diagram that lists all items in the database; this also allows the user to see what they can access, while still letting them choose the items suited to their needs.


For example:

A bookkeeping firm will keep Accounts Receivable records and Accounts Payable records. To keep a system of checks and balances in place, the business will have one employee for Accounts Receivable (AR) and another employee for Accounts Payable (AP). The accounts receivable clerk does not need to see the checks that are being cut by the accounts payable clerk because they do not pertain to the accounts receivable clerk’s job. The accounts receivable clerk will set their logical view to show the customer number and perhaps the customer’s balance due. This helps the accounts receivable clerk view only what he or she needs and makes the database less complicated. The accounts payable clerk does not need to view which customers owe money or what their balances due are. The accounts payable clerk only needs to know the bank balance and the invoices that need to be paid. So, the accounts payable clerk will set their logical view to show only the tables they need.

The logical view will improve a company’s procedures by allowing users to access only the tables or information that they need. When an employee is doing data entry, such as accounts receivable, the work is very repetitive and the employee does not need any more distractions. Simplifying the data that the employee is looking at will eliminate minor errors that can cause headaches for other employees.

The logical view is a simple term, which helps the user only view the items they need for their specific task. The logical view is a view that is logical to the database user. The database only has one physical view, but can have many logical views. Each user can have their own logical view that is unique to the user and unique to a specific task. This will help the user so they are not looking at an abundance of information that they have no use for. This will help eliminate errors, because the user will not be overwhelmed and confused.

The term physical view comes from the viewing of the database. Ways of viewing a database fall into two categories: physical view and logical view. There can be many different ways for users to view a database, depending on their needs and purposes. The physical view refers to the way data are physically stored and processed in a database. The logical view, on the other hand, is designed to suit the needs of different users by representing data in a meaningful format. In other words, the logical view tells the users, in their own terms, what is in the database. So, while there can be numerous logical views of a database to suit the needs of the users, there can be only one physical view of a database, because the physical view deals with the physical storage of information on a storage device.

Benefits of Different Views

The capability of allowing different views (both logical and physical) of a database gives users a lot of flexibility. With logical views, users can see data differently from how they are stored and, most of all, users do not need to know the technical details of the physical storage. On the other hand, database specialists and advanced users can benefit from the physical view of the database because, through it, they can see the physical storage of the data. This allows specialists to make the database more efficient.

Understanding the Purpose of Different Views and Examples:

Understanding the physical view and the difference between physical and logical views is very important. Imagine that a database called “datasoft” hit the market with only a physical view. Would the database sell and be popular? Probably not, because it would be too technical: users would have to look through the physical storage to see the data itself. What if “datasoft” had one physical view and one logical view? Again probably not, because there would be only one way to view the database no matter how many users there are.

Think about the employee database in a restaurant. It would store data such as employee ID, social security number, number of complaints, hours worked, hourly rate, vacation time, sick days taken, taxes paid, and wages paid. However, managers and employees do not need to see all of the information in that database. For employees, the relevant data are hours worked and hourly rate; they do not need to see data such as the number of complaints. Managers need data such as number of complaints, hours worked, and wages paid to monitor costs and assess employee performance. However, managers do not need to see data such as social security numbers, because that is the employee’s personal information. Hence, a database works best when it has many different logical views for different users, each containing only the information relevant to those users.

The physical view of the database becomes useful when a database specialist designs the database and makes it efficient. The physical view contains the information about where the data are physically stored. In the restaurant example, the whereabouts of the physical storage are irrelevant to managers and employees because they do not need to know them. However, when the database becomes substantially large and complicated, it is important for a database specialist to design the physical storage so that the data are well organized and efficient to access. For example, a database specialist trying to make the large employee database of the State of California more efficient would have to use the physical view to see where the data are stored and whether they are scattered across different storage devices or kept in one area. After analyzing the physical storage of the data using the physical view, the database specialist would be able to suggest ways to improve the database’s efficiency, such as moving data onto different high-speed storage devices to make access faster.
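In a relational DBMS, logical views like those in the restaurant example are commonly defined as SQL views. The sketch below is one possible rendering using Python's sqlite3 module. The table layout, view names, and sample row are assumptions made for illustration, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One stored table; its layout on disk is the DBMS's concern (the physical view).
conn.execute(
    """CREATE TABLE employee (
           emp_id INTEGER PRIMARY KEY,
           ssn TEXT,
           complaints INTEGER,
           hours_worked REAL,
           hourly_rate REAL)"""
)
conn.execute("INSERT INTO employee VALUES (1, '000-00-0000', 2, 40.0, 15.0)")

# Logical view for employees: only hours worked and hourly rate.
conn.execute(
    "CREATE VIEW employee_view AS "
    "SELECT emp_id, hours_worked, hourly_rate FROM employee"
)
# Logical view for managers: complaints, hours, and wages, but no social security number.
conn.execute(
    "CREATE VIEW manager_view AS "
    "SELECT emp_id, complaints, hours_worked, "
    "hours_worked * hourly_rate AS wage_paid FROM employee"
)

employee_rows = conn.execute("SELECT * FROM employee_view").fetchall()
manager_rows = conn.execute("SELECT * FROM manager_view").fetchall()
print(employee_rows)
print(manager_rows)
```

Both views read the same stored row, so there is still a single copy of the data. Only the presentation differs from one user group to the next.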

Advantages of DBMS

- Complexity of the organization’s information systems environment can be reduced by central management of data, access, utilization, and security.

- Data redundancy and inconsistency can be reduced by eliminating all of the isolated files in which the same data elements are repeated.

- Data confusion can be eliminated by providing central control of data creation and definitions.

- Program-data dependence can be reduced by separating the logical view of data from its physical arrangement.

- Program development and maintenance costs can be radically reduced.

- Flexibility of information systems can be greatly enhanced by permitting rapid and inexpensive ad hoc queries of very large pools of information.

- Access and availability of information can be increased.

The feature list of a DBMS can only be considered an advantage if those features are essential in providing an effective solution to the application’s data management requirements. A quick summary of the features that address such requirements:

Transaction Support– Atomic transactions guarantee complete failure or success of an operation. This includes automatic recovery of the database to a transaction consistent point in the event of an abnormal termination of the application (crash, power loss, etc.).
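A minimal sketch of atomic behavior, using Python's sqlite3 module: a transfer consists of two updates, and if the second one fails, the first is rolled back as well. The account table, the account names, and the rule that balances may not go negative are all assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("payroll", 100.0), ("benefits", 0.0)]
)
conn.commit()

def transfer(connection, source, target, amount):
    """Move amount between accounts; either both updates happen or neither does."""
    try:
        with connection:  # commits if the block succeeds, rolls back if it raises
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "payroll", "benefits", 60.0)      # succeeds
failed = transfer(conn, "payroll", "benefits", 60.0)  # would overdraw payroll
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to the target account has already been applied when the debit violates the constraint, and the rollback undoes that credit too.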

Concurrent Access– By controlling access to data items, the DBMS lets many users (processes or threads) share and access data concurrently.

Data Normalization– A well-designed database schema can reduce storage requirements on the target storage media by reducing duplicate data.

Expandability, Flexibility, Scalability– A database system can scale easily to larger data sets.

Standards Enforcement– One example of this advantage would be to use the DBMS for all data storage requirements for the application. Multiple data structures can be manipulated using the same API functions. This can lead to reduced application development times and reduced maintenance costs in the future.

Fast Query Access– Databases allow indexing based on any attribute or data property (e.g., SQL columns). This allows fast retrieval of data based on the indexed attribute. This is an important advantage as data sets begin to grow large, because it provides a more predictable query response time.
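The effect of an index can be observed by asking the DBMS how it plans to run a query. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the employee table, the index name, and the generated rows are hypothetical. The exact wording of the plan text varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, position TEXT)")
conn.executemany(
    "INSERT INTO employee (name, position) VALUES (?, ?)",
    [(f"name{i}", "clerk") for i in range(1000)],
)

query = "SELECT * FROM employee WHERE name = 'name500'"

def plan(connection, sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
    return " ".join(row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(conn, query)  # without an index, every row is scanned
conn.execute("CREATE INDEX idx_employee_name ON employee (name)")
plan_after = plan(conn, query)   # with the index, the lookup goes through it

print("before:", plan_before)
print("after: ", plan_after)
```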

Interoperability– Connectivity through industry standard protocols allowing third-party tools to access and analyze data.

Data Integrity

Most companies realize that data is one of their more valuable assets — because data is used to generate information. Many business transactions take less time when employees have instant access to information. To ensure that data is accessible on demand, a company must manage and protect its data just as it would any other resource. Thus, it is vital that the data has integrity and is kept secure. For a computer to produce correct information, the data that is entered into a database must have integrity. Data integrity identifies the quality of the data. An erroneous member address in a member database is an example of incorrect data. When a database contains this type of error, it loses integrity. Data integrity is very important because computers and people use information to make decisions and take actions.

Garbage in, garbage out (GIGO) is a computing phrase that points out the accuracy of a computer’s output depends on the accuracy of the input. If you enter incorrect data into a computer (garbage in), the computer will produce incorrect information (garbage out).
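One way a DBMS helps keep garbage out is by enforcing integrity constraints when data is entered. The sketch below, using Python's sqlite3 module, defines a hypothetical member table with two assumed rules (a name is required, and age must fall between 0 and 130) and shows that rows breaking the rules are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE member (
           member_id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           age INTEGER CHECK (age BETWEEN 0 AND 130))"""
)

def try_insert(connection, row):
    """Attempt to store one row; report whether the DBMS accepted it."""
    try:
        with connection:
            connection.execute(
                "INSERT INTO member (member_id, name, age) VALUES (?, ?, ?)", row
            )
        return "accepted"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"

results = [
    try_insert(conn, (1, "Ana", 34)),   # valid row
    try_insert(conn, (2, None, 40)),    # missing name violates NOT NULL
    try_insert(conn, (3, "Ben", -5)),   # negative age violates the CHECK rule
]
stored = conn.execute("SELECT COUNT(*) FROM member").fetchone()[0]
print(results, stored)
```

Constraints of this kind catch only values that break a stated rule. A plausible but wrong value, such as an erroneous member address, still has to be caught by verification outside the database.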

Qualities of Valuable Information

The information that data generates also is an important asset. People make decisions daily using all types of information such as receipts, bank statements, pension plan summaries, stock analyses, and credit reports. In a business, managers make decisions based on sales trends, competitors’ products and services, production processes, and even employee skills. To assist with sound decision making, the information must have value. For it to be valuable, information should be accurate, verifiable, timely, organized, accessible, useful, and cost-effective.

Accurate information is error free. Inaccurate information can lead to incorrect decisions. For example, consumers assume their credit report is accurate. If your credit report incorrectly shows past due payments, a bank may not lend you money for a car or house.

Verifiable information can be proven as correct or incorrect. For example, security personnel at an airport usually request some type of photo identification to verify that you are the person named on the ticket.

Timely information has an age suited to its use. A decision to build additional schools in a particular district should be based on the most recent census report — not on one that is 20 years old. Most information loses its value with time. Some information, such as information about trends, gains value as time passes and more information is obtained.

Organized information is arranged to suit the needs and requirements of the decision maker. Different people may need the same information presented in a different manner. For example, an inventory manager may want an inventory report to list out-of-stock items first. The purchasing agent, instead, wants the report alphabetized by vendor.

Accessible information is available when the decision maker needs it. Having to wait for information may delay an important decision.

Useful information has meaning to the person who receives it. Most information is important only to certain people or groups of people.

Cost-effective information should give more value than it costs to produce. A company occasionally should review the information it produces to determine if it still is cost-effective to produce. Sometimes, it is not easy to place a value on information. For this reason, some companies create information only on demand, that is, as people request it, instead of on a regular basis. Many companies make information available online. Users then can access and print online information as they need it.

DESIGNING DATABASES

There are alternative ways of organizing data and representing relationships among data in a database. Conventional DBMSs use one of three principal logical database models for keeping track of entities, attributes, and relationships.

Hierarchical Data Model. A database model that organizes data in a tree-like structure. A record is subdivided into segments that are connected to each other in one-to-many parent-child relationships.

[Figure: a tree structure with a ROOT segment and child segments (FIRSTCHILD, SECONDCHILD).]

A hierarchical database for a human resources system. The hierarchical database model looks like an organizational chart or a family tree. It has a single “root” segment (Employee) connected to lower-level segments (Compensation, Job Assignments, and Benefits). Each subordinate segment, in turn, connects to other subordinate segments. Here, Compensation connects to Performance Ratings and Salary History. The Benefits connects to Pension, Life Insurance and Health Care. Each subordinate segment is the “child” of the segment directly above it.
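The parent-child structure described in this caption can be sketched as a nested Python dictionary. This is only an in-memory illustration of the one-to-many tree, not how a hierarchical DBMS stores its segments; the function name is hypothetical.

```python
# Each segment has exactly one parent; its children are nested beneath it.
hr_tree = {
    "Employee": {
        "Compensation": {"Performance Ratings": {}, "Salary History": {}},
        "Job Assignments": {},
        "Benefits": {"Pension": {}, "Life Insurance": {}, "Health Care": {}},
    }
}

def path_to(tree, target, path=()):
    """Return the root-to-target path, found by following parent-to-child links."""
    for segment, children in tree.items():
        here = path + (segment,)
        if segment == target:
            return here
        found = path_to(children, target, here)
        if found:
            return found
    return None

print(path_to(hr_tree, "Pension"))
```

Reaching any segment means starting at the root and descending through its parents, which is why access paths in this model are fixed in advance.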

Network Data Model. A variation of the hierarchical model. Indeed, databases can be translated from hierarchical to network and vice versa in order to optimize processing speed and convenience. Whereas hierarchical structures depict one-to-many relationships, network structures depict data logically as many-to-many relationships. In other words, parents can have multiple “children” and a child can have more than one parent.

A typical many-to-many relationship in which network DBMS excels in performance is the student-subject relationship. There are many subjects in a university and many students. A student takes many subjects and a subject has many students.

[Figure: The Network Data Model. Subject 1, Subject 2, and Subject 3 are linked to Student 1, Student 2, Student 3, Student 4, and Student 5.]

The Network Data Model. This illustration of a network data model showing the relationship the students in a university have to the subjects they take represents an example of logical many-to-many relationships. The network model reduces the redundancy of data representation through the increased use of pointers.
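A many-to-many relationship of this kind can be sketched with two lookup tables that reference each other. The enrollment pairs below are hypothetical, since the links drawn in the original figure are not recoverable from the text, and the dictionaries stand in for the pointers a network DBMS would maintain.

```python
from collections import defaultdict

# Hypothetical enrollments as (student, subject) pairs.
enrollments = [
    ("Student 1", "Subject 1"),
    ("Student 1", "Subject 2"),
    ("Student 2", "Subject 1"),
    ("Student 3", "Subject 2"),
    ("Student 3", "Subject 3"),
]

subjects_of = defaultdict(set)  # student -> subjects taken
students_in = defaultdict(set)  # subject -> students enrolled

for student, subject in enrollments:
    # Each side stores only a reference to the other, not a copy of its data.
    subjects_of[student].add(subject)
    students_in[subject].add(student)

print(sorted(subjects_of["Student 1"]))
print(sorted(students_in["Subject 2"]))
```

A student appears under several subjects and a subject under several students, yet each enrollment is recorded only once.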

The Relational Data Model. The most recent of these database models, the relational model overcomes some of the limitations of the other two. It represents all data in the database as simple two-dimensional tables called relations. The tables appear similar to flat files, but the information in more than one file can be easily extracted and combined. Sometimes tables are referred to as files.

[Figure: three tables (relations). Columns are fields; rows are records, or tuples.]

ORDER

ORDER-NUMBER  ORDER-DATE  DELIVERY-DATE  PART-NUMBER  PART-AMOUNT  ORDER-TOTAL
1634          02/02/2013  02/03/2013     152          2            140.00
1635          02/07/2013  02/10/2013     137          3             78.75
1636          02/10/2013  02/20/2013     145          1             22.50

PART

PART-NUMBER  PART-DESCRIPTION  UNIT-PRICE  SUPPLIER-NUMBER
137          Door Latch        26.25       4058
145          Door Handle       22.50       2038
152          Compressor        70.00       1125

SUPPLIER

SUPPLIER-NUMBER  SUPPLIER-NAME  SUPPLIER-ADDRESS
1125             CBM Inc.       44 Winslow, Gary IN 44950
2038             Ace Inc.       Rte. 101, Essex NJ 07763
4053             James Corp.    51 Elm, Rochester NY 11349

The Relational Data Model. Each table is a relation and each row or record is a tuple. Each column corresponds to a field. These relations can easily be combined and extracted in order to access data and produce reports, provided that any two share a common data element. In this example, the ORDER file shares the data element “PART-NUMBER” with the PART file. The PART and SUPPLIER files share the data element “SUPPLIER-NUMBER.”
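Combining two relations through a shared data element corresponds to a join in SQL. The sketch below loads the ORDER and PART rows into SQLite through Python's sqlite3 module and joins them on the part number. The values are my reading of the flattened figure data, and the table is named orders here because ORDER is a reserved word in SQL; the underscore column names are likewise an adaptation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_number INTEGER, part_number INTEGER, "
    "part_amount INTEGER, order_total REAL)"
)
conn.execute(
    "CREATE TABLE part (part_number INTEGER, part_description TEXT, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1634, 152, 2, 140.00), (1635, 137, 3, 78.75), (1636, 145, 1, 22.50)],
)
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [(137, "Door Latch", 26.25), (145, "Door Handle", 22.50), (152, "Compressor", 70.00)],
)

# Join the two relations on their common data element, part_number.
rows = conn.execute(
    """SELECT o.order_number, p.part_description, o.part_amount * p.unit_price
       FROM orders AS o
       JOIN part AS p ON o.part_number = p.part_number
       ORDER BY o.order_number"""
).fetchall()
for row in rows:
    print(row)
```

The computed amount times unit price matches the ORDER-TOTAL column for each order. The PART and SUPPLIER relations could be joined the same way on the supplier number.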

Efficient

- Well-organized (performing tasks in an organized and capable way).

- Able to perform without waste (capable of achieving the desired result with the minimal use of resources, time and effort).

Effective

- Producing result (causing a result, especially the desired or intended result).

- Producing favorable impression (successful, especially in producing a strong or favorable impression on people).


Data Structure. A framework for organizing and storing data. Or, a class of data that can be characterized by its organization and the operations that are defined on it. Data structures are sometimes called data types.

Data structures are the basic building blocks of any physical database architecture. No matter what file organization or DBMS you use, data structures are used to connect related pieces of data. Although many modern DBMSs hide the underlying data structures, the tuning of a physical database requires understanding the choices a database designer can make about data structures. This appendix addresses the fundamental elements of all data structures and overviews some common schemes for storing and locating physical elements of data.

POINTERS

A pointer is used generically as any reference to the address of another piece of data. In fact, there are three types of pointers:

1. Physical address pointer contains the actual, fully resolved disk address (device, cylinder, track, and block number) of the referenced data. Using a physical pointer is the fastest way to locate another piece of data, but it is also the most restrictive: If the address of the referenced data changes, all pointers to it must also be changed. Physical pointers are commonly used in legacy database applications with network and hierarchical database architectures.

2. Relative address pointer contains the relative position (or “offset”) of the associated data from some base, or starting, point. The relative address could be a byte position, a record, or a row number. A relative pointer has the advantage that when the whole data structure changes location, all relative references to that structure are preserved. Relative pointers are used in a wide variety of DBMSs; a common use is in indexes in which index keys are matched with row identifiers (a type of relative pointer) for the record(s) with that key value.

3. Logical key pointer contains meaningful data about the associated data element. A logical pointer must be transformed into a physical or relative pointer by some table lookup, index search, or mathematical calculation to actually locate the referenced data. Foreign keys in a relational database are often logical key pointers.
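The difference between a relative address pointer and a logical key pointer can be sketched in a few lines. Here a byte string stands in for a file of fixed-length records. The 16-byte record size, the sample records, and the function names are assumptions for illustration. A physical address pointer is not shown because Python does not expose disk addresses.

```python
# Fixed-length records stored back to back in one byte string (a stand-in for a file).
RECORD_SIZE = 16
records = [b"137 Door Latch", b"145 Door Handle", b"152 Compressor"]
storage = b"".join(record.ljust(RECORD_SIZE) for record in records)

def read_by_relative_pointer(base, row_number):
    """Relative pointer: a row number turned into a byte offset from a base address."""
    start = base + row_number * RECORD_SIZE
    return storage[start:start + RECORD_SIZE].rstrip().decode()

# A logical key pointer carries a meaningful value (the part number) and must be
# translated into a relative pointer, here through an index lookup.
index = {137: 0, 145: 1, 152: 2}

def read_by_logical_key(part_number):
    return read_by_relative_pointer(0, index[part_number])

print(read_by_relative_pointer(0, 2))
print(read_by_logical_key(145))
```

If the whole structure moved, only the base would change and the row numbers would remain valid. If the records were reordered, only the index would need updating, and the logical keys held elsewhere would still work.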

File. A collection of logically related records (a record is a structure of logically related fields, or elements, of information). It is maintained so that people can store data in a systematic and organized manner for quick and easy retrieval and updating.

File Organization. The physical arrangement of data on the backing storage devices (e.g., magnetic tapes or discs). Alternatively, a technique for physically arranging the records of a file on secondary storage. There are several ways in which files can be organized, and for each, a separate method of accessing the records must be defined. The choice of organization method is often a compromise between the requirement for efficient maintenance of files and fast retrieval. Generally, it is also necessary to achieve a practical balance between storage and processing costs. The cost of data processing can be reduced if the data organization suits the characteristics and applications of the logical data.

With modern relational DBMSs, you do not have to design file organizations, but you may be allowed to select an organization and its parameters for a table or physical file. In choosing a file organization for a particular file in a database, you should consider seven important factors:

1. Fast data retrieval
2. High throughput for processing data input and maintenance transactions
3. Efficient use of storage space
4. Protection from failures or data loss
5. Minimizing need for reorganization
6. Accommodating growth
7. Security from unauthorized use

Types of Files

System Files. Files that include the operating system programs. They are used to start the computer system, and as you proceed, to provide services and control of software and files.

Applications Software Files. The software needed for an application, such as word processing. They are invoked whenever you use that specific application, but you will probably have little direct interaction with the applications files themselves.

Data Files. Files that hold data that is related to applications software, such as a memo using word processing software. They are used with applications software, either to supply input data, or to store the files you create.

Information Retrieval. Any file processing for the purpose of retrieving data to produce useful information.

Methods of Information Retrieval

Retrieval. The act of transferring a record from a secondary storage to main memory so that the data in its fields can be accessed.

Writing. The act of transferring a record from main memory to secondary storage.

Insertion. Adding a new record to an existing file.

Deletion. Removing a record from the file.

Updating. Making changes to the contents of a record to reflect the new status of information maintained on file.

Sorting. Rearranging the records in a file for the purpose of producing ordered reports.

Merging. Combining two or more files in the same sequence into a single output file. It could be a record merging or a file merging.

Matching. Comparing a record against a record from two or more input files to ensure that there is a complete set of records for each key. Mismatched records are highlighted for subsequent action.

Searching. Looking for records with a certain key value or satisfying criteria.
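
Merging and matching can be illustrated with a short Python sketch. It assumes each "file" is a list of `(key, data)` tuples already sorted on the key; the function names `merge_files` and `match_files` are hypothetical, chosen only for this example.

```python
import heapq

def merge_files(*files):
    """Merge several record lists, each already in the same key sequence,
    into a single output list in that sequence."""
    return list(heapq.merge(*files, key=lambda rec: rec[0]))

def match_files(master, transactions):
    """Return the transaction records whose key has no master record,
    so that they can be highlighted for subsequent action."""
    master_keys = {rec[0] for rec in master}
    return [rec for rec in transactions if rec[0] not in master_keys]
```

For example, merging `[(1, "A"), (4, "D")]` with `[(2, "B"), (3, "C")]` yields the four records in key order 1, 2, 3, 4 without sorting the combined data again.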

Classification of Files by Function (Types of Data Processing Files)

1. Master File. A file that represents a static view of some aspect of an organization’s business at a point in time (e.g., a payroll master file, a customer master file, a personnel master file, an inventory master file). A record on a master file keeps track of the status of something (e.g., an employee, a customer, a product, an account). A master file is a more or less accurate snapshot of some aspect of the “real world” and contains relatively permanent or semi-permanent data or historical status data consolidated for reference and updating. The master file will have to be updated to reflect the current status of the data it contains. A special kind of master file is a dictionary file, which contains descriptions of data rather than the data themselves.

2. Transaction File. A file that contains or collects changes that are going to be applied to a master file. It may contain data to add a new record or to remove or modify an existing record on a master file. Each record on a transaction file represents an event or change in something whose status is tracked on a master file. For example, a sales transaction file is used to update the stock master file.

3. Transition File. A temporary file created during processing for a specific use. For example, electricity meter readings and a customer’s details may be extracted from master and transaction files to form a statement details file, which is then used for the printing of monthly customer statements.

4. Report File. A file containing data that are to be formatted for representation to a user. The file may be spooled to a printer to produce a hard-copy or may be displayed on a terminal screen. A report file may be produced by a report-writer package or by an application program.

5. Work File. A temporary file used for the storing of intermediate data for further processing. It has neither the long-term character of a master file nor the input or output character of a transaction file or report file. One common use of a work file is to pass data created by one program to another. Another example is the set of work files created by the sort utility during an external sort.

6. Security or Backup File. An extra copy of a file kept to safeguard against the damage or loss of current versions.

7. Audit File. A particular type of transaction file. Audit files play the same role as the postings in a traditional ledger. For example, in a sales ledger system, the transactions recorded might include invoice number, date, invoice amount due, amount of cash received, cheque number, and date credited. By keeping a copy of all the transactions that cause the permanent files to be changed, they enable the auditor to check the correct functioning of computer-based procedures.

8. Program File. A file that contains instructions for the processing of data which may be stored in other files or resident in main memory. The instructions may be written in a high-level language (e.g., COBOL, Pascal, C, Java), an assembler language, machine language, or a job control language. The instructions may be in the form of source code or may be the result of compilation, linking, interpretation, or other processing.

9. Text File. A file that contains alphanumeric and graphic data input using a text editor program. It may be processible only by that text editor, or may be stored in such a way that it can be processed by several editors.

Methods of File Organization

Sequential File Organization. A method in which records are organized in a particular order according to a key field. For example, a people file will be in order by a key that uniquely identifies each person, such as Social Security number or customer number. If a particular record in a sequential file is wanted, then all the prior records in the file must be read before reaching the desired record. Tape storage is limited to sequential file organization. Disk storage may be sequential, but records on disk can also be accessed directly.

Direct File Organization. A method in which records are not organized in any special order. It allows the computer to go directly to the desired record by using a record key. The computer does not have to read all preceding records in the file as it does if the records are arranged sequentially. It requires a disk device called a direct-access storage device (DASD), because the computer can go directly to the desired record on the disk. It is this ability to access any given record almost instantly that has made computer systems so convenient for people in service industries, such as catalog order-takers determining if a particular sweater is in stock, or bank tellers checking individual bank balances. An added benefit of direct access organization is updating in place (the ability to read, change, and return a record to its same place on the disk).

If we have a completely blank area on the disk and can put records anywhere, then there must be some predictable system for placing a record at a disk address and then retrieving the record at a subsequent time. In other words, once the record has been placed on a disk, it must be possible to find it again. This is done by choosing a certain formula to apply to the record key, thereby deriving a number to use as the disk address. Hashing (or randomizing) is the name given to the process of applying a mathematical operation to a key to yield a number that represents the address. Even though the record keys are unique, it is possible for a hashing scheme to produce the same disk address, called a synonym, for two different records; such an occurrence is called a collision. There are various ways to recover from a collision; one way is simply to use the next available record slot on the disk.
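
A minimal Python sketch of hashing with the "next available slot" recovery described above follows. The division-remainder formula, the file size `NUM_SLOTS`, and the function names are assumptions made for illustration; real systems use other formulas and slot layouts.

```python
NUM_SLOTS = 7                 # hypothetical file size, in record slots
slots = [None] * NUM_SLOTS    # each slot holds (key, record) or None

def hash_address(key):
    """Division-remainder hashing: one common choice of formula."""
    return key % NUM_SLOTS

def store(key, record):
    """Place a record at its hashed address, or at the next free slot."""
    addr = hash_address(key)
    for step in range(NUM_SLOTS):
        probe = (addr + step) % NUM_SLOTS
        if slots[probe] is None or slots[probe][0] == key:
            slots[probe] = (key, record)
            return probe
    raise RuntimeError("file is full")

def fetch(key):
    """Retrieve a record by repeating the same probe sequence."""
    addr = hash_address(key)
    for step in range(NUM_SLOTS):
        probe = (addr + step) % NUM_SLOTS
        if slots[probe] is None:
            return None       # an empty slot ends the search
        if slots[probe][0] == key:
            return slots[probe][1]
    return None
```

Keys 10 and 17 are synonyms here because both hash to address 3; the second one stored lands in slot 4. The sketch omits deletion, which would need extra care so that emptied slots do not cut a probe sequence short.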

Indexed Sequential Organization. A method in which records are organized sequentially, but an index built into the file allows a record to be accessed either sequentially or directly. Records are stored in physical sequence according to the primary key. The file management system builds an index, separate from the data records, that contains key values together with pointers to the data records themselves. This index permits individual records to be accessed at random without accessing other records. The entire file can also be accessed sequentially.

This type of file consists of three (3) main parts, namely:

The file index – holds the set of “pointers” that enable individual records to be located. The index contains the relevant record keys and the corresponding record addresses. Access and retrieval of a specific record is effected through the use of the index.

The prime or home area – is where the data records are loaded in sequential order when the file is first created.

The overflow area – is where additions to the file that cannot be accommodated in the prime area are stored. The overflow records are linked to the rest of the file through a system of pointers maintained in the index.
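
The three parts can be sketched in Python as follows. This is a simplified model under stated assumptions: records are `(key, value)` tuples, the file is created with at least one record, the index stores the highest key of each prime block, and each block has its own overflow chain. The class name `IndexedSequentialFile` and the `block_size` parameter are hypothetical.

```python
import bisect

class IndexedSequentialFile:
    def __init__(self, records, block_size=2):
        records = sorted(records)   # load in key sequence
        # Prime area: blocks filled when the file is first created.
        self.prime = [records[i:i + block_size]
                      for i in range(0, len(records), block_size)]
        # File index: highest key held in each prime block.
        self.index = [block[-1][0] for block in self.prime]
        # Overflow area: one chain per prime block for later additions.
        self.overflow = [[] for _ in self.prime]

    def _block_for(self, key):
        """Use the index to find the block responsible for a key."""
        pos = bisect.bisect_left(self.index, key)
        return min(pos, len(self.prime) - 1)

    def insert(self, key, value):
        """Additions after creation go to the overflow area."""
        self.overflow[self._block_for(key)].append((key, value))

    def get(self, key):
        """Direct access: search one prime block and its overflow chain."""
        b = self._block_for(key)
        for k, v in self.prime[b] + self.overflow[b]:
            if k == key:
                return v
        return None

    def scan(self):
        """Sequential access: yield every record in key order."""
        for b, block in enumerate(self.prime):
            yield from sorted(block + self.overflow[b])
```

As overflow chains grow, direct access slows down because more records must be examined per lookup, which is one reason such files are periodically rebuilt.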

Modes of Accessing Files

1. Input
2. Output
3. Input/Output

An input file is only read by the program. For example, a file of tax rates and tables would be an input file for the program that computes income taxes. A transaction file is generally an input file to an update program. A program file of source code is an input file to a compiler program.

An output file is only written to by a program; it is created by the program. For example, a report file may be the output of a program that updates a master file. A program file of object code is an output file from a compiler program.

An input/output file is both read from and written to during a program’s execution. For example, the payroll master might be used by the payroll program both as a source of data about employee pay rates and as a repository of month-to-date and year-to-date pay totals. An input/output file could be created by one phase of a program, then either modified or read by another phase of the same program. A master file commonly is an input/output file, as are the work files of sort programs.

File Organizations

1. Sequential
2. Relative
3. Indexed Sequential
4. Multi-key

Two Basic Ways that file organization techniques differ

1. The organization determines the file’s record sequencing, which is the physical ordering of the records in storage.

2. The file organization determines the set of operations necessary to find particular records. Individual records are typically identified by having particular values in search-key fields. This data field may or may not have duplicate values in the file; the field can be a group or elementary item. Some file organization techniques provide rapid accessibility on a variety of search keys; other techniques support direct access only on the value of a single one.

The organization most appropriate for a particular file is determined by the operational characteristics of the storage medium used and the nature of the operations to be performed on the data. The most important characteristic of a storage device that influences selection of a file organization technique (or, to turn the argument around, that influences selection of a storage device once the appropriate file organization technique has been determined) is whether the device allows direct access to a particular record occurrence without accessing all physically prior record occurrences that are stored on the device/medium, or allows only sequential access to record occurrences. Magnetic disks are examples of direct access storage devices (DASDs); magnetic tapes are examples of sequential access storage devices.

File Operations

The way that a file is going to be used is an important factor in determining how the file should be organized. Two major aspects of a file’s use are its mode of use and the nature of the operations on the file.

A file can be accessed by a program that executes in batch mode or by a program that executes interactively. With the batch mode access, transactions generally can be sorted to improve master file access, whereas with interactive access the transactions are processed as they arrive. With batch mode access, performance is typically measured by throughput (the number of transactions processed in a time period). With interactive access, performance is also measured by response time to individual transactions. Some file organizations are better suited to support interactive access than are others.

Fundamental Operations that are performed on files

1. Creation
2. Update, including:
   a. Record Insertion
   b. Record Modification
   c. Record Deletion
3. Retrieval, including:
   a. Inquiry
   b. Report Generation
4. Maintenance, including:
   a. Restructuring
   b. Reorganization

Creating a File

The initial creation of a file is also referred to as the loading of the file. The bulk of the work in creating transaction and master files involves data collection and validation. In some implementations, space is first allocated to the file, then the data are loaded into that skeleton. In other implementations, the file is constructed a record at a time. In many cases, data are loaded into a transaction or master file in batch mode, even if the file actually is built one record at a time. Loading a master file interactively can be excessively time-consuming and labor-intensive if large volumes of data are involved.

Updating a File

Updating is changing the contents of a master file to make it reflect a more current snapshot of the real world. These changes may include (1) the insertion of new record occurrences (e.g., adding a record for a newly hired employee), (2) the modification of existing record occurrences (e.g., changing the pay rate for an employee who has received a raise), and (3) the deletion of existing record occurrences (e.g., removing the record of an employee who has left the company). The updated file then represents a more current picture of reality.

In some implementations, the records of a file can be modified in place, new records can be inserted in available free space, and records can be deleted to make space available for reuse. If a file is updated in place by a program, then the file usually is an input/output file for that program.

Some implementations are more restrictive and a record cannot be updated in place. In these cases, the old file is input to an update program and a new version of the file is output. The file is essentially recreated with current info. However, not all of the records need to have been modified; some (maybe even most) of the records may have been copied directly from the old version to the updated version of the file.
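
The second, more restrictive style (old file in, new file out) can be sketched in Python as below. The sketch assumes both files are lists sorted in ascending key order, that there is at most one transaction per key, and that transactions carry the hypothetical action codes "insert", "modify", and "delete".

```python
def update_master(old_master, transactions):
    """Produce a new master file from an old master and a transaction file.

    old_master:   list of (key, data), ascending by key
    transactions: list of (key, action, data), ascending by key
    """
    new_master = []
    i = 0
    for key, action, data in transactions:
        # Copy unchanged records that precede this transaction's key.
        while i < len(old_master) and old_master[i][0] < key:
            new_master.append(old_master[i])
            i += 1
        exists = i < len(old_master) and old_master[i][0] == key
        if action == "insert" and not exists:
            new_master.append((key, data))
        elif action == "modify" and exists:
            new_master.append((key, data))
            i += 1            # the old version is not copied
        elif action == "delete" and exists:
            i += 1            # skip the old record entirely
        else:
            raise ValueError(f"cannot {action} key {key}")
    # Copy the remaining unchanged records directly.
    new_master.extend(old_master[i:])
    return new_master
```

The old master is never modified, so it remains available as a backup of the previous version, and records untouched by any transaction are simply copied across.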

Retrieving from a File

Retrieval is the access of a file for purposes of extracting meaningful info. There are two basic classes of file retrieval: inquiry and report generation. These two classes can be distinguished by the volume of data that they produce. An inquiry results in a relatively low-volume response, whereas a report may create many pages of output. However, some installations prefer to distinguish between inquiry and report generation by their modes of processing. If a retrieval is processed interactively, these installations would call retrieval an inquiry or query. If a retrieval is processed in batch mode, the retrieval would be called report generation. This terminology tends to make report generation more of a planned, scheduled process and inquiry more of an ad hoc, spontaneous process. Both kinds of retrieval are required by most info systems.

An inquiry generally is formulated in a query language, which ideally is a natural-language-like structure that is easy for a “non-computer-expert” to learn and to use. A query processor is a program that translates the user’s inquiries into instructions that are used directly for file access. Most installations that have query processors have acquired them from vendors rather than designing and implementing them in-house.

A file retrieval request can be comprehensive or selective. Comprehensive retrieval reports info from all the records on a file, whereas selective retrieval applies some qualification criteria to choose which records will supply info for output. Examples of selective retrieval requests formulated in a typical but fictitious query language are the following:

FIND EMP_NAME OF EMP_PAY_RECORD WHERE EMP_NO = 12751

FIND ALL EMP_NAME, EMP_NO OF EMP_PAY_RECORD WHERE EMP_DEPT_NAME = "MIS"

FIND ALL EMP_NAME, EMP_NO OF EMP_PAY_RECORD WHERE EMP_SAL IS BETWEEN 20000 AND 40000

FIND ALL EMP_NAME, EMP_AGE, EMP_PHONE OF EMP_PAY_RECORD WHERE EMP_AGE < 40 AND EMP_SEX = "M" AND EMP_SAL > 50000

COUNT EMP_PAY_RECORD WHERE EMP_AGE < 40

FIND AVERAGE EMP_SAL OF EMP_PAY_RECORD WHERE DEPT_NAME = "MIS"

In each case the WHERE clause gives the qualification criteria. Note that the last two queries apply aggregate functions COUNT and AVERAGE to the qualifying set of records. Some file organizations are better suited to selective retrievals and others are more suited to comprehensive retrievals.
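
The same ideas can be sketched in Python, with a predicate function playing the role of the WHERE clause. The sample records and the helper names `find`, `count`, and `average` are hypothetical and exist only for this illustration; the field names follow the fictitious query language above.

```python
from statistics import mean

# Hypothetical EMP_PAY_RECORD occurrences; all values are made up.
emp_pay_records = [
    {"EMP_NO": 12751, "EMP_NAME": "Santos", "EMP_DEPT_NAME": "MIS",
     "EMP_SAL": 30000, "EMP_AGE": 35},
    {"EMP_NO": 12752, "EMP_NAME": "Lim", "EMP_DEPT_NAME": "MIS",
     "EMP_SAL": 50000, "EMP_AGE": 45},
    {"EMP_NO": 12753, "EMP_NAME": "Garcia", "EMP_DEPT_NAME": "HR",
     "EMP_SAL": 25000, "EMP_AGE": 28},
]

def find(fields, where):
    """Selective retrieval: return `fields` from records satisfying `where`."""
    return [tuple(rec[f] for f in fields)
            for rec in emp_pay_records if where(rec)]

def count(where):
    """Aggregate: number of qualifying records."""
    return sum(1 for rec in emp_pay_records if where(rec))

def average(field, where):
    """Aggregate: mean of `field` over the qualifying records."""
    return mean(rec[field] for rec in emp_pay_records if where(rec))
```

For instance, `find(["EMP_NAME"], lambda r: r["EMP_NO"] == 12751)` corresponds to the first query, and `count(lambda r: r["EMP_AGE"] < 40)` to the COUNT query. Each call scans every record, which is the behavior of a file organization with no support for selective retrieval.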

Maintaining a File

Maintenance activities are changes that are made to files to improve the performance of the programs that access them. There are two basic classes of maintenance operations: restructuring and reorganization. Restructuring a file implies that structural changes are made to the file within the context of the same file organization technique. For example, field widths could be changed, new fields could be added to records, more space might be allocated to the file, the index tree of the file might be balanced, or records of the file might be re-sequenced, but the file organization method would remain the same. File reorganization implies a change from one file organization to another.

The various file organizations differ in their maintenance requirements. These maintenance requirements are also very dependent upon the nature of activity on the file contents and how quickly that activity changes. Some implementations have file restructuring utilities that are automatically invoked by the operating system; others require that data processing personnel notice when file activity has changed sufficiently or program performance has degraded enough to warrant restructuring or reorganization of a file. Some installations perform file maintenance on a routine basis. For example, a utility might be run weekly to collect free space from deleted records, to balance index trees, and to expand or contract space allocations.

In general, master files and program files are created, updated, retrieved from, and maintained. Work files are created, updated, and retrieved from, but are not maintained. Report files generally are not updated, retrieved from, or maintained. Transaction files are generally created and used for one-time processing.

File Systems

Accessing (reading and writing) data in files requires a great deal of activity that is transparent to the application programmer. Programming languages enable programmers to define rather complex file organization techniques with quite simple statements. A file system provides support to enable programmers to access files without being concerned about the details of storage characteristics and device timings. The file system converts the programmer’s relatively simple file access statements to low-level I/O instructions. A programmer’s simple request to READ or WRITE a record on a file typically invokes a complex sequence of supporting device management operations. The programmer’s job would be considerably more difficult if it involved contending with these detailed I/O control operations.

Responsibilities of a File System

1. Maintaining a directory of file identification and location info.
2. Establishing pathways for data flow between main memory and secondary storage devices.
3. Coordinating communication between the central processing unit (CPU) and secondary storage devices and vice versa, including:
   a. Handling the imbalance in speeds of the computer’s CPU and storage devices in such a way that the CPU does not spend an inordinate amount of time waiting idly for I/O operations to be completed.
   b. Handling data in such a way that they can be held if the sender (CPU or secondary storage device) and the receiver (secondary storage device or CPU) are not ready at the same time.
4. Preparing files for input or output use.
5. Handling files when their input or output use is terminated.

File Directories

Before a file can be accessed by a program, the file system must know where the file is located. Nearly all file systems use some sort of directory structure to manage the identification and location info about files.

Different systems use different structures for their file control blocks. One representative system constructs two tables to describe each file. The first contains info on file organization and processing mode, descriptions of record sizes and types, blocking sizes, error counts and flags, info used to determine file status and position, and the file’s logical (external) name. It also contains a pointer to the file’s second descriptive table, which is associated with a particular storage device and contains info about its physical characteristics and current status. Device controller routines use this table to interface requests to the file and to monitor the status of I/O activities on the file. Control info in this system is modularized between the tables to improve the efficiency of I/O processing.
