organizing data - ousl

Post on 18-Dec-2021

Organizing data

Learning Outcome

1. make an array

2. divide the array into class intervals

3. describe the characteristics of a table

4. construct a frequency distribution table

5. construct a composite table

Raw data

• Raw data is the set of values recorded in a study or survey.

• Primary data: obtained directly from an individual; gives precise information.

• Secondary data: obtained from an outside source. Ex: hospital records, census.

• Data should be arranged in a meaningful way.

Array

• The basic arrangement of numeric data is called an ARRAY.

• An array is data derived from the fundamental (raw) data.

Example:

To store the marks of 50 students
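As a minimal sketch of turning raw data into an array, the snippet below assumes (as is common in introductory statistics) that the array is the raw data ordered from smallest to largest. The ten marks are hypothetical values chosen only for illustration; they are not from the slides.

```python
# Hypothetical raw marks for ten students (illustrative values only).
raw_marks = [56, 72, 41, 88, 63, 72, 35, 90, 56, 67]

# The array: the same values arranged in ascending order.
array = sorted(raw_marks)

# Once ordered, the minimum, maximum and range can be read off directly.
minimum, maximum = array[0], array[-1]
data_range = maximum - minimum

print(array)
print("min:", minimum, "max:", maximum, "range:", data_range)
```

Ordering the values first makes the later step of choosing class intervals easier, because the smallest and largest values are immediately visible.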

Frequency and Frequency Distribution Table

• A frequency refers to the number of observations in each category of a variable.

• A frequency distribution describes how often each value or category occurs in a set of measurements.

• When presenting the frequency distribution in a table format, it becomes a frequency distribution table.

For the nominal variable “sex”:

Table 2: Frequency distribution table of pre-school children attending the child welfare clinic

Sex       Frequency   Percentage (%)
Female        6            60.0
Male          4            40.0
Total        10           100.0
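The counts and percentages in Table 2 could be produced along the following lines. This is only a sketch: the list of ten individual observations is an assumption (the slides give only the totals of 6 females and 4 males), and the order of the observations is arbitrary.

```python
from collections import Counter

# Ten observations consistent with Table 2; the order is hypothetical.
sex = ["Female", "Male", "Female", "Female", "Male",
       "Female", "Male", "Female", "Male", "Female"]

counts = Counter(sex)          # frequency of each category
total = sum(counts.values())   # total number of observations

# Frequency and percentage for each category, in alphabetical order.
table = {cat: (n, round(100 * n / total, 1)) for cat, n in sorted(counts.items())}

for cat, (n, pct) in table.items():
    print(f"{cat:<8}{n:>5}{pct:>8.1f}")
print(f"{'Total':<8}{total:>5}{100.0:>8.1f}")
```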

• Similarly, a frequency table can be constructed for ordinal and discrete variables.

• A frequency table can also be constructed for a continuous variable, but the list would be too long, because most values occur only once or twice.

• Such data need to be divided into segments.

• These segments are called “classes” or “class intervals”.

• The end points of the class intervals are specified as “class limits”.

• Example: Haemoglobin values (g/dl) obtained from 35 men at the diabetic clinic (this has been made into an array; read down each column)

 7.7    9.8   10.6   11.4   12.3   12.6
 8.6    9.9   10.9   11.6   12.3   13.4
 8.7    9.9   10.9   11.6   12.4   15.2
 9.0   10.1   11.0   11.7   12.5   15.3
 9.2   10.2   11.0   11.8   12.5   16.7
 9.4   10.2   11.3   12.2   12.6

Example of class limits:

• For the class interval of 6 – 7.9 g/dl, the lower limit is 6 and the upper limit is 7.9.

• For the class interval of 8 – 9.9 g/dl, the lower limit is 8 and the upper limit is 9.9.

• Dividing the array into class intervals should be done according to the researcher's objective or the size of the data.

• Usually the number of class intervals varies between 5 and 12.

• Too few class intervals would obscure details and lead to loss of information, whereas too many defeat the very purpose of grouping.

General rules of class intervals

i. The number of class intervals should not be less than 5

ii. The class limits should not overlap

iii. The class intervals should be of equal size

• In the haemoglobin example, the values range from 7.7 g/dl to 16.7 g/dl. We could divide this data into 6 class intervals as follows:

6.0 – 7.9 g/dl

8.0 – 9.9 g/dl

10.0 – 11.9 g/dl

12.0 – 13.9 g/dl

14.0 – 15.9 g/dl

16.0 – 17.9 g/dl
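Counting how many of the 35 haemoglobin values fall into each of these six class intervals might look like the sketch below. The values are those listed in the example; representing each interval by its lower limit and a common width of 2.0 g/dl is a choice made here for the illustration, so that the intervals cannot overlap.

```python
# The 35 haemoglobin values (g/dl) from the example, already in array form.
haemoglobin = [
    7.7, 8.6, 8.7, 9.0, 9.2, 9.4, 9.8, 9.9, 9.9, 10.1, 10.2, 10.2,
    10.6, 10.9, 10.9, 11.0, 11.0, 11.3, 11.4, 11.6, 11.6, 11.7, 11.8,
    12.2, 12.3, 12.3, 12.4, 12.5, 12.5, 12.6, 12.6, 13.4, 15.2, 15.3, 16.7,
]

# Lower limits of the six class intervals; every interval is 2.0 g/dl wide.
lower_limits = [6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
width = 2.0

# A value belongs to an interval when lower limit <= value < next lower limit.
frequencies = [
    sum(1 for x in haemoglobin if lo <= x < lo + width)
    for lo in lower_limits
]

for lo, f in zip(lower_limits, frequencies):
    print(f"{lo:4.1f} - {lo + width - 0.1:4.1f} g/dl  {f:3d}")
print("Total", sum(frequencies))
```

The frequencies add up to 35, which is a quick check that every observation has been placed in exactly one class interval.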

Composite table

• There are two other frequencies:

– Relative frequency

– Cumulative frequency

• Relative frequency: the number of observations falling into each interval as a fraction of the total number of observations.

Relative frequency = Frequency / Total frequency

• Relative frequency can be presented as a percentage too.

Relative frequency percentage = (Frequency / Total frequency) × 100

• Cumulative frequency (or percentage): the number (or percentage) of observations that are less than or equal to the upper limit of each interval.

Cumulative frequency = Previous cumulative frequency + Current frequency
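A composite table combines these columns. The sketch below assumes the class frequencies 1, 8, 14, 9, 2 and 1, which are obtained by counting the 35 haemoglobin values listed earlier in the six class intervals 6.0 – 7.9 through 16.0 – 17.9 g/dl; the column layout is only illustrative.

```python
from itertools import accumulate

intervals = ["6.0 - 7.9", "8.0 - 9.9", "10.0 - 11.9",
             "12.0 - 13.9", "14.0 - 15.9", "16.0 - 17.9"]
frequencies = [1, 8, 14, 9, 2, 1]   # counted from the haemoglobin example
total = sum(frequencies)

# Relative frequency percentage = (frequency / total frequency) x 100
relative_pct = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency = previous cumulative frequency + current frequency
cumulative = list(accumulate(frequencies))
cumulative_pct = [round(100 * c / total, 1) for c in cumulative]

print(f"{'Hb (g/dl)':<13}{'Freq':>5}{'Rel %':>8}{'Cum':>6}{'Cum %':>8}")
for row in zip(intervals, frequencies, relative_pct, cumulative, cumulative_pct):
    print(f"{row[0]:<13}{row[1]:>5}{row[2]:>8.1f}{row[3]:>6}{row[4]:>8.1f}")
```

The last cumulative frequency always equals the total number of observations, and the last cumulative percentage is 100.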

Data presentation

• Data can be presented in two ways:

1. Tables: can be simple or complex, depending upon the number of measurements of a single set or multiple sets of items.

2. Diagrams or graphs: graphic presentations used to illustrate and clarify information.

• Tables are essential in the presentation of scientific data, and diagrams are complementary, summarizing these tables in an easy, attractive and simple way.

Characteristics of a table

A statistical table has at least five major parts and some other minor parts.

The major parts are:

1) The table number
2) The title
3) The column captions
4) The row captions
5) The body (including the totals)

The minor parts are:

6) Units of measurement in the rows and columns, if appropriate
7) Foot notes
8) Source notes

1) Table number: the number should be continuous with the number of the previous table, if it is not the first table.

2) The title: a title is the main heading shown at the top of the table. It must explain the contents of the table. Different parts of the heading can be separated by commas, but no full stops should be used in the title.

3) The column captions: the vertical headings and subheadings of the columns are called column captions. Only the first letter of the first word of a column caption is capitalized; the remaining words must be written in small letters.

4) The row captions: the horizontal headings and subheadings of the rows are called row captions. Only the first letter of the first word of a row caption is capitalized; the remaining words must be written in small letters.

5) The body: the main part of the table, which contains the numerical information classified with respect to the row and column captions.

6) Units of measurement: if appropriate, the units of a column caption or row caption should be given in brackets.

7) Foot notes: these appear immediately below the body of the table and provide additional explanation.

8) Source notes: the source note is given at the end of the table and indicates the source from which the information has been taken. It includes information about the compiling agency, publication etc.

General Rules of Tabulation:

1. A table should be simple and attractive. There should be no need of further explanations.

2. Suitable approximation may be adopted and figures may be rounded off.

3. The unit of measurement should be well defined.

4. If the table has two variables,

– the independent variable should be on the columns

– the dependent variable should be on the rows

– the total should add up to the column variable

5. If the table is a composite table, rule 4 above changes to

– the independent variable on the rows

– the dependent variable on the columns

– the total adds up to the row variable

Presenting categorical data graphically

• Types of diagrams

• Present categorical data graphically (bar chart, component bar chart, cumulative bar chart, pie diagram)

• The most effective way to present information is by visual display

• Graphs are frequently used in statistical analyses

– As a means of uncovering patterns in a set of data

– As a means of conveying the important information in a concise and accurate fashion

Types of categorical data graphs

• Pictograph

• Pie chart

• Bar chart

• Dot plot

Pictograph

• Pictograph is a way of presenting categorical data using symbolic figures to match the frequencies of different categories of a variable.

• Every pictograph has a title, labels and a legend.

• Advantages of pictograph

1. A pictograph is easy to read.

2. They show trends in data.

3. They are fun to use.

4. Specially useful when data are presented to a large audience in a limited time.

• Disadvantages of pictograph

1. It may be difficult to find a suitable symbol or picture to represent the data.

2. The key can be confusing to the reader.

Pie chart

• A pie chart is useful for observing the relative sizes of the categories of a single variable

• Every pie chart has a title, labels and a legend to show what each category represents.

Disadvantages of Pie chart

1. They are best used for displaying statistical information when there are no more than six to seven categories in a variable; otherwise, the resulting picture will be too complex to understand.

2. Actual frequencies need to be converted to percentages.

3. Total frequencies are unknown unless specified.

4. It is difficult to compare two pie charts of the same variable in different groups.

5. An "Other" category, when present, can be a problem in interpretation.

6. Pie charts are not useful when the values of the components are similar, because it is difficult to see the differences between slice sizes.
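The conversion of actual frequencies to percentages mentioned in point 2 can be sketched as follows, using the frequencies of Table 2 (6 females, 4 males). The slice angle, computed as the same share of 360 degrees, is an addition made here for illustration and is not stated in the slides.

```python
# Frequencies taken from Table 2.
frequencies = {"Female": 6, "Male": 4}
total = sum(frequencies.values())

# For each category: (percentage of the total, slice angle in degrees).
slices = {
    cat: (100 * n / total, 360 * n / total)
    for cat, n in frequencies.items()
}

for cat, (pct, angle) in slices.items():
    print(f"{cat:<8}{pct:6.1f} %{angle:8.1f} degrees")
```

Note that the total of 10 observations no longer appears anywhere in the percentages, which is why total frequencies are unknown from a pie chart unless they are specified.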

Bar Chart and types of bar chart

• Bar chart is better if individual frequencies are of interest

• It can be used for more than just frequency data

• Comparison of categories is easier

• Can also be used to present numeric variables grouped into class intervals.

• A bar chart may be either horizontal or vertical.

Simple bar chart

Compound bar chart

• A compound bar graph represents data of two variables of which one should be a categorical variable and the other a categorical or a discrete variable

Component bar graphs

• The component bar graph is a preliminary data analysis tool used to show segments of a total.

• The component bar graph can be very difficult to interpret

– if there are too many segments in a bar.

– if the bars are not standardized ( that is if each bar does not add up to 100%)

Dot Plot

• A dot plot is a type of graphic display used to compare the frequency of categories of a variable.

• It is made up of dots plotted on a graph.

• Each dot represents a single observation from a set of data.

• The dots are stacked in columns over categories, so that the height of the column represents the relative or absolute frequency of observations in the category.

• The pattern of data in a dot plot can be described in terms of symmetry and skewness only if the variable is discrete.

• Compared to other types of graphic display, dot plots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.

• Numerical variables which belong to either interval or ratio scale can also be presented using frequency distribution and graphs after grouping.
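The stacking of dots described for the dot plot can be sketched in plain text. The example below reuses the frequencies of Table 2 (6 females, 4 males) and prints one asterisk per observation; the column width of 8 characters is an arbitrary choice for the illustration.

```python
# Frequencies taken from Table 2; each observation becomes one dot.
counts = {"Female": 6, "Male": 4}
cell = 8  # width of each column in characters (arbitrary)

rows = []
# Build the plot from the top level down to the first level.
for level in range(max(counts.values()), 0, -1):
    line = "".join(
        ("*" if n >= level else " ").center(cell) for n in counts.values()
    )
    rows.append(line.rstrip())

# Category labels go underneath the columns of dots.
rows.append("".join(cat.center(cell) for cat in counts).rstrip())

dot_plot = "\n".join(rows)
print(dot_plot)
```

The height of each column equals the absolute frequency of its category, so the taller column is read directly as the more frequent category.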

Presenting Quantitative data graphically

• Histogram

• Frequency polygon

• Stem and leaf plot

• Box plot

• Scatter plot

Histogram

• A histogram is used mainly to summarize continuous data and is seldom used for discrete data

• A pictorial representation of the frequency distribution of a numerical variable

Characteristics of a Histogram

• A histogram is made up of columns plotted on a graph. Unlike in a bar chart, there are no spaces between bars in a histogram.

• A histogram will have bars of equal width, although this is not the case when class intervals vary in size.

• Values of the variable being studied are presented on the horizontal x-axis.

• The histogram is used for variables whose values are numerical and measured on an interval/ratio scale. It is generally used when dealing with large data sets (greater than 100 observations).

• A histogram can also help detect any unusual observations (outliers) or any gaps in the data.

• Disadvantages

• Cannot read frequencies of exact values because data is divided into class intervals.

• Cannot be drawn for two variables together.

• Use only with continuous data and rarely with discrete data.

Difference between Bar Charts and Histograms

Frequency Polygon

• A frequency polygon is a graph drawn by joining the midpoints of histogram column tops.

• Frequency Polygon is used only when depicting data from a continuous variable.
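The points that a frequency polygon joins can be computed as in the sketch below. It assumes the six haemoglobin class intervals (6.0 – 7.9 through 16.0 – 17.9 g/dl) with frequencies 1, 8, 14, 9, 2 and 1 counted from the 35 listed values, and it takes the midpoint of each class as the average of its lower and upper limits.

```python
# Class limits (g/dl) and the frequencies counted from the haemoglobin example.
class_limits = [(6.0, 7.9), (8.0, 9.9), (10.0, 11.9),
                (12.0, 13.9), (14.0, 15.9), (16.0, 17.9)]
frequencies = [1, 8, 14, 9, 2, 1]

# Midpoint of each class = (lower limit + upper limit) / 2
midpoints = [round((lo + hi) / 2, 2) for lo, hi in class_limits]

# Each (midpoint, frequency) pair is one vertex of the frequency polygon.
points = list(zip(midpoints, frequencies))

for x, y in points:
    print(f"midpoint {x:5.2f} g/dl  frequency {y}")
```

Joining these points in order with straight lines gives the polygon; plotting them is left to whichever charting tool is available.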

Stem and leaf plot

• A stem plot is used to display quantitative data.

• A stem and leaf plot shows the exact values of individual observations.

• Advantages

– Concise representation of data

– Shows range, minimum & maximum, gaps & clusters, and outliers easily

– Can handle extremely large data sets

– Shows the shape of the distribution

– When observations are limited, it can easily be constructed manually

• Disadvantages

– Not visually appealing

– When there is a large number of observations, it is difficult to manage manually.
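A stem and leaf plot of the 35 haemoglobin values from the earlier example could be built as sketched below. Using the whole-number part as the stem and the first decimal as the leaf is an assumption made for this illustration; other splits are possible.

```python
from collections import defaultdict

# The 35 haemoglobin values (g/dl) from the example.
haemoglobin = [
    7.7, 8.6, 8.7, 9.0, 9.2, 9.4, 9.8, 9.9, 9.9, 10.1, 10.2, 10.2,
    10.6, 10.9, 10.9, 11.0, 11.0, 11.3, 11.4, 11.6, 11.6, 11.7, 11.8,
    12.2, 12.3, 12.3, 12.4, 12.5, 12.5, 12.6, 12.6, 13.4, 15.2, 15.3, 16.7,
]

stems = defaultdict(list)
for x in sorted(haemoglobin):
    # Work in tenths to avoid floating-point surprises: 12.3 -> 123 -> (12, 3)
    stem, leaf = divmod(round(x * 10), 10)
    stems[stem].append(leaf)

# Print every stem between the smallest and largest, so gaps stay visible.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Every individual value can still be read back from the plot, and the empty stem 14 shows a gap in the data.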

Box and whisker plot

• A box plot, sometimes called a box and whisker plot, is a quick graphic approach for examining one or more sets of continuous data.

• It demonstrates the following values:

– The median

– The upper quartile and lower quartile

– The maximum and minimum

– Outliers
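The values a box plot displays can be computed for the 35 haemoglobin values as sketched below. Two conventions here are assumptions not stated in the slides: the quartiles use the default ("exclusive") method of Python's statistics.quantiles, and outliers are flagged with the common rule of lying more than 1.5 times the interquartile range beyond the quartiles. Other software may use slightly different conventions.

```python
import statistics

# The 35 haemoglobin values (g/dl) from the example.
haemoglobin = [
    7.7, 8.6, 8.7, 9.0, 9.2, 9.4, 9.8, 9.9, 9.9, 10.1, 10.2, 10.2,
    10.6, 10.9, 10.9, 11.0, 11.0, 11.3, 11.4, 11.6, 11.6, 11.7, 11.8,
    12.2, 12.3, 12.3, 12.4, 12.5, 12.5, 12.6, 12.6, 13.4, 15.2, 15.3, 16.7,
]
data = sorted(haemoglobin)

# Lower quartile, median and upper quartile (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)

# Assumed outlier rule: beyond 1.5 x IQR from the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("minimum:", data[0], "maximum:", data[-1])
print("lower quartile:", q1, "median:", median, "upper quartile:", q3)
print("outliers:", outliers)
```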

Advantages of Box plot

1. Graphically display a continuous variable’s important locations (quartiles, median) and spread at a glance.

2. Provide some indication of the data symmetry and skewness.

3. Unlike many other methods of data display, box plots show outliers.

4. When box plots are drawn for a certain variable for different categories and placed side by side, it is easy to compare the statistical measures between categories (as shown in the figure above).

Disadvantages of Box plot

• They tend to emphasize the tails of the distribution, which may be the least important points in a data set.

• They hide some details of a distribution, such as the shape of the distribution, the area under the curve etc.

Scatter plot

• A scatter plot is a two-dimensional plot with one variable plotted on the horizontal axis (X axis – independent variable) and the other on the vertical axis (Y axis – dependent variable).

• The pattern of the data points on the scatter plot reveals the nature of the relationship between the two variables.

• positive relationships between variables

• negative relationships between variables

• scattered data points (without any relationships)

• non-linear patterns

• outliers.

Selecting an appropriate graphical way of presenting data

Thank You!!!