
220CT Coursework

Question #1: Database Design (This task is worth 20 marks)

The International Space Station (ISS) is a habitable artificial satellite in low Earth orbit. It is the ninth space station to be inhabited by crews following previous orbital stations that were launched by the US, the former Soviet Union and later Russia. The ISS is intended to be a laboratory, observatory and factory in space as well as to provide transportation, maintenance, and act as a staging base for possible future missions to the Moon, Mars and beyond. In order to support the crew and overall operation of ISS the space agencies in charge of running the station conduct regular missions to launch spacecraft carrying payloads of essential or replacement equipment up to ISS. A payload inventory, see table below, is recorded of each mission, consisting of the space agency leading the mission and the equipment payload to be sent up to ISS. The overall weight of the payload is also determined in order to calculate the fuel needed for orbital insertion of the spacecraft to successfully rendezvous with ISS.

| Mission No. | Agcy No. | Lead Agency | Country | Mission Date | Equipment | Qty | Item Weight | Total Weight |
|---|---|---|---|---|---|---|---|---|
| ISS-2237 | 178 | JAXA | Japan | 14/12/2013 | Potable water dispenser | 2 | 100kg | 211kg |
| | | | | | Flexible air duct | 6 | 0.5kg | |
| | | | | | Small storage Rack | 4 | 2kg | |
| ISS-3664 | 526 | ESA | EU | 16/01/2014 | Bio filter | 6 | 0.20kg | 1.20kg |
| ISS-2356 | 167 | NASA | USA | 12/02/2014 | Small storage Rack | 3 | 2kg | 69kg |
| | | | | | Battery pack | 2 | 5kg | |
| | | | | | Urine transfer tubing | 2 | 1.5kg | |
| | | | | | O2 scrubber | 1 | 50kg | |
| ISS-1234 | 032 | Roskosmos | Russia | 16/04/2014 | Small storage Rack | 1 | 2kg | 2.5kg |
| | | | | | Flexible air duct | 2 | 0.5kg | |

Currently there is no database being used for managing the payload inventory information in the table above.

This task is split up into two parts:


1. In its current form, it’s a traditional DB. Keep it that way? Your call. Explain your decision.

For this example I would keep the data in a traditional relational database because it is a small dataset that can be split into tables, such as Mission, Agency and Equipment, using normalization. The database does not need to be flexible, so SQL’s rigid schema is suitable for this example (Albodour 2015). Furthermore, this example does not require fast response times, so a traditional relational database can be used.

2. Design the database for the information above. (Hint: relationships? Tables? Data?) Then implement the DB using the method of your choice (SQL, MongoDB, CassandraDB or GraphDB).

Normalization

1NF

The original table was not in 1NF because ‘Equipment’, together with its quantity and item weight, held multiple values for each mission. The solution was to create an ‘Equipment’ table that carries a copy of the key (Mission No.) from the un-normalised table. In the tables below, (*) marks a foreign key.

| Mission No. (*) | Equipment | Qty | Item Weight |
|---|---|---|---|
| ISS-2237 | Potable water dispenser | 2 | 100kg |
| ISS-2237 | Flexible air duct | 6 | 0.5kg |
| ISS-2237 | Small storage Rack | 4 | 2kg |
| ISS-3664 | Bio filter | 6 | 0.20kg |
| ISS-2356 | Small storage Rack | 3 | 2kg |
| ISS-2356 | Battery pack | 2 | 5kg |
| ISS-2356 | Urine transfer tubing | 2 | 1.5kg |
| ISS-2356 | O2 scrubber | 1 | 50kg |
| ISS-1234 | Small storage Rack | 1 | 2kg |
| ISS-1234 | Flexible air duct | 2 | 0.5kg |

| Mission No. | Agcy No. | Lead Agency | Country | Mission Date | Total Weight |
|---|---|---|---|---|---|
| ISS-2237 | 178 | JAXA | Japan | 14/12/2013 | 211kg |
| ISS-3664 | 526 | ESA | EU | 16/01/2014 | 1.20kg |
| ISS-2356 | 167 | NASA | USA | 12/02/2014 | 69kg |
| ISS-1234 | 032 | Roskosmos | Russia | 16/04/2014 | 2.5kg |

2NF

To transform the 1NF tables into 2NF, any non-key attribute that depends on only part of the table key has to be moved into a new table. Item weight depends on the equipment alone and not on the mission, so a new table must be created to hold each equipment item and its weight.

Mission:

| Mission No. | Agcy No. | Lead Agency | Country | Mission Date | Total Weight |
|---|---|---|---|---|---|
| ISS-2237 | 178 | JAXA | Japan | 14/12/2013 | 211kg |
| ISS-3664 | 526 | ESA | EU | 16/01/2014 | 1.20kg |
| ISS-2356 | 167 | NASA | USA | 12/02/2014 | 69kg |
| ISS-1234 | 032 | Roskosmos | Russia | 16/04/2014 | 2.5kg |

Equipment:

| Mission No. (*) | Equipment (*) | Qty |
|---|---|---|
| ISS-2237 | Potable water dispenser | 2 |
| ISS-2237 | Flexible air duct | 6 |
| ISS-2237 | Small storage Rack | 4 |
| ISS-3664 | Bio filter | 6 |
| ISS-2356 | Small storage Rack | 3 |
| ISS-2356 | Battery pack | 2 |
| ISS-2356 | Urine transfer tubing | 2 |
| ISS-2356 | O2 scrubber | 1 |
| ISS-1234 | Small storage Rack | 1 |
| ISS-1234 | Flexible air duct | 2 |

Inventory:

| Equipment | Item Weight |
|---|---|
| Potable water dispenser | 100kg |
| Flexible air duct | 0.5kg |
| Small storage Rack | 2kg |
| Bio filter | 0.20kg |
| Battery pack | 5kg |
| Urine transfer tubing | 1.5kg |
| O2 scrubber | 50kg |

3NF

To transform from 2NF to 3NF, any non-key attribute that depends on another non-key attribute rather than on the table key must be removed and placed in a new table. Lead agency and country depend on the agency number, so an agency table must be created. No single attribute of the equipment table is unique, so it has no single-column primary key; instead it carries two foreign keys that link it to the inventory and mission tables.

Mission:

| Mission No. | Agcy No. (*) | Mission Date | Total Weight |
|---|---|---|---|
| ISS-2237 | 178 | 14/12/2013 | 211kg |
| ISS-3664 | 526 | 16/01/2014 | 1.20kg |
| ISS-2356 | 167 | 12/02/2014 | 69kg |
| ISS-1234 | 032 | 16/04/2014 | 2.5kg |

Agency:

| Agcy No. | Lead Agency | Country |
|---|---|---|
| 178 | JAXA | Japan |
| 526 | ESA | EU |
| 167 | NASA | USA |
| 032 | Roskosmos | Russia |

Equipment:

| Mission No. (*) | Equipment (*) | Qty |
|---|---|---|
| ISS-2237 | Potable water dispenser | 2 |
| ISS-2237 | Flexible air duct | 6 |
| ISS-2237 | Small storage Rack | 4 |
| ISS-3664 | Bio filter | 6 |
| ISS-2356 | Small storage Rack | 3 |
| ISS-2356 | Battery pack | 2 |
| ISS-2356 | Urine transfer tubing | 2 |
| ISS-2356 | O2 scrubber | 1 |
| ISS-1234 | Small storage Rack | 1 |
| ISS-1234 | Flexible air duct | 2 |

Inventory:

| Equipment | Item Weight |
|---|---|
| Potable water dispenser | 100kg |
| Flexible air duct | 0.5kg |
| Small storage Rack | 2kg |
| Bio filter | 0.20kg |
| Battery pack | 5kg |
| Urine transfer tubing | 1.5kg |
| O2 scrubber | 50kg |

E-R Diagram

The entities in the E-R diagram are mission, agency, equipment and inventory. The relationships between the entities are:

An agency must have at least one mission, and it could have many because further missions may occur later on.

A mission must have at least one equipment item and can have many.

Equipment must have at least one item from the inventory and can have many.

[E-R diagram: entities MISSION, INVENTORY, AGENCY and EQUIPMENT, connected by relationships labelled Has, Part_of, Includes and Included_on]
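The dependencies behind the 2NF and 3NF steps can be checked against the rows of the payload table. The following is a minimal sketch and not part of the original answer; the helper name `determines` and the column names are assumptions made for illustration. A check of this kind can only show that the sample rows do not contradict a dependency; it cannot prove that the dependency holds in general.

```python
# Flattened rows of the payload table (weights in kg).
COLUMNS = ("mission", "agency_no", "agency", "country", "item", "qty", "item_weight")
ROWS = [
    ("ISS-2237", "178", "JAXA", "Japan", "Potable water dispenser", 2, 100.0),
    ("ISS-2237", "178", "JAXA", "Japan", "Flexible air duct", 6, 0.5),
    ("ISS-2237", "178", "JAXA", "Japan", "Small storage Rack", 4, 2.0),
    ("ISS-3664", "526", "ESA", "EU", "Bio filter", 6, 0.2),
    ("ISS-2356", "167", "NASA", "USA", "Small storage Rack", 3, 2.0),
    ("ISS-2356", "167", "NASA", "USA", "Battery pack", 2, 5.0),
    ("ISS-2356", "167", "NASA", "USA", "Urine transfer tubing", 2, 1.5),
    ("ISS-2356", "167", "NASA", "USA", "O2 scrubber", 1, 50.0),
    ("ISS-1234", "032", "Roskosmos", "Russia", "Small storage Rack", 1, 2.0),
    ("ISS-1234", "032", "Roskosmos", "Russia", "Flexible air duct", 2, 0.5),
]


def determines(rows, lhs, rhs):
    """Return True if no two rows agree on the lhs columns but differ on the rhs columns."""
    seen = {}
    for row in rows:
        record = dict(zip(COLUMNS, row))
        key = tuple(record[c] for c in lhs)
        value = tuple(record[c] for c in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Partial dependency removed in 2NF: item weight depends on the item alone.
partial_dependency = determines(ROWS, ["item"], ["item_weight"])
# Transitive dependency removed in 3NF: agency details depend on the agency number.
transitive_dependency = determines(ROWS, ["agency_no"], ["agency", "country"])
# Quantity needs the whole (mission, item) key, so it stays in the equipment table.
qty_needs_full_key = (
    determines(ROWS, ["mission", "item"], ["qty"])
    and not determines(ROWS, ["item"], ["qty"])
)
print(partial_dependency, transitive_dependency, qty_needs_full_key)
```

All three checks hold for the ten rows above, which matches the split into inventory, agency and equipment tables.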


Implementing in SQL

-- One row per mission; total_weight is in kilograms.
CREATE TABLE mission(
    mis_no       VARCHAR2(8) PRIMARY KEY,
    agency_no    NUMBER(3)   NOT NULL,
    mis_date     DATE        NOT NULL,
    total_weight NUMBER(5,2) NOT NULL);

-- One row per kind of item; item_weight is in kilograms.
CREATE TABLE inventory(
    item        VARCHAR2(23) PRIMARY KEY,
    item_weight NUMBER(5,2)  NOT NULL);

-- Links each mission to the items it carries.
CREATE TABLE equipment(
    mis_no   VARCHAR2(8)  NOT NULL,
    item     VARCHAR2(23) NOT NULL,
    quantity NUMBER(1)    NOT NULL);

-- agency_no is numeric, so agency 032 is stored as 32.
CREATE TABLE agency(
    agency_no   NUMBER(3)   PRIMARY KEY,
    lead_agency VARCHAR2(9) NOT NULL,
    country     VARCHAR2(6) NOT NULL);

ALTER TABLE mission
    ADD CONSTRAINT AGENCY_NO_FK FOREIGN KEY (AGENCY_NO) REFERENCES agency (AGENCY_NO);

ALTER TABLE equipment
    ADD CONSTRAINT MIS_NO_FK FOREIGN KEY (MIS_NO) REFERENCES mission (MIS_NO);

ALTER TABLE equipment
    ADD CONSTRAINT ITEM_FK FOREIGN KEY (ITEM) REFERENCES inventory (ITEM);
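The equipment table above has no primary key, so nothing stops the same item being entered twice for one mission. One possible refinement, which is not part of the original schema, is a composite primary key on (mis_no, item). The sketch below tries this in SQLite through Python's `sqlite3` module rather than in Oracle, so the column types differ from the statements above, and the helper `try_insert` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE mission  (mis_no TEXT PRIMARY KEY);
CREATE TABLE inventory(item TEXT PRIMARY KEY, item_weight REAL NOT NULL);
CREATE TABLE equipment(
    mis_no   TEXT    NOT NULL REFERENCES mission(mis_no),
    item     TEXT    NOT NULL REFERENCES inventory(item),
    quantity INTEGER NOT NULL,
    PRIMARY KEY (mis_no, item)   -- assumed addition, not in the original design
);
INSERT INTO mission   VALUES ('ISS-3664');
INSERT INTO inventory VALUES ('Bio filter', 0.20);
INSERT INTO equipment VALUES ('ISS-3664', 'Bio filter', 6);
""")


def try_insert(row):
    """Attempt to add an equipment row and report whether the constraints allowed it."""
    try:
        conn.execute("INSERT INTO equipment VALUES (?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


duplicate = try_insert(("ISS-3664", "Bio filter", 1))  # same mission and item again
orphan = try_insert(("ISS-9999", "Bio filter", 1))     # mission that does not exist
print(duplicate)
print(orphan)
```

Both inserts are rejected: the first by the composite key and the second by the foreign key to the mission table.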

INSERT INTO agency VALUES (178, 'JAXA', 'Japan');
INSERT INTO agency VALUES (526, 'ESA', 'EU');
INSERT INTO agency VALUES (167, 'NASA', 'USA');
INSERT INTO agency VALUES (032, 'Roskosmos', 'Russia');

INSERT INTO mission VALUES ('ISS-2237', 178, '14-Dec-13', 211);
INSERT INTO mission VALUES ('ISS-3664', 526, '16-Jan-14', 1.20);
INSERT INTO mission VALUES ('ISS-2356', 167, '12-Feb-14', 69);
INSERT INTO mission VALUES ('ISS-1234', 032, '16-Apr-14', 2.5);

INSERT INTO inventory VALUES ('Potable water dispenser', 100);
INSERT INTO inventory VALUES ('Flexible air duct', 0.5);
INSERT INTO inventory VALUES ('Small storage rack', 2);
INSERT INTO inventory VALUES ('Bio filter', 0.20);
INSERT INTO inventory VALUES ('Battery Pack', 5);
INSERT INTO inventory VALUES ('Urine transfer tubing', 1.5);
INSERT INTO inventory VALUES ('O2 Scrubber', 50);

INSERT INTO equipment VALUES ('ISS-2237', 'Potable water dispenser', 2);
INSERT INTO equipment VALUES ('ISS-2237', 'Flexible air duct', 6);
INSERT INTO equipment VALUES ('ISS-2237', 'Small storage rack', 4);
INSERT INTO equipment VALUES ('ISS-3664', 'Bio filter', 6);
INSERT INTO equipment VALUES ('ISS-2356', 'Small storage rack', 3);
INSERT INTO equipment VALUES ('ISS-2356', 'Battery Pack', 2);
INSERT INTO equipment VALUES ('ISS-2356', 'Urine transfer tubing', 2);
INSERT INTO equipment VALUES ('ISS-2356', 'O2 Scrubber', 1);
INSERT INTO equipment VALUES ('ISS-1234', 'Small storage rack', 1);
INSERT INTO equipment VALUES ('ISS-1234', 'Flexible air duct', 2);
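Because each mission's total weight is the sum of quantity times item weight, the stored totals can be cross-checked against the equipment and inventory rows. The sketch below does this in SQLite through Python's `sqlite3` module, which is an assumption made for illustration (the original statements target Oracle), and it uses simplified column types. With the figures from the source table it reproduces 211 kg, 1.2 kg and 69 kg, but gives 3 kg for ISS-1234 where the table records 2.5 kg.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mission  (mis_no TEXT PRIMARY KEY, total_weight REAL);
CREATE TABLE inventory(item TEXT PRIMARY KEY, item_weight REAL);
CREATE TABLE equipment(mis_no TEXT, item TEXT, quantity INTEGER);
INSERT INTO mission VALUES
    ('ISS-2237', 211), ('ISS-3664', 1.20), ('ISS-2356', 69), ('ISS-1234', 2.5);
INSERT INTO inventory VALUES
    ('Potable water dispenser', 100), ('Flexible air duct', 0.5),
    ('Small storage rack', 2), ('Bio filter', 0.20), ('Battery Pack', 5),
    ('Urine transfer tubing', 1.5), ('O2 Scrubber', 50);
INSERT INTO equipment VALUES
    ('ISS-2237', 'Potable water dispenser', 2), ('ISS-2237', 'Flexible air duct', 6),
    ('ISS-2237', 'Small storage rack', 4), ('ISS-3664', 'Bio filter', 6),
    ('ISS-2356', 'Small storage rack', 3), ('ISS-2356', 'Battery Pack', 2),
    ('ISS-2356', 'Urine transfer tubing', 2), ('ISS-2356', 'O2 Scrubber', 1),
    ('ISS-1234', 'Small storage rack', 1), ('ISS-1234', 'Flexible air duct', 2);
""")

# Recompute each mission's payload weight from its equipment rows.
rows = conn.execute("""
    SELECT m.mis_no, m.total_weight, SUM(e.quantity * i.item_weight) AS computed
    FROM mission m
    JOIN equipment e ON e.mis_no = m.mis_no
    JOIN inventory i ON i.item = e.item
    GROUP BY m.mis_no, m.total_weight
    ORDER BY m.mis_no
""").fetchall()

# A small tolerance absorbs floating-point rounding in the sums.
mismatches = [mis_no for mis_no, stored, computed in rows if abs(stored - computed) > 1e-6]
for mis_no, stored, computed in rows:
    print(mis_no, stored, round(computed, 2))
print("stored total differs from computed total:", mismatches)
```

A total that can be recomputed in this way is a derived attribute, so one design option is to calculate it in a query or view instead of storing it.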


Query Examples

1. Produce a list of missions in descending order of their total weight

2. Produce a list of all the missions that have a total weight less than 69kg

3. Produce a list of all the missions carried out in between 01-Dec-2013 and 20-Feb-2014
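The statements for these three queries are not shown in the text. One possible form of each is sketched below, run in SQLite through Python's `sqlite3` module rather than in Oracle. The ISO date strings, the simplified column types and the dictionary keys are assumptions made so that the example is self-contained; in Oracle the date comparison would normally use DATE literals or TO_DATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mission(
    mis_no       TEXT PRIMARY KEY,
    agency_no    INTEGER NOT NULL,
    mis_date     TEXT    NOT NULL,   -- ISO yyyy-mm-dd so that text comparison follows date order
    total_weight REAL    NOT NULL);
INSERT INTO mission VALUES
    ('ISS-2237', 178, '2013-12-14', 211),
    ('ISS-3664', 526, '2014-01-16', 1.20),
    ('ISS-2356', 167, '2014-02-12', 69),
    ('ISS-1234',  32, '2014-04-16', 2.5);
""")

QUERIES = {
    # 1. Missions in descending order of total weight.
    "by_weight_desc": """
        SELECT mis_no, total_weight FROM mission
        ORDER BY total_weight DESC""",
    # 2. Missions with a total weight of less than 69 kg.
    "lighter_than_69": """
        SELECT mis_no, total_weight FROM mission
        WHERE total_weight < 69
        ORDER BY total_weight""",
    # 3. Missions carried out between 01-Dec-2013 and 20-Feb-2014.
    "dec_to_feb": """
        SELECT mis_no, mis_date FROM mission
        WHERE mis_date BETWEEN '2013-12-01' AND '2014-02-20'
        ORDER BY mis_date""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, found in results.items():
    print(name, found)
```

With the four missions above, query 2 excludes ISS-2356 because its weight is exactly 69 kg, and query 3 excludes ISS-1234 because it flew in April 2014.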


Question 3 – A data mining system for a bank (This task is worth 25 Marks)

A bank has been collecting a great deal of data on its customers and has heard that the use of data mining could increase its competitiveness. They would like you to create a brief report that includes the following.

i. What data mining is and an appropriate application for the bank.

One definition of data mining is “the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro & Smyth 1996). Another is the “process of analysing data from different perspectives and summarizing” it into beneficial information (Frand n.d). Data mining takes large data sets and discovers patterns that turn the data into something understandable, which organizations can use to generate new business. Data mining would benefit the bank because it could be used to detect fraud, assess credit risk on applications and tailor specific products towards customers (Moin & Ahmed n.d).

ii. How you would go about creating the system using the data mining lifecycle below.

Problem Definition

The aim is to maintain customer loyalty, advertise specific products to customers and increase the number of customers, by determining which products should be advertised to specific customers.

Data Gathering and Preparation

The bank should use the following attributes: the current status of the person’s checking account, their credit history, savings account, employment history, job, personal status, age and purpose. From these the bank can create a case table for mining. A data sample is not required because there are only 1000 instances in the database. However, if there were over 2000 instances then a data sample could be used, because a representative random sample is more efficient to mine, which makes it more cost-effective, and the results produced are similar to those produced from the entire database (University of Nevada 2003).
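The sampling rule described above can be sketched as follows. The function name `mining_sample`, the 50% sampling fraction and the fixed seed are assumptions made for illustration; only the 2000-instance threshold comes from the text.

```python
import random


def mining_sample(instances, threshold=2000, fraction=0.5, seed=42):
    """Return every instance, or a simple random sample once the set exceeds the threshold."""
    instances = list(instances)
    if len(instances) <= threshold:
        return instances
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return rng.sample(instances, int(len(instances) * fraction))


small = mining_sample(range(1000))  # like credit.arff: below the threshold, nothing is dropped
large = mining_sample(range(5000))  # above the threshold: half the instances are drawn
print(len(small), len(large))
```

Simple random sampling without replacement is only one option; a stratified sample that preserves the ratio of good to bad customers may be a better fit when one class is much rarer than the other.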

Model Building and Evaluation

Data mining is required to analyse patterns in customers’ transactions and loans, and to see whether a specific insurance product or a new bank account tends to be taken up within a certain age range. For example, customers who are in their 30s or above are expected to be buying or to own a house, and would therefore require home insurance. This could be used to advertise home insurance to these customers. A model could be created using clusters for age, which could be used to determine what services should be targeted at specific ages, such as student accounts for young people. Data mining could be used to determine whether a customer would require a loan, especially if they are self-employed, as they may need to buy supplies or be looking to expand their workforce. Data mining could also show which bank accounts are more popular, and these findings could be used to entice new customers. Furthermore, to increase the number of customers, data mining could be used to find out why customers switched to another bank, so that the bank can change its own offers to attract new customers (Bhasin 2006). Forecasting could also be used to predict whether a customer is going to transfer to another bank, by looking at the customer’s previous transactions and at whether they are no longer putting money into their savings account.

Use Knowledge

From the results, a report would be produced to outline the findings of the model. This could then be used to increase the competitiveness of the bank, because it would be able to market specific services towards specific customers based on the patterns found from data mining.

iii. Whether the small amount of data (credit.arff) collected so far by the bank, to see whether you feel that they are collecting the right data for the task of assessing credit worthiness.

Applications are evaluated based on the 5 C’s of credit. The five C’s are the following:

1. Character – assesses the individual’s willingness and ability to repay the loan. Therefore the attributes credit history, checking account, savings account and job will need to be analysed.

2. Capital – assesses the individual’s investment in a business or project. The attributes property and housing will need to be assessed.

3. Capacity – is the assessment of the individual’s ability to repay the loan with their current financial means. An individual’s savings account, checking account and credit history will need to be evaluated.

4. Conditions – measures the overall economic environment against the individual’s ability to repay loans.

5. Collateral – is what secures the loan in the event that the individual cannot repay it (M&T Bank 2015).

The data collected so far by the bank is the right data for assessing credit worthiness, because the current status of the person’s checking account, their credit history, savings account, employment history, job and purpose can be used to determine whether an individual should be allowed a loan or whether they should be offered a favourable rate (Investopedia n.d). However, whether the person is a foreign worker and whether they have a telephone are not relevant to assessing credit worthiness and should therefore be ignored. The attributes personal status and sex, and age, cannot be used to determine whether an individual should be given credit, because the Equal Credit Opportunity Act, which is enforced by the Federal Trade Commission, prohibits discrimination on the basis of sex, marital status and age when determining whether someone is credit worthy (Federal Trade Commission 2013).

iv. The use of a data mining model such as a multilayer perceptron or decision tree to determine a person’s credit worthiness. Note, you will need to use a data mining tool like WEKA to create your model and use the credit.arff data to train and test this model.

Decision Tree with All Attributes

Summary

Correctly Classified Instances 705 70.5%
Incorrectly Classified Instances 295 29.5%
Total Number of Instances 1000

Confusion Matrix

  a   b   <-- classified as
588 112 | a = good
183 117 | b = bad

Therefore, based on this model, 771 people are classified as credit worthy and 229 people are not.

Multilayer Perceptron with All Attributes

Summary

Correctly Classified Instances 715 71.5%
Incorrectly Classified Instances 285 28.5%
Total Number of Instances 1000

Confusion Matrix

  a   b   <-- classified as
561 139 | a = good
146 154 | b = bad

Therefore, based on this model, 707 people are classified as credit worthy and 293 people are not.

Decision Tree with Foreign Worker, Telephone, Personal Status and Age Attributes Removed

Summary

Correctly Classified Instances 711 71.1%
Incorrectly Classified Instances 289 28.9%
Total Number of Instances 1000

Confusion Matrix

  a   b   <-- classified as
579 121 | a = good
168 132 | b = bad

Overall, based on this model, 747 people are classified as credit worthy and 253 people are not.

Multilayer Perceptron with Foreign Worker, Telephone, Personal Status and Age Attributes Removed

Summary

Correctly Classified Instances 710 71%
Incorrectly Classified Instances 290 29%
Total Number of Instances 1000

Confusion Matrix

  a   b   <-- classified as
569 131 | a = good
159 141 | b = bad

Overall, 728 people are classified as credit worthy and 272 people are classified as not credit worthy.

Actual Credit Worthiness Results

I chose to represent the data as a decision tree and as a multilayer perceptron in order to compare the models. A decision tree is a structure that represents sets of decisions, whereas a multilayer perceptron is a neural network, a non-linear predictive model that learns through training (Frand n.d). Although the multilayer perceptron with all attributes has the highest accuracy in classifying the instances (71.5%), we are not allowed to use this model to determine credit worthiness because of the Equal Credit Opportunity Act. Comparing the decision tree and the multilayer perceptron with some of the attributes removed, I believe the best model for determining a person’s credit worthiness is the decision tree. The visualisation provided by the decision tree clearly shows the decision pathways used when calculating the credit worthiness of a person. In the results from the decision tree with some of the attributes removed, 747 people are classified as good and are therefore treated as credit worthy, which leaves 253 people who are not credit worthy. This decision tree classified 71.1% of the instances correctly, compared with 71% for the multilayer perceptron. Comparing the decision tree results with the actual credit worthiness results shows that the model predicts 47 more credit worthy people than there actually are, because the actual number of people who are not credit worthy is 300 rather than 253. The confusion matrix shows that this net figure is made up of 168 people who are not credit worthy but were classified as good, and 121 credit worthy people who were classified as bad. However, I would not recommend this model for determining credit worthiness in the real world, because only 71.1% of instances are correctly classified, so it is not a highly accurate model.
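The figures quoted from the four confusion matrices can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function name `summarise` and the dictionary keys are assumptions, while the matrix entries are the ones reported in the WEKA output above.

```python
# Rows are the actual classes (good, bad); columns are the predicted classes (good, bad).
MODELS = {
    "decision tree, all attributes": ((588, 112), (183, 117)),
    "multilayer perceptron, all attributes": ((561, 139), (146, 154)),
    "decision tree, attributes removed": ((579, 121), (168, 132)),
    "multilayer perceptron, attributes removed": ((569, 131), (159, 141)),
}


def summarise(matrix):
    """Derive accuracy and predicted class sizes from a 2x2 confusion matrix."""
    (good_as_good, good_as_bad), (bad_as_good, bad_as_bad) = matrix
    total = good_as_good + good_as_bad + bad_as_good + bad_as_bad
    return {
        "accuracy": (good_as_good + bad_as_bad) / total,
        "predicted_good": good_as_good + bad_as_good,
        "predicted_bad": good_as_bad + bad_as_bad,
        "bad_classified_good": bad_as_good,   # risky customers the model would accept
        "good_classified_bad": good_as_bad,   # sound customers the model would turn away
    }


SUMMARY = {name: summarise(matrix) for name, matrix in MODELS.items()}
for name, stats in SUMMARY.items():
    print(name, stats)
```

For the decision tree with attributes removed this gives an accuracy of 0.711 and 747 predicted good customers, of whom 168 are actually bad. For a lender, the count of bad customers classified as good is arguably the more costly error, so it is worth reading alongside overall accuracy.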


Question 4: Big Data Idea (30 marks)

Aim and Objectives

Aim

The aim of this project was to explore the crimes carried out in San Francisco between 6th January 2003 and 17th November 2015, in order to gain a clear understanding of how crime has changed over the years.

Objectives

Collect the data from the City and County of San Francisco

Create a dataset using Excel

Analyse the data

Visualisation of the data using Excel and Google Fusion Tables

Results and findings from the analysis

Future developments

Background

The history of crime is one of San Francisco’s tourist attractions. Alcatraz is a popular tourist attraction and was a federal prison from 1934 to 1963. It held notorious convicts such as Al “Scarface” Capone and Robert “Birdman of Alcatraz” Stroud (San Francisco Travel n.d) (History.com 2009). No prisoner is officially recorded as having escaped from Alcatraz; however, three prisoners, Clarence and John Anglin and Frank Morris, managed to construct a raft and set sail. They were never found and were therefore presumed to have drowned (SF Gate 2013).

San Francisco has one of the highest crime rates in America, and the overall crime rate is 114% higher than the national average (Area Vibes 2013). Violent crimes and property crimes are major contributors to San Francisco’s overall crime rate (Neighbourhood Scout n.d). The crime rate in San Francisco has risen in recent years, whilst the number of arrests has declined and the number of police staff has also decreased (SF Examiner 2015).

Reason why I picked this Project

The analysis of the San Francisco dataset will provide an understanding of the crimes committed between 2009 and 2015. This analysis could then be used in future forecasting to identify trends in how crime has changed or which crime is most active in certain areas. The results could then be used to target specific areas to lower the crime rate.

(Google Maps, 2015)


Acquiring the Data

I gathered the dataset from SF OpenData and was able to view the data in Excel. The dataset contains 1,842,719 instances of crimes between 1st January 2003 and 17th November 2015. The original dataset included the attributes Location and PdId, which I removed because I would not be using them for analysis. Once this was completed I added filters to every column, which allows efficient filtering for a specific result, such as a specific category. For example, if I wanted all the crimes that were robberies, the result would be all the instances with robbery as their category. I then filtered the data to select only the records from 2009 onwards, because this is the data I want to analyse.

(Original dataset above and the edited dataset is below)
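The same preparation steps (dropping Location and PdId, keeping records from 2009 onwards and filtering on a category) could also be scripted instead of done by hand in Excel. The sketch below uses a hypothetical three-row extract with only some of the columns; the function name `prepare` and the sample values are assumptions, and the DD-MM-YYYY date format follows the attribute description in this report.

```python
import csv
import io
from datetime import datetime

# Hypothetical three-row extract; the real export has 1,842,719 rows and more columns.
SAMPLE = """IncidntNum,Category,Date,PdDistrict,Location,PdId
1,ROBBERY,05-03-2008,MISSION,"(37.76, -122.42)",100
2,ROBBERY,17-11-2015,CENTRAL,"(37.79, -122.40)",200
3,ASSAULT,01-01-2009,PARK,"(37.77, -122.45)",300
"""

DROPPED = {"Location", "PdId"}  # attributes not used in the analysis


def prepare(handle, first_year=2009, category=None):
    """Drop unused columns, keep rows from first_year onwards, optionally keep one category."""
    kept = []
    for row in csv.DictReader(handle):
        year = datetime.strptime(row["Date"], "%d-%m-%Y").year
        if year < first_year:
            continue
        if category is not None and row["Category"].lower() != category.lower():
            continue
        kept.append({name: value for name, value in row.items() if name not in DROPPED})
    return kept


recent = prepare(io.StringIO(SAMPLE))
robberies = prepare(io.StringIO(SAMPLE), category="Robbery")
print(len(recent), [row["IncidntNum"] for row in robberies])
```

Reading the file row by row keeps memory use low, which matters for a file of this size more than it does for the sample here.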


Description of Attributes

IncidntNum – a unique incident number

Category
o Arson
o Assault
o Bad Checks
o Bribery
o Burglary
o Disorderly Conduct
o Driving Under the Influence
o Drug/Narcotic
o Drunkenness
o Embezzlement
o Extortion
o Family Offenses
o Forgery/Counterfeiting
o Fraud
o Gambling
o Kidnapping
o Larceny/Theft
o Liquor Laws
o Loitering
o Missing Person
o Non-Criminal
o Other Offenses
o Pornography/Obscene Mat
o Prostitution
o Recovered Vehicle
o Robbery
o Runaway
o Secondary Codes
o Sex Offenses Forcible
o Sex Offenses Non Forcible
o Stolen Property
o Suicide
o Suspicious OCC
o Trea
o Trespass
o Vandalism
o Vehicle Theft
o Warrants
o Weapon Laws

Descript – a description of the crime

DayOfWeek – Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday

Date – written in DD-MM-YYYY format

Time – written in HH:MM format

PdDistrict – name of the Police Department district
o Bayview
o Central
o Ingleside
o Mission
o Northern
o Park
o Richmond
o Southern
o Taraval
o Tenderloin

Resolution – how the crime incident was resolved
o Arrest, Booked
o Arrest, Cited
o Cleared – Contact juvenile for more info
o Complainant refuses to prosecute
o District attorney refuses to prosecute
o Exceptional clearance
o Juvenile Admonished
o Juvenile Booked
o Juvenile Cited
o Juvenile Diverted
o Located
o None
o Not prosecuted
o Prosecuted by outside agency
o Prosecuted for lesser offense
o Psychopathic case
o Unfounded

Address – approximate street address where the crime incident took place

X – longitude

Y – latitude


Analysis with Visualisation

Total Number of Each Crime from 2009-2015

I decided to look first at the total number of each crime over the 6 years and to represent the results as a bar graph, because it provides a simple visualisation of which crimes are most prevalent in San Francisco. To get the total number of incidents for each crime, I used the function =COUNTIF(range, criteria) in Excel. The range was B2:B1024989 and the criteria was each specific category, e.g. =COUNTIF(B2:B1024989, "Assault") gives a total of 89,144.

From this analysis the top three categories can be identified: larceny/theft, other offences and non-criminal incidents.
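The per-category COUNTIF totals have a direct counterpart in Python's `collections.Counter`, which tallies every category in a single pass. The sketch below uses a hypothetical handful of values standing in for column B of the spreadsheet, so its counts are not the real totals.

```python
from collections import Counter

# Hypothetical Category values standing in for column B of the sheet.
categories = [
    "Assault", "Larceny/Theft", "Assault",
    "Non-Criminal", "Larceny/Theft", "Larceny/Theft",
]

counts = Counter(categories)
assault_total = counts["Assault"]   # same idea as =COUNTIF(B2:B1024989, "Assault")
top_three = counts.most_common(3)   # categories ranked by number of incidents
print(assault_total, top_three)
```

One pass over the column yields every category's total, whereas the spreadsheet approach needs a separate COUNTIF formula for each of the 39 categories.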

[Figure: bar graph "Total Number of Each Crime in the 6 Years"; horizontal axis: Number of Incidents (0 to 250,000); vertical axis: Type of Crime]


Percentage of Each Resolution from 2009-2015

After looking at the total number of each crime, I chose to look at the total resolutions over the 6 years, because this provides an understanding of how successful the police departments and the courts were in dealing with offenders. I chose to represent the information as a pie chart because the resolutions can be shown as percentages, which I believe is the best way to compare them: you can easily see which resolutions are the most common.

The pie chart shows that the most common successful resolutions are arrested and booked, and arrested and cited. However, 60.9% of all incidents over the 6 years did not have a resolution, which may indicate a lack of evidence for prosecution or that the offender was let off with a warning.

Percentage of each Resolution over the 6 Years (pie chart; values listed in legend order)

| Resolution | Percentage |
|---|---|
| Arrest, Booked | 23.40% |
| Arrest, Cited | 7.86% |
| Cleared – Contact juvenile for more info | 0.03% |
| Complainant refuses to prosecute | 0.58% |
| District attorney refuses to prosecute | 0.32% |
| Exceptional clearance | 0.30% |
| Juvenile Admonished | 0.15% |
| Juvenile Booked | 0.65% |
| Juvenile Cited | 0.37% |
| Juvenile Diverted | 0.04% |
| Located | 1.96% |
| None | 60.90% |
| Not prosecuted | 0.10% |
| Prosecuted by outside agency | 0.25% |
| Prosecuted for lesser offense | 0.00% |
| Psychopathic case | 2.00% |
| Unfounded | 1.08% |


Crime Comparison

I have chosen the years 2009, 2012 and 2015 to compare each type of crime, so that I can see whether there is a trend. The scatter graph also shows which crimes occur the most in a specific year; for example, in 2015 there was a significant rise in larceny/theft compared with 2012 and 2009. However, we must take into consideration that the data for 2015 only runs up to 17th November 2015 and is therefore not a complete representation of the whole year.

The Category Numbers are as follows:

1. Arson | 11. Extortion | 21. Non-Criminal | 31. Stolen Property
2. Assault | 12. Family Offenses | 22. Other Offenses | 32. Suicide
3. Bad Checks | 13. Forgery/Counterfeiting | 23. Pornography/Obscene Mat | 33. Suspicious OCC
4. Bribery | 14. Fraud | 24. Prostitution | 34. Trea
5. Burglary | 15. Gambling | 25. Recovered Vehicle | 35. Trespass
6. Disorderly Conduct | 16. Kidnapping | 26. Robbery | 36. Vandalism
7. Driving Under the Influence | 17. Larceny/Theft | 27. Runaway | 37. Vehicle Theft
8. Drug/Narcotic | 18. Liquor Laws | 28. Secondary Codes | 38. Warrants
9. Drunkenness | 19. Loitering | 29. Sex Offenses Forcible | 39. Weapon Laws
10. Embezzlement | 20. Missing Person | 30. Sex Offenses Non Forcible |

[Figure: scatter graph "Crime Comparison"; horizontal axis: Category Number (1 to 39); vertical axis: Number of Incidents (0 to 40,000); series: 2009, 2012, 2015]


Resolution Comparison

I decided to represent the resolution comparison of 2009, 2012 and 2015 as a bar graph because the number of instances of each resolution type is shown side by side for each year, which enables quick comparison between the results.

From the graph, the top 3 resolutions for all 3 years are arrested and booked, arrested and cited, and no resolution. However, there has been a 25.6% rise in the number of incidents with no resolution: 96,669 in 2015 compared to 76,947 in 2009.

[Figure: bar graph, "Resolution Comparison". X axis: Resolution. Y axis: Number of Incidents (0 to 100,000). Series: 2009, 2012, 2015.]


Crimes with the Most Change

I used line graphs to show the change in the number of incidents for a particular crime because they show the rise and fall over the years. I chose three crimes, drug/narcotic, larceny/theft and other offenses, by selecting those with the largest differences between the three years in the scatter graph.

The drug/narcotic graph clearly shows that these crimes are on the decline: from 2009 to 2015 there was a 68.9% decrease.

The number of larceny/theft incidents has increased considerably over the 6 years. Between 2009 and 2012 there was a 21% increase and between 2012 and 2015 a 20.7% increase, which suggests that the number of larceny/theft incidents is growing at a steady rate.
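The percentages quoted here can be checked with a short calculation. The sketch below is illustrative only: `percent_change` is a helper name introduced for this example, and the counts are the data labels from the two line graphs, assigned to 2009, 2012 and 2015 in the order that reproduces the figures in the text.

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    if old == 0:
        raise ValueError("percentage change from zero is undefined")
    return (new - old) / old * 100


# Incident counts taken from the data labels on the line graphs.
drug_narcotic = {2009: 11950, 2012: 6447, 2015: 3715}
larceny_theft = {2009: 25584, 2012: 30973, 2015: 37393}

series_by_name = {
    "Drug/Narcotic": drug_narcotic,
    "Larceny/Theft": larceny_theft,
}

for name, series in series_by_name.items():
    for start, end in [(2009, 2012), (2012, 2015), (2009, 2015)]:
        change = percent_change(series[start], series[end])
        print(f"{name} {start} to {end}: {change:+.1f}%")
```

A negative result is a decrease and a positive result an increase, so the same helper covers both the drug/narcotic decline and the larceny/theft rise.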

[Figure: line graph, "Change in Drug/Narcotic". X axis: Year. Y axis: Number of Incidents (0 to 14,000). Data labels: 2009: 11,950; 2012: 6,447; 2015: 3,715.]

[Figure: line graph, "Change in Larceny/Theft". X axis: Year. Y axis: Number of Incidents (0 to 40,000). Data labels: 2009: 25,584; 2012: 30,973; 2015: 37,393.]


The number of other offenses incidents declined greatly between 2009 and 2012, whereas between 2012 and 2015 there was a reduction of only 1,056 incidents. However, the 2015 figures only run up to 17 November, so incidents after this date will change the result.

Crime in San Francisco in 2015

I imported the dataset into Google Fusion Tables and changed the X and Y attributes to the location data type, where Y represents the latitude and X represents the longitude (Google 2015; Google 2015a). I then filtered the crime incidents to show only those between 01/01/2015 and 17/11/2015.
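The same date filter and coordinate mapping can be expressed outside Fusion Tables. The following is a minimal sketch under stated assumptions, not the steps actually performed: the rows are made-up sample data, and the `Date` field name and its MM/DD/YYYY format are assumptions about the exported dataset. Only the roles of `X` (longitude) and `Y` (latitude) come from the description above.

```python
from datetime import date, datetime

# Hypothetical sample rows; the Date format is an assumption.
rows = [
    {"Date": "12/31/2014", "X": "-122.4194", "Y": "37.7749"},
    {"Date": "01/01/2015", "X": "-122.4089", "Y": "37.7837"},
    {"Date": "11/17/2015", "X": "-122.4313", "Y": "37.7739"},
    {"Date": "11/18/2015", "X": "-122.4477", "Y": "37.7689"},
]

START = date(2015, 1, 1)
END = date(2015, 11, 17)


def to_points(records, start=START, end=END):
    """Keep rows dated within [start, end] and return (latitude, longitude) pairs."""
    points = []
    for record in records:
        when = datetime.strptime(record["Date"], "%m/%d/%Y").date()
        if start <= when <= end:
            # Y holds the latitude and X the longitude.
            points.append((float(record["Y"]), float(record["X"])))
    return points


points = to_points(rows)
print(points)
```

Both boundary dates are included, so incidents on 1 January and 17 November 2015 are kept while those on either side are dropped.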

The map below shows all the crimes carried out in 2015; each red dot marks the location of one incident. This provides a geographical view: crime is most prevalent where the red dots are dense and less prevalent where they are more spread out.

[Figure: line graph, "Change in Other Offenses Incidents". X axis: Year. Y axis: Number of Incidents (0 to 30,000). Data labels: 2009: 24,690; 2012: 18,646; 2015: 17,590.]


I decided to look at larceny/theft as an example of what the map could be used for. The results can be filtered further by selecting the police department that responded to the crime, which indicates the area in which the crime occurred. Below is the map of all the larceny/theft incidents that took place in 2015.

I then filtered the results to show the police department that responded to the most incidents, which was the Northern Police Department. This map shows that the crimes are carried out in very similar locations.


Next I filtered the results to show the police department that responded to the fewest incidents, which was the Ingleside Police Department. The map below shows no distinct pattern in where the crimes take place, because the red dots are spread out rather than clustered in a specific area.
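Finding which police department responded to the most and the fewest incidents of one category amounts to a grouped count. The sketch below shows one way this could be done in Python; it is not the filtering performed in Fusion Tables. The rows are made-up sample data, and the field names `Category` and `PdDistrict` are assumptions about the exported dataset.

```python
from collections import Counter

# Hypothetical sample rows; the field names are assumptions.
rows = [
    {"Category": "LARCENY/THEFT", "PdDistrict": "NORTHERN"},
    {"Category": "LARCENY/THEFT", "PdDistrict": "NORTHERN"},
    {"Category": "LARCENY/THEFT", "PdDistrict": "CENTRAL"},
    {"Category": "LARCENY/THEFT", "PdDistrict": "CENTRAL"},
    {"Category": "LARCENY/THEFT", "PdDistrict": "NORTHERN"},
    {"Category": "LARCENY/THEFT", "PdDistrict": "INGLESIDE"},
    {"Category": "ASSAULT", "PdDistrict": "INGLESIDE"},
]


def district_counts(records, category):
    """Count incidents of one category per police district."""
    return Counter(
        record["PdDistrict"]
        for record in records
        if record["Category"] == category
    )


counts = district_counts(rows, "LARCENY/THEFT")
busiest = max(counts, key=counts.get)   # district with the most incidents
quietest = min(counts, key=counts.get)  # district with the fewest incidents
print(busiest, counts[busiest])
print(quietest, counts[quietest])
```

The rows for the busiest or quietest district can then be passed to whatever mapping tool is in use, giving one map per department as described above.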

Results and Findings

In conclusion, larceny/theft, other offenses and non-criminal incidents are the most common categories in San Francisco, with larceny/theft being a major contributor at 225,898 incidents within the 6 years. Also, 60.9% of all incidents did not have a resolution, which suggests the offender may have been let off with a warning. Some offenders may therefore repeat a crime because they are not deterred, which could have affected the total incident numbers for each crime category. This could be analysed if further data were provided.

Overall, the visualisations of the change between 2009, 2012 and 2015 clearly indicate that some types of crime, such as drug/narcotic and other offenses incidents, are on the decline. On the other hand, larceny/theft incidents are on the rise: between 2012 and 2015 there was a 20.7% increase.

The map visualisation of all crimes carried out in 2015 shows a high crime rate in the North East of San Francisco, particularly in the area policed by the Northern Police Department, whereas the Western areas have a lower crime rate.

Future Developments

The mapping of each crime could be used to target specific areas for raising community awareness and increasing the number of police staff. These areas could also be given additional surveillance in order to reduce the crime rate. The results could also be filtered further to show crimes with the same M.O. using the incident numbers. This could reveal patterns in where an offender feels comfortable carrying out a crime and could be used in criminal profiling. Furthermore, the mapping could be useful to people deciding where to live, because the crime rate would influence their decision.


List of References

Albodour, R. (2015) MongoDB Part 1 [online lecture] module 220CT, 16 November 2015. Coventry: Coventry University. Available from <https://prezi.com/3ax4trxxq4z6/mongodb-part-1/?utm_campaign=share&utm_medium=copy> [20 November 2015]

Area Vibes (2013) San Francisco, CA Crime Rates and Statistics [online] available from

<http://www.areavibes.com/san+francisco-ca/crime/> [1 December 2015]

Bhasin, M. L (2006) ‘Data Mining: A Competitive Tool in the Banking and Retail Industries’, The

Chartered Accountant [online] available from

<https://www.academia.edu/17141409/Data_Mining_A_Competitive_Tool_in_the_Banking>

[21 November 2015]

Fayyad, U, Piatetsky-Shapiro, G & Smyth, P (1996) ‘From Data Mining to Knowledge Discovery in

Databases’, AI Magazine [online] available from

<https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1230/1131>

[21 November 2015]

Federal Trade Commission (2013) Your Equal Opportunity Rights [online] available from

<http://www.consumer.ftc.gov/articles/0347-your-equal-credit-opportunity-rights>

[22 November 2015]

Frand, J (n.d) Data Mining [online] available from <http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm> [22 November 2015]

Google (2015) About Fusion Tables [online] available from

<https://support.google.com/fusiontables/answer/2571232> [30 November 2015]

Google (2015a) Create a Map: Fusion Tables [online] available from <https://support.google.com/fusiontables/answer/2527132?hl=en&topic=2573107&ctx=topic#mapsample> [30 November 2015]

Google Maps (2015) San Francisco [online] available from <https://www.google.co.uk/maps/place/San+Francisco,+CA,+USA/@37.7576171,-122.5776844,11z/data=!3m1!4b1!4m2!3m1!1s0x80859a6d00690021:0x4a501367f076adff> [3 December 2015]

History.com (2009) Alcatraz Island [online] available from

<http://www.history.com/topics/alcatraz> [1 December 2015]

Investopedia (n.d) Creditworthiness [online] available from

<http://www.investopedia.com/terms/c/credit-worthiness.asp> [22 November 2015]

M&T Bank (2015) The 5C’S of Credit [online] available from

<https://www.mtb.com/business/businessresourcecenter/Pages/FiveC.aspx>

[22 November 2015]

Moin, K. I & Ahmed, Q. B (n.d) ‘Use of Data Mining in Banking’, International Journal of Engineering

Research and Applications [online] available from

<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.416.7821&rep=rep1&type=pdf>

[21 November 2015]

Neighbourhood Scout (n.d) San Francisco Crime [online] available from

<http://www.neighborhoodscout.com/ca/san-francisco/crime/> [2 December 2015]

San Francisco Travel (n.d) Alcatraz [online] available from

<http://www.sanfrancisco.travel/alcatraz> [1 December 2015]


SF Examiner (2015) San Francisco Crime Rate Jumps Despite Fewer Arrests [online] available from

<http://www.sfexaminer.com/sf-crime-rate-jumps-despite-fewer-arrests/>

[2 December 2015]

SF Gate (2013) The 16 Most Infamous Crimes in Bay Area History [online] available from <http://www.sfgate.com/crime/slideshow/The-16-most-infamous-crimes-in-Bay-Area-history-72881/photo-3048055.php> [1 December 2015]

SF OpenData (2015) SFPD Incidents – From 1 January 2003 [online] available from <https://data.sfgov.org/data?category=&dept=&search=sfpd%20incidents&type=datasets> [28 November 2015]

University of Nevada (2003) Preparing Data for Data Mining [online] available from

<http://www.cabnr.unr.edu/gf/dm/chap02.pdf> [21 November 2015]