DATA AND BUSINESS PROCESS
INTELLIGENCE
PENTAHO PLATFORM
DEVELOPED AT:
BHAT, GANDHINAGAR-382428
DEVELOPED BY:
BHAGAT FARIDA H. (11ITUOS079)
SINGH SWATI (11ITUOS068)
GUIDED BY:
INTERNAL GUIDE: PROF. R.S. CHHAJED
EXTERNAL GUIDE: MR. VIJAY PATEL
Department of Information Technology.
Faculty of Technology,
Dharmsinh Desai University,
College Road, Nadiad- 387001.
CANDIDATE’S DECLARATION
We declare that the final semester report entitled “DATA AND BUSINESS PROCESS
INTELLIGENCE” is our own work conducted under the supervision of the external
guide Mr. Vijay Patel, Institute for Plasma Research, Bhat, Gandhinagar, and the internal
guide Prof. R.S. Chhajed, Faculty of Technology, DDU, Nadiad.
We further declare that, to the best of our knowledge, the report for B.TECH SEM-VIII
does not contain any part of work that has been submitted to this or any other
university without proper citation.
Farida Bhagat H.
Branch: Information Technology
Student ID: 11ITUOS079
Roll: IT-07
Singh Swati
Branch: Information Technology
Student ID: 11ITUOS068
Roll: IT-124
Submitted To:
PROF. R.S. CHHAJED,
Department of Information Technology,
Faculty of Technology,
Dharmsinh Desai University,
Nadiad
DDU (Faculty of Tech., Dept. of IT) i
DHARMSINH DESAI UNIVERSITY, NADIAD-387001, GUJARAT
CERTIFICATE
This is to certify that the project entitled “DATA AND BUSINESS PROCESS
INTELLIGENCE” is a bona fide report of the work carried out by
1) Miss BHAGAT FARIDA H., Student ID No: 11ITUOS079
2) Miss SINGH SWATI, Student ID No: 11ITUOS068
of the Department of Information Technology, semester VIII, under guidance and
supervision, for the award of the degree of Bachelor of Technology at Dharmsinh Desai
University, Gujarat. They were involved in project training during the academic year
2013-2014.
Prof. R.S.Chhajed
HOD, Department of Information Technology,
Faculty of Technology,
Dharmsinh Desai University, Nadiad
Date:
ACKNOWLEDGEMENTS
We are grateful to Mr. Amit Srivastava (Institute for Plasma Research) for giving us
this opportunity to work under the guidance of prominent Solution Expert in the field of
Software Engineering and also providing us with the required resources at the institute.
We are also thankful to Mr. Vijay Patel (Institute for Plasma Research) for guiding us
in our project and sharing valuable knowledge with us.
It gives us immense pleasure and satisfaction to present this report of the project
undertaken during the 8th semester of B.Tech. As it is the first step into our professional
life, we would like to take this opportunity to express our sincere thanks to several
people, without whose help and encouragement it would have been impossible for us to
carry out this work.
We would like to express thanks to our Head of Department Prof. R. S. Chhajed who
gave us an opportunity to undertake this work. We are grateful to him for his guidance in
the development process.
Finally, we would like to thank all Institute for Plasma Research employees, all the
faculty members of our college, and our friends and family members for providing their
support and continuous encouragement throughout the project.
Thank you
Bhagat Farida H.
Singh Swati
TABLE OF CONTENTS
ABSTRACT……………………………………………………………………………....1
COMPANY PROFILE………………………………………………………………......3
LIST OF FIGURES……………………………………………………………………...4
LIST OF TABLES……………………………………………………………………….6
1. INTRODUCTION…………………..……………………………………………….7
1.1 Project Details……………………………………………………………………7
1.2 Purpose…………………………………………………………………………....7
1.3 Scope………………………………………………………………………………7
1.4 Objective………………………………………………………………………….8
1.5 Technology and Literature Review……………………………………………..8
1.5.1 Alfresco ECM……………………………………………………………8
1.5.2 Pentaho Platform………………………………………………………...9
2. PROJECT MANAGEMENT………………………………………………………10
2.1 Feasibility Study………………………………………………………………...10
2.2 Project Planning………………………………………………………………...10
2.2.1 Project Development Approach……………………………………….10
2.2.2 Project Plan…………………………………………………………..…11
2.2.3 Milestones and Deliverables...……………………………………….…12
2.2.4 Project Scheduling………………………………………………….…..13
3. SYSTEM REQUIREMENTS STUDY………………………………………..….14
3.1 User Characteristics……………..……………………………………………..14
3.2 Hardware and Software Requirements…………………………………….…14
3.2.1 Hardware Requirements……………………………………………….14
3.2.2 Software Requirements……………………………………………..…14
3.3 Constraints…………………………………………………………………...…15
3.3.1 Regulatory Policies…………………………………………………..…15
3.3.2 Hardware Limitations………………………………………………….15
3.3.3 Interfaces to Other Applications………………………………………15
3.3.4 CMIS……………………………………………………………………15
3.3.5 Parallel Operations……………………………………………………..16
3.3.6 Reliability Requirements………………………………………………16
3.3.7 Criticality of the Application………………………………………….16
3.3.8 Safety and Security Considerations………………………………...…16
4. ALFRESCO ECM SYSTEM……………………………………………………...17
4.1 Introduction……………………………………………………………………..17
4.2 Alfresco Overview………………………………………………………………17
4.3 Architecture...…………………………………………………………………...19
4.3.1 Client.……………………………………………………………………19
4.3.2 Server……………………………………………………………………19
4.4 Data Storage in Alfresco……………………………………………………….21
4.5 Relationship Diagrams…………………………………………………………21
5. TRANSFORMATION PHASE……………..…………………………………….24
5.1 Introduction…………………………………………………………………….24
5.2 Pentaho Data Integration Tool….……………………………………………..24
5.2.1 Introduction…………………………………………………………….24
5.2.2 Why Pentaho?..........................................................................................25
5.2.2.1 JasperSoft vs Pentaho vs BIRT……………………………………25
5.2.2.2 Conclusion…………………………………………………………..26
5.2.3 Components of Pentaho………………………………………………..27
5.3 Alfresco Audit Analysis and Reporting Tool………………...……………….28
5.3.1 Introduction…………………………………………………………….28
5.3.2 Working and Installation of A.A.A.R. ..................................................29
5.3.2.1 Pre Requisites……………………………………………………….30
5.3.2.2 Enabling Alfresco Audit Service…………………………………...30
5.3.2.3 Data Mart Creation and Configuration………………………...…30
5.3.2.4 PDI Repository Setting..……………………………………………31
5.3.2.5 First Import………………………………………………………....36
5.3.3 Audit Data Mart………………………………………………………...36
5.3.4 Dimension Tables……………………………………………………….37
5.4 Transformations Using Spoon…………………………………………………38
5.5 Example Transformations………..………………………………………….…38
6. REPORTING PHASE……………..………………………………………….……42
6.1 What is a Report?.……………………………………………………………...42
6.2 Pentaho Report Designer Tool….……………………………………………...42
6.2.1 Introduction……………………………………………………………..42
6.2.2 Working of Pentaho Designer………………………………………….43
6.3 Example Reports………..………………………………………………………44
7. PUBLISHING PHASE……………..………………………………………………46
7.1 Introduction………………..…………………………………………………...46
7.2 Pentaho BI Server………...…………………………………………………….46
7.2.1 Introduction……………………………………………………………...46
7.2.2 Example Published Reports……………………………………………47
7.3 Scheduling of Transformations…………….………………………………….50
8. TESTING……………..……………………………………………………………..51
8.1 Testing Strategies….…………………………………………………………....51
8.2 Testing Methods………………………………………………………………...52
8.3 Test Cases……………………………………………………………………….53
8.3.1 User Login and Functionality of Report………………………………53
8.3.2 Viewing Documents, Folders, Permissions, Audits…………………...54
9. USER MANUAL……………………………………………………………………55
9.1 Description………………………………………………………………………55
9.2 Login Page………………………………………………………………………55
9.3 View Reports……………………………………………………………………57
9.4 Scheduling………………………………………………………………………59
9.5 Administration……………………………………………………………….....62
10. LIMITATIONS AND FUTURE ENHANCEMENTS……………………………64
10.1 Limitations……………………………………………………………………..64
10.2 Future Enhancements…………………………………………………………64
11. CONCLUSION AND DISCUSSION……………………………………………...65
11.1 Self Analysis of Project Viabilities……………………………………………65
11.1.1 Self Analysis……………………………………………………………...65
11.1.2 Project Viabilities………………………………………………….…….65
11.2 Problems Encountered and Possible Solutions……………………………...65
11.3 Summary of Project Work…………………………………………………...66
12. REFERENCES…………………………………………………………………….68
ABSTRACT
Design and implement a platform for a data and process intelligence tool.
IPR has selected Alfresco, an Enterprise Content Management (ECM) system, as an
Electronic Document and Record Management System (EDRMS). Alfresco does
not have powerful reporting functionality and, honestly, that is not its job.
Unfortunately, the need for powerful reporting is still there, and most of the
available answers are tricky solutions, quite hard to manage and scale. Alfresco ECM
has a detailed audit service that exposes a lot of (potentially) useful information.
Alfresco is integrated with Activiti, a Business Process Management (BPM)
engine. Activiti also has auditing functionality and exposes audit data related to
processes and tasks.
The Data and Process Intelligence tool (the project) will be divided into two parts. The
first part will be Alfresco data integration, which will provide a solution to
extract, transform, and load (ETL) data (document/folder/process/task), together
with the audit data at a very detailed level, into a central warehouse. On top of that,
it will provide data cleansing and merging functionality and, if needed, convert
the data into an OLAP format for efficient analysis.
The second part will be the reporting functionality. The goal is a generic
reporting tool that is useful to the end user in a very easy way. The data will be
published in reports in well-known formats (PDF, Microsoft Excel, CSV, etc.) and
stored directly in Alfresco as static documents organized in folders.
To achieve the above goal, Alfresco will be integrated with a powerful open source
data integration and reporting tool. The necessary data from the Alfresco
repository will be extracted, transformed, merged/integrated and loaded into the
data warehouse. The necessary schema transformations (for example, OLTP to
OLAP) will be applied to increase efficiency. The solution will be a scalable
and generic reporting system with an open window on the business intelligence
world. As such, the solution will also be suitable for publishing (static)
reports containing not only audit data coming from Alfresco but also Key
Performance Indicators (KPIs), analyses and dashboards coming from a complete
Enterprise Data Warehouse.
COMPANY PROFILE
Institute for Plasma Research (IPR) is an autonomous physics research institute
located in Gandhinagar, India. The institute is involved in research in aspects of
plasma science including basic plasma physics, research on magnetically confined
hot plasmas and plasma technologies for industrial applications. It is a large and
leading plasma physics organization in India. The institute is mainly funded
by the Department of Atomic Energy. IPR plays a major scientific and technical
role in India's partnership in the international fusion energy initiative ITER
(International Thermonuclear Experimental Reactor).
IPR is now internationally recognized for its contributions to fundamental and
applied research in plasma physics and associated technologies. It has a scientific
and engineering manpower of 200 with core competency in theoretical plasma
physics, computer modeling, superconducting magnets and cryogenics, ultra high
vacuum, pulsed power, microwave and RF, computer-based control and data
acquisition and industrial, environmental and strategic plasma applications.
The Centre of Plasma Physics - Institute for Plasma Research has active
collaboration with the following Institutes/ Universities:
Bhabha Atomic Research Centre, Bombay
Raja Ramanna Centre for Advanced Technology, Indore
IPP, Juelich, Germany; IPP, Garching, Germany
Kyushu University, Fukuoka, Japan
Physical Research Laboratory, Ahmedabad
National Institute for Interdisciplinary Science and Technology, Bhubaneswar
Ruhr University Bochum, Bochum, Germany
Saha Institute of Nuclear Physics, Calcutta
St. Andrews University, UK
Tokyo Metropolitan Institute of Technology, Tokyo
University of Bayreuth, Germany; University of Kyoto, Japan.
LIST OF FIGURES
1. MVC Architecture…………….……………………………………….Fig 1.1
2. Flowchart of the project……………………………………………….Fig 2.1
3. Gantt Chart…………………………………………………………….Fig 2.2
4. Alfresco Icon…………………………………………………………...Fig 4.1
5. Uses of Alfresco ECM…………………….………………………...…Fig 4.2
6. Alfresco Architecture……………………….…………………………Fig 4.3
7. Relational Diagrams (users, documents and folders)……………..…Fig 4.4
8. Relational Diagrams (permissions)…………………………………...Fig 4.5
9. Relational Diagrams (audits)………………………………………….Fig 4.6
10. Pentaho Data Integration Icon……………………………………..…Fig 5.1
11. Pentaho Icon…………………………………………………………...Fig 5.2
12. A.A.A.R. Icon………………………………………………………….Fig 5.3
13. Working of A.A.A.R…………………………………………………..Fig 5.4
14. PDI Repository Settings Step 1……...………………………………..Fig 5.5
15. PDI Repository Settings Step 2……...………………………………..Fig 5.6
16. PDI Repository Settings Step 3……...………………………………..Fig 5.7
17. PDI Repository Settings Step 4……...………………………………..Fig 5.8
18. PDI Repository Settings Step 5……...………………………………..Fig 5.9
19. PDI Repository Settings Step 6…….………………………………..Fig 5.10
20. PDI Repository Settings Step 7….....………………………………..Fig 5.11
21. PDI Repository Settings Step 8….....………………………………..Fig 5.12
22. Audit Data Mart……………………………………………………...Fig 5.13
23. Dimension Tables…………………………………………………….Fig 5.14
24. Document Information Transformation…………...……………….Fig 5.15
25. Document Permission Transformation…………...………………...Fig 5.16
26. Folder Information Transformation…………….....……………….Fig 5.17
27. Folder Permission Transformation………………...……………….Fig 5.18
28. User Information Transformation……………….....……………….Fig 5.19
29. Pentaho Reporting Tool Icon……………………………………..…..Fig 6.1
30. Document Information Report…….……..………...………………....Fig 6.2
31. Document Permission Report……...……..………...………………....Fig 6.3
32. Folder Information Report…….……..…..………...………………....Fig 6.4
33. Folder Permission Report………….……..………...………………....Fig 6.5
34. User Information Report…………..……..………...………………....Fig 6.6
35. Pentaho BI Server Icon……………………………………….……….Fig 7.1
36. Document Information Report…….……..………...………………....Fig 7.2
37. Document Permission Report……...……..………...………………....Fig 7.3
38. Folder Information Report…..…….……..………...………………....Fig 7.4
39. Folder Permission Report.……...….……..………...………………....Fig 7.5
40. User Information Report…………..……..………...………………....Fig 7.6
41. Scheduling of Transformations…...…………………………………..Fig 7.7
42. Login Step 1………………………………………………...………….Fig 9.1
43. Login Step 2………………………………………………...………….Fig 9.2
44. Login Step 3………………………………………………...………….Fig 9.3
45. View Reports Step 1………………...…………………………………Fig 9.4
46. View Reports Step 2………………...…………………………………Fig 9.5
47. Scheduling Page………...……………………………………………...Fig 9.6
48. Administration Page……………….…………………………………..Fig 9.7
LIST OF TABLES
1. Milestones and Deliverables……….……………………………… Table 2.1
2. Project Scheduling Table…………………………………………...Table 2.2
3. Test Case 1…………………………………………………………..Table 8.1
4. Test Case 2…………………………………………………………..Table 8.2
5. Scheduling Options………………..………………………………..Table 9.1
6. Scheduling Controls………………..……………………………….Table 9.2
7. Administration Options…………………………………………….Table 9.3
INTRODUCTION
1.1 PROJECT DETAILS
The Institute for Plasma Research has selected Alfresco, an Enterprise Content
Management (ECM) system, as an Electronic Document and Record Management System
(EDRMS). Alfresco does not have powerful reporting functionality. Thus, IPR requires a
reporting tool to
present the various details related to metadata of the documents and folders (folders are
used to organize documents), access control applied on documents and folders.
Additional analyses (like most active user, most active documents in the last week,
month, etc.) are required on the audit trail data generated by Alfresco. Some Key
Performance Indicators (KPIs) need to be generated for the document review and
approval process. The possibility to create and export reports in well-known formats
(PDF, Microsoft Excel, CSV, etc.) needs to be provided. There will be a central
administrator who can configure the access rights on the reports for end users.
Additionally, end users shall be able to subscribe to reports, schedule report
generation, and have reports sent via e-mail as attachments in their preferred format.
1.2 PURPOSE
This system needs to be developed to enhance the way of looking at a traditional
document management system and to make it more user-friendly. Along with all the
features, we need a few customizations for the better usability of the resources. With
these powerful reporting tools, it will become easy and secure to understand the files and
documents in the institute. Also, it would help in decision making so as to what steps
have to be taken on the basis of the reports generated.
1.3 SCOPE
The scope of the current project is just to implement a framework/deployment
architecture using BI toolset and test it by integrating with Alfresco. Alfresco data mart
will be created and used for developing analysis reports related to document management
system. The reports will be made available securely to the employees of the institute,
collaborators and contractors on internet.
In future, the generic reporting architecture implemented as part of this project will be
used and extended into a full Data Warehouse solution by integrating and merging other
data management tools of IPR. The full DW solution is out of the scope of this project.
1.4 OBJECTIVE
The objective of this project is to ease the visibility of the document management system
and enhance decision making. Alfresco is a powerful content management system.
Unfortunately, the need for powerful reporting is still there, and most of the available
answers are tricky solutions, quite hard to manage and scale. To achieve the above goal,
Alfresco will be
integrated with a powerful open source data integration and reporting tool. The necessary
data from the Alfresco Repository will be extracted, transformed, merged/integrated and
loaded in the data warehouse. The necessary schema transformation (for example OLTP
to OLAP) will be applied to increase efficiency. The solution will be a scalable and
generic reporting system with an open window on the business intelligence world.
As such, the solution will also be suitable for publishing (static) reports containing
not only audit data coming from Alfresco but also Key Performance Indicators (KPIs),
analyses and dashboards coming from a complete Enterprise Data Warehouse.
1.5 TECHNOLOGY AND LITERATURE REVIEW
1.5.1 ALFRESCO ECM
An open source, Java-based Enterprise Content Management (ECM) system named
Alfresco is selected as the document repository. It uses the MVC architecture.
Model–view–controller (MVC) is a software pattern for implementing user interfaces. It
divides a
given software application into three interconnected parts, so as to separate internal
representations of information from the ways that information is presented to or accepted
from the user.
Model: It consists of application data, business rules, logic and functions. Here, XML is
used for the same.
View: It is the output representation of information. Here, FTL is used for the same.
Controller: It accepts input and converts it to commands for the model or view.
Figure 1.1 MVC Architecture
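The MVC separation described above can be illustrated with a minimal sketch. This is a generic illustration, not Alfresco code: in Alfresco the model is defined in XML and the views are FreeMarker (FTL) templates, while the class and method names below are purely hypothetical.

```python
# Minimal illustration of the MVC separation: data and rules in the
# Model, output rendering in the View, input handling in the Controller.

class Model:
    """Holds application data and business rules."""
    def __init__(self):
        self.documents = []

    def add_document(self, name):
        if not name:
            raise ValueError("document name required")
        self.documents.append(name)

class View:
    """Renders the model's state for the user."""
    def render(self, model):
        return "Documents: " + ", ".join(model.documents)

class Controller:
    """Translates user input into operations on the model."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def handle_upload(self, name):
        self.model.add_document(name)
        return self.view.render(self.model)

controller = Controller(Model(), View())
print(controller.handle_upload("report.pdf"))   # Documents: report.pdf
```

Because the three parts only talk through narrow interfaces, the view (here a plain string) could be swapped for an FTL template without touching the model.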
1.5.2 PENTAHO PLATFORM
Pentaho is a company that offers Pentaho Business Analytics, a suite of open
source Business Intelligence (BI) products which provide data integration, OLAP
services, reporting, dashboards, data mining and ETL capabilities. Pentaho was founded
in 2004 and is headquartered in Orlando, Florida, USA.
Pentaho software consists of a suite of analytics products called Pentaho Business
Analytics, providing a complete analytics software platform. This end-to-end solution
includes data integration, metadata, reporting, OLAP analysis, ad-hoc query, dashboards,
and data mining capabilities. The platform is available in two offerings: a community
edition (CE) and an enterprise edition (EE).
PROJECT MANAGEMENT
2.1 FEASIBILITY STUDY
A feasibility study includes an analysis and evaluation of a proposed project to determine
if it is technically feasible, is feasible within the estimated cost, and will be profitable.
The following software has to be installed for the project:
1. Alfresco Enterprise Content Management
2. PostgreSQL and SQuirreL
3. Pentaho Data Integration Tool (K.E.T.T.L.E.)
4. Alfresco Audit Analysis and Reporting Tool (A.A.A.R.)
5. Pentaho Reporting Tool
6. Pentaho BI Server
The study assures that the hardware cost required for one database server plus two web
servers is acceptable and the 500 GB of file storage for the final product is feasible.
2.2 PROJECT PLANNING
2.2.1 Project Development Approach
We have used the Agile methodology. After the feasibility study, the first thing to be done
was to create a basic flowchart charting out the flow of the project so as to create a mind
map. The base database system is Alfresco, from which we need to load tables using
PostgreSQL or SQuirreL. The number of tables is compressed to create a staging data
warehouse. After the transformations on these tables using Pentaho Data Integration tool,
reports are created using Pentaho Reporting on the BI server, according to the given
requirements of the project.
Figure 2.1 Flowchart of the Project
Once, the flowchart was made, we proceeded towards the development part keeping the
flowchart in mind. Thus, we started from studying Alfresco Enterprise Content
Management System and then moved on to Pentaho Tools. We also installed PostgreSQL
and SQuirreL so as to deal with the queries.
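The flow just described, extracting from the Alfresco database, transforming into a staging warehouse, and loading for reporting, can be sketched as a toy ETL pipeline. Field names and sample rows here are hypothetical, chosen only for illustration; in the actual project this work is done by Pentaho Data Integration.

```python
# A toy sketch of the extract-transform-load flow used in this project.

def extract():
    # In the real project this reads Alfresco tables via JDBC/PostgreSQL;
    # here we return hard-coded sample rows.
    return [{"node_id": 1, "name": " Report.PDF ", "creator": "swati"},
            {"node_id": 2, "name": "plan.docx", "creator": "farida"}]

def transform(rows):
    # Cleanse: trim whitespace, normalise case, keep only needed columns.
    return [{"doc_id": r["node_id"],
             "doc_name": r["name"].strip().lower(),
             "created_by": r["creator"]} for r in rows]

def load(rows, warehouse):
    # Append cleansed rows to the staging "data warehouse" (a list here).
    warehouse.extend(rows)
    return warehouse

staging = load(transform(extract()), [])
print(staging[0]["doc_name"])   # report.pdf
```

The same three stages appear later in the report as the transformation phase (Spoon transformations) and the loading of the audit data mart.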
2.2.2 Project Plan
1. Gather the definition.
2. Check whether the definition is feasible or not in given deadline.
3. Requirement gathering.
4. Study and analysis on gathered requirements.
5. Transformation Phase.
6. Reporting Phase.
7. Deployment.
2.2.3 Milestones and Deliverables
Table 2.1 Milestones and Deliverables
Phase: Abstract and System Feasibility Study
Deliverables: Had complete understanding of the flow of the project
Purpose: To be familiar with the flow of the project

Phase: Requirement Gathering, Software Installation and Understanding of Technology
Deliverables: Had studied the ECM, its architecture and how the data is stored in the Alfresco repository
Purpose: Getting familiar with the Alfresco platform

Phase: Study of the Platform and its Tools
Deliverables: Had studied and used the three tools, namely the Pentaho Data Integration Tool, Pentaho Report Designer and Pentaho BI Server
Purpose: Better understanding of the Pentaho platform and all the tools and plug-ins associated with it

Phase: Transformation Phase
Deliverables: Completed the transformation phase with the help of A.A.A.R., developed some custom ETL and scheduled the transformation jobs to run during nights
Purpose: To make the staging data warehouse

Phase: Reporting Phase
Deliverables: Made the reports according to the user's requirements
Purpose: To complete the reporting phase

Phase: Deployment
Deliverables: Published the reports on the server in different output types, like PDF, CSV, etc.
Purpose: Deploy on the Web and hence complete the project
2.2.4 Project Scheduling
In project management, a schedule is a listing of a project's milestones, activities,
and deliverables, usually with intended start and finish dates.
Table 2.2 Project Scheduling Table
Abstract and Feasibility Study
Requirement Gathering
Study of Database Management System
Study of platform and associated tools
Transformation Phase
Reporting Phase
Deployment
Figure 2.2 Gantt Chart
SYSTEM REQUIREMENT STUDY
3.1 USER CHARACTERISTICS
This system is made available on the web, so it can be accessed from anywhere. The
users will be scientists, researchers, engineers and other employees of the institute. They
log in with their respective credentials.
3.2 HARDWARE AND SOFTWARE REQUIREMENTS
3.2.1 Server and Client Side Hardware Requirements:
RAM: 4GB
Hard disk: 40GB
Processor: 2.4GHz
File Storage: 500GB
3.2.2 Server and Client Side Software Requirements:
Windows or Linux based system
PostgreSQL Database
SQuirreL Database Client tool
Alfresco ECM
Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)
Alfresco Audit Analysis and Reporting tool (A.A.A.R.)
Notepad++
3.3 CONSTRAINTS
3.3.1 Regulatory Policies
Regulatory policies, or mandates, limit the discretion of individuals and agencies, or
otherwise compel certain types of behavior. These policies are generally thought to be
best applied when good behavior can be easily defined and bad behavior can be easily
regulated and punished through fines or sanctions. IPR is very strict about its policies and
ensures that all the employees follow it properly.
3.3.2 Hardware Limitations
To ensure the smooth working of the system, we need to meet the minimum hardware
requirements: at least 2GB RAM, a 40GB hard disk and a 2.4 GHz processor. All
these requirements are readily available. Hence, there are not really any hardware
limitations.
3.3.3 Interfaces to Other Applications
ETL tools of BI suites generally support a number of standards-based protocols,
including ODBC, JDBC, REST, web scripts, FTP and many more, for extracting data
from multiple sources. It is easy to integrate any data management application using
supported input protocols. We have used CMIS (Content Management Interoperability
Services) and the JDBC protocol for Alfresco data integration. The published reports will
be integrated back into Alfresco using the HTTP protocol. Single sign-on will be
implemented by the IT department to provide transparent access to reports from Alfresco
or any other web-based tools.
3.3.4 CMIS
CMIS (Content Management Interoperability Services) is an OASIS standard designed
for the ECM industry. It enables access to any content management repository that
implements the CMIS standard. We can consider using CMIS if an application needs
programmatic access to the content repository.
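CMIS defines, among other things, an SQL-like query language over the repository. The sketch below builds a typical CMIS QL statement and a browser-binding style request URL; the `folder_id`, base URL and helper functions are illustrative assumptions, and the exact endpoint path varies by Alfresco version.

```python
# Building a CMIS Query Language statement and a browser-binding
# style GET request URL (cmisselector=query with a q parameter).

from urllib.parse import urlencode

def build_cmis_query(folder_id):
    # Select documents inside a given folder, newest first.
    return ("SELECT cmis:name, cmis:creationDate "
            "FROM cmis:document "
            f"WHERE IN_FOLDER('{folder_id}') "
            "ORDER BY cmis:creationDate DESC")

def build_query_url(base_url, statement):
    # Encode the statement as query parameters on the repository URL.
    params = urlencode({"cmisselector": "query", "q": statement})
    return f"{base_url}?{params}"

stmt = build_cmis_query("workspace://SpacesStore/1234")
url = build_query_url("http://localhost:8080/alfresco/cmis", stmt)
print(stmt)
```

The repository answers such a query with the matching documents' properties, which is exactly the kind of programmatic access referred to above.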
3.3.5 Parallel Operations
This is a document management system where around 300 employees will work
concurrently. They can upload a document, review it, modify it, start a workflow and
even delete it. Parallel operations include allowing more than a single employee to read a
document. Workflows can be started on any document; also, any document can be in any
number of workflows. Parallel editing of a document will be restricted by providing
check-in and check-out functionality.
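The check-out/check-in restriction mentioned above can be modelled with a small sketch. This illustrates only the idea of exclusive working copies; Alfresco implements this inside the repository, and the class and method names here are hypothetical.

```python
# Modelling check-out/check-in: a document checked out by one user
# cannot be checked out again until its holder checks it back in.

class CheckoutRegistry:
    def __init__(self):
        self._locks = {}          # doc_id -> user holding the working copy

    def check_out(self, doc_id, user):
        if doc_id in self._locks:
            return False          # someone else is already editing
        self._locks[doc_id] = user
        return True

    def check_in(self, doc_id, user):
        if self._locks.get(doc_id) != user:
            return False          # only the holder may check in
        del self._locks[doc_id]
        return True

reg = CheckoutRegistry()
print(reg.check_out("doc1", "farida"))   # True
print(reg.check_out("doc1", "swati"))    # False: doc1 is checked out
```

Reads are unaffected by the lock, which matches the report's point that many employees may read a document while only one edits it.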
3.3.6 Reliability Requirements
Quality hardware, software and frameworks with valid licenses are required for better
reliability.
3.3.7 Criticality of the Application
Criticality of the module was one of the main constraints. The system was being
developed for users who were mainly employees of the government sector. They had
certain rigid requirements which had to be taken care of during development. Any change
in the pattern of their workflow would lead to extremely critical conditions. Thus, this
was a matter of concern and served as one of the deep-rooted constraints.
3.3.8 Safety and Security Considerations
The system provides tight security for user accounts. It is secured by a password
mechanism; passwords are encrypted and stored in the database. Also, the repository is
accessible for modifications only to some privileged users.
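The idea of storing passwords in a protected form rather than as plain text can be sketched with the standard library. This uses salted PBKDF2 purely to illustrate the principle; it is not the exact mechanism Alfresco uses internally.

```python
# Salted password hashing: store (salt, digest) instead of the password,
# and verify by recomputing the digest with the stored salt.

import hashlib
import hmac
import os

def hash_password(password, salt=None):
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    _, digest = hash_password(password, salt)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(digest, stored_digest)

salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))   # True
print(verify_password("wrong", salt, stored))    # False
```

Even if the database is leaked, the attacker sees only salts and digests, not the original passwords.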
ALFRESCO ECM SYSTEM
4.1 INTRODUCTION
Figure 4.1 Alfresco Icon
Alfresco is a free enterprise content management system for both Windows and Linux
operating systems, which manages all the content within an enterprise and provides
services to manage this content.
It comes in three flavors:-
Community edition – It is a free software with some limitations. No clustering
feature is present. (We have used community edition of Alfresco for this project
since we just need to perform ETL logic on the database and not use the advanced
functionalities.)
Enterprise edition – It is commercially licensed and suitable for users who require
a higher degree of functionality.
Cloud edition - It is a SaaS (Software as a Service) version of Alfresco.
We would be using Alfresco database as our base database from where we want to
extract information and create a warehouse. For further transformation purpose, we
would be using SQuirreL and K.E.T.T.L.E a.k.a. Pentaho Data Integration tool.
4.2 ALFRESCO OVERVIEW
There are various ways in which Alfresco can be used for storing files and folders, and it
can also be used by different systems. It is basically a repository: a central location where
data are stored and managed.
Few of the ways in which Alfresco can be used are:
Figure 4.2 Uses of Alfresco ECM
Alfresco ECM is a useful tool to store files and folders of different types. Few of the uses
of Alfresco are:-
Document Management
Records Management
Shared drive replacement
Enterprise portals and intranets
Web Content Management
Knowledge Management
Information Publishing
Case Management
4.3 ARCHITECTURE
Alfresco has a layered architecture with mainly three parts:-
1. Alfresco Client.
2. Alfresco Content Application Server
3. Physical Storage
4.3.1 Client
Alfresco offers two primary web-based clients: Alfresco Share and Alfresco Explorer.
Alfresco Share can be deployed to its own tier separate from the Alfresco content
application server. It focuses on the collaboration aspects of content management and
streamlining the user experience. Alfresco Share is implemented using Spring Surf and
can be customized without JSF knowledge.
Alfresco Explorer is deployed as part of the Alfresco content application server. It is a
highly customizable power-user client that exposes all features of the Alfresco content
application server and is implemented using Java Server Faces (JSF).
Clients also exist for portals, mobile platforms, Microsoft Office, and the desktop. A
client often overlooked is the folder drive of the operating system, where users share
documents through a network drive. Alfresco can look and act just like a folder drive.
4.3.2 Server
The Alfresco content application server comprises a content repository and value-added
services for building ECM solutions. Two standards define the content repository: CMIS
(Content Management Interoperability Services) and JCR (Java Content Repository).
These standards provide a specification for content definition and storage, content
retrieval, versioning, and permissions. Complying with these standards provides a
reliable, scalable, and efficient implementation.
The Alfresco content application server provides the following categories of services
built upon the content repository:
1. Content services (transformation, tagging, metadata extraction)
2. Control services (workflow, records management, change sets)
3. Collaboration services (social graph, activities, wiki)
Clients communicate with the Alfresco content application server and its services through
numerous supported protocols. HTTP and SOAP offer programmatic access while CIFS,
FTP, WebDAV, IMAP, and Microsoft SharePoint protocols offer application access. The
Alfresco installer provides an out-of-the-box prepackaged deployment where the
Alfresco content application server and Alfresco Share are deployed as distinct web
applications inside Apache Tomcat.
Figure 4.3 Alfresco Architecture
At the core of the Alfresco system is a repository, supported by a server that persists
content, metadata, associations, and full-text indexes. Programming interfaces support
multiple languages and protocols, upon which developers can create custom applications
and solutions. Out-of-the-box applications provide standard solutions such as document
management and web content management.
4.4 DATA STORAGE IN ALFRESCO
There are 97 tables in the database, divided mainly into two parts: the Alfresco tables
and the activity workflow tables. The Alfresco tables are further divided into three
parts: nodes, access and properties.
1. Node tables form the parent of the database and store all identity numbers.
2. Access tables deal with the security aspects of Alfresco, such as permissions and
last-modification dates.
3. Property tables store information about the kind of data stored: its size, type,
range, etc.
4.5 RELATIONSHIP DIAGRAMS
After studying the tables, we created the relationship diagram of the tables using
SQuirreL.
Since the relational diagram for the Alfresco system comprises 97 tables, we selected
the vital ones:
alf_node – holds the identity of the other tables.
alf_qname – defines a valid identifier for each attribute.
alf_node_properties – connects the node and qname tables and stores all properties of
each node id.
alf_access_control_list – specifies who can do what with an object in the repository,
i.e. the permission information.
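To make these relationships concrete, the sketch below reproduces the join between the vital tables on a simplified in-memory schema. The column names are illustrative stand-ins, not the exact Alfresco DDL.

```python
import sqlite3

# Simplified stand-in schema for the vital Alfresco tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE alf_node (id INTEGER PRIMARY KEY, uuid TEXT);
CREATE TABLE alf_qname (id INTEGER PRIMARY KEY, local_name TEXT);
CREATE TABLE alf_node_properties (
    node_id  INTEGER REFERENCES alf_node(id),
    qname_id INTEGER REFERENCES alf_qname(id),
    string_value TEXT
);
""")
cur.execute("INSERT INTO alf_node VALUES (1, 'doc-uuid-1')")
cur.execute("INSERT INTO alf_qname VALUES (10, 'name')")
cur.execute("INSERT INTO alf_node_properties VALUES (1, 10, 'report.pdf')")

# alf_node_properties links each node to its qname-identified properties.
cur.execute("""
SELECT n.uuid, q.local_name, p.string_value
FROM alf_node n
JOIN alf_node_properties p ON p.node_id = n.id
JOIN alf_qname q ON q.id = p.qname_id
""")
row = cur.fetchone()
print(row)  # ('doc-uuid-1', 'name', 'report.pdf')
```

The same join shape, applied to the real tables, recovers every property of every node together with its human-readable attribute name.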
Figure 4.4 Relation Diagrams for users, documents and folders
Figure 4.5 Relational Diagrams for permissions
Figure 4.6 Relational Diagram for audits
TRANSFORMATION PHASE
5.1 INTRODUCTION
There are 97 tables in the Alfresco ECM system. To create a staging data warehouse, we
first have to perform E.T.L. logic, i.e. Extract, Transform and Load.
In computing, ETL refers to a process in database usage, and especially in data
warehousing, that:
1. Extracts data from homogeneous or heterogeneous data sources
2. Transforms the data into the proper format or structure for querying and analysis
3. Loads it into the final target (a database; more specifically an operational data
store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a
transformation process runs while the data is being pulled, processing the records
already received and preparing them for loading; as soon as some data is ready to be
loaded into the target, the loading kicks off without waiting for the previous phases
to complete.
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer
hardware. The disparate systems containing the original data are frequently managed
and operated by different employees. In our project, though, there is only one source
from which the data is extracted: Alfresco.
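As a minimal illustration of the three phases, the sketch below implements a toy extract-transform-load pipeline. The record fields and the in-memory staging list are hypothetical stand-ins for the real source and Data Mart.

```python
# Toy E.T.L. pipeline: extract rows from a single source, transform
# them into the target shape, and load them into a staging list that
# stands in for the Data Mart.

def extract(source_rows):
    """Extract: pull raw records from the (single) source system."""
    for row in source_rows:
        yield row

def transform(row):
    """Transform: normalise the record into the warehouse structure."""
    return {"user": row["user"].strip().lower(), "action": row["action"]}

def load(target, row):
    """Load: append the prepared record to the target store."""
    target.append(row)

source = [{"user": " Admin ", "action": "login"},
          {"user": "swati", "action": "createContent"}]
staging = []
for raw in extract(source):
    load(staging, transform(raw))

print(staging[0])  # {'user': 'admin', 'action': 'login'}
```

In the real project, Pentaho Data Integration plays all three roles, with Alfresco as the single source.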
5.2 PENTAHO DATA INTEGRATION TOOL
5.2.1 Introduction
Pentaho Data Integration (or Kettle) delivers powerful extraction, transformation,
and loading (ETL) capabilities, using a metadata-driven approach. It prepares and
blends data to create a complete picture of the business that drives actionable
insights. The complete data integration platform delivers accurate, “analytics ready”
data to end users from any source.
Figure 5.1 Pentaho Data Integration Icon
In particular, Pentaho Data Integration is used to: extract Alfresco audit data into the Data
Mart and create the defined reports uploading them back to Alfresco.
5.2.2 Why Pentaho?
Figure 5.2 Pentaho Icon
5.2.2.1 Pentaho vs Jaspersoft vs BIRT
Pentaho and Jaspersoft both provide the advantage of being cost effective, but they
differ in features. Although Jaspersoft's report designer is comparatively better than
Pentaho Report Designer, the dashboard capabilities of Pentaho are better in terms of
functionality. This is because dashboard functionality is present only in the Enterprise
edition of Jaspersoft, whereas in Pentaho it is accessible in the Community edition too.
When it comes to Extract, Transform and Load (ETL) tools, the Pentaho Data Integrator
is comparatively better, since Jaspersoft falls short on a few functions. For OLAP
analysis, Pentaho's Mondrian engine has a stronger case than Jaspersoft's. Pentaho users
also have a huge set of choices in a plugin marketplace similar to the app stores of iOS
and Android. To sum up, Jaspersoft's focus is more on reporting and analysis, while
Pentaho's focus is on data integration, ETL and workflow automation.
BIRT has also emerged as an important business intelligence tool for those who are well
versed in Java. BIRT is an Eclipse-based open source reporting system for web
applications, especially those based on Java and Java EE; it consists of a report
designer based on Eclipse and a runtime component that can be added to the app server.
In terms of basic functionality, BIRT is on par with Pentaho and Jaspersoft, with
perhaps a slight advantage since it is based on Eclipse. As a typical BI tool it is also
expected to cover the common chart types. Although BIRT covers most of them, it falls
short on chart types like Ring, Waterfall, Step Area, Step, Difference, Thermometer and
Survey Scale, where Pentaho fills the gaps.
5.2.2.2 Conclusion
Unlike the previous two tools, Pentaho is a complete BI suite covering operations from
reporting to data mining. Its key component is Pentaho Reporting, which has a rich,
enterprise-friendly feature set. Its BI Server, a J2EE application, also provides an
infrastructure to run and view reports through a web-based user interface. All three of
these open source business intelligence and reporting tools provide a rich feature set
ready for enterprise use; it is up to the end user to do a thorough comparison and
select one of them. Major differences can be found in report presentation, with a focus
on web or print, and in the availability of a report server. Pentaho distinguishes
itself by being more than just a reporting tool, with a full suite of components (data
mining and integration).
Among organizations adopting Pentaho, one of the perceived advantages is its low
integration time and infrastructure cost compared to SAP BIA and SAS BIA, which are
among the big players in business intelligence. In addition, the huge community,
available 24/7 through active support forums, allows Pentaho users to discuss challenges
and have their questions answered while using the tool. Its unlimited visualizations and
data sources can handle any kind of data, coupled with a good tool set that has wide
applicability beyond just the base product.
5.2.3 COMPONENTS OF PENTAHO
Kettle is a set of tools and applications which allows data manipulation across multiple
sources. The main components of Pentaho Data Integration are:
Spoon – It is a graphical tool that makes ETL transformations easy to design. It
performs the typical data flow functions like reading, validating, refining,
transforming, and writing data to a variety of different data sources and destinations.
Transformations designed in Spoon can be run with Pan and Kitchen.
Pan – Pan is an application dedicated to run data transformations designed in Spoon.
Chef – It is a tool to create jobs which automate the database update process in a
complex way.
Kitchen – It is an application which helps execute the jobs in a batch mode, usually
using a schedule which makes it easy to start and control the ETL processing.
Carte – It is a web server which allows remote monitoring of the running Pentaho Data
Integration ETL processes through a web browser.
5.3 ALFRESCO AUDIT ANALYSIS AND REPORTING TOOL
5.3.1 Introduction
Alfresco is one of the most widely used open source content management systems. Though
it is not part of its core functionality, it is crucial to get metrics out of the
Alfresco system.
Figure 5.3 A.A.A.R. Icon
To that end, a full-fledged audit layer was built on top of Alfresco using Pentaho. The
principle is to build a data mart properly optimized for the information extracted from
the system and to do all the analytics and discovery on top of it. This requires an ETL
tool and, once the data mart is loaded, Pentaho for reporting and exploration on top of
that data warehouse. This in-between tool is called AAAR – Alfresco Audit Analysis and
Reporting.
5.3.2 Working and Installation of A.A.A.R.
The Alfresco Content Management System can be seen as a primary source that generates
only raw data. Pentaho, on the other hand, is a pure BI environment consisting of
suitable integration and reporting tools.
Thus, A.A.A.R. extracts audit data from the Alfresco E.C.M., stores the data in the Data
Mart, creates reports in well-known formats and publishes them again in the Alfresco
E.C.M.
Figure 5.4 Working of A.A.A.R.
Alfresco E.C.M. is, at the same time, source and target of the flow. As source of the flow,
Alfresco E.C.M. is enabled with the audit service to track all the activities with detailed
information about who has done what on the system, and when. Logins (successful or
failed), creation of content, creation of folders, and adding or removing of properties
or aspects are only some examples of what the audit service tracks.
5.3.2.1 Prerequisites
1. Alfresco E.C.M.
2. PostgreSQL/MySQL
3. Pentaho Data Integration Tool
4. Pentaho Report Designer Tool
5.3.2.2 Enabling Alfresco Audit Service
The very first task is to activate the audit service in Alfresco by performing the
following actions:
1. Stop Alfresco.
2. In '<Alfresco>/tomcat/shared/classes/alfresco-global.properties' append:
# Alfresco Audit service
audit.enabled=true
audit.alfresco-access.enabled=true
# Alfresco FTP service
## ATTENTION: Skip these two lines if FTP is already enabled!
ftp.enabled=true
ftp.port=8082
3. Start Alfresco.
4. Log in to Alfresco to generate the very first audit data.
5.3.2.3 Data Mart Creation and Configuration
1. Open a terminal
2. For the PostgreSQL platform use:
cd <PostgreSQL bin>
psql -U postgres -f "<AAAR folder>/AAAR_DataMart.sql"
(use 'psql.exe' on Windows and './psql' on Linux-based platforms)
3. Exit
4. Extract 'reports.zip' into the 'data-integration' folder. 'reports.zip' contains 5
files with the 'prpt' extension, each containing one Pentaho Report Designer report. By
default, and to keep report production simple, they are saved in the default folder
'data-integration'.
5. Update the 'dm_dim_alfresco' table with the proper environment settings. Each row of
the table represents one Alfresco installation; for that reason the table is defined
with a unique row by default, as described below.
desc with value 'Alfresco'.
login with value 'admin'.
password with value 'admin'.
url with value 'http://localhost:8080/alfresco/service/api/audit/query/alfresco-access?verbose=true&limit=100000'.
is_active with value 'Y'.
6. Update the 'dm_reports' table with your target settings.
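The audit query URL stored in 'dm_dim_alfresco' can also be assembled programmatically. The sketch below rebuilds it from the default host, port and parameters shown above; adjust them for your own installation.

```python
from urllib.parse import urlencode

# Default endpoint from the configuration above; host and port are
# the Alfresco defaults and may differ in a real installation.
BASE = "http://localhost:8080/alfresco/service/api/audit/query/alfresco-access"

def audit_url(verbose=True, limit=100000):
    """Build the audit query URL with the given query parameters."""
    params = {"verbose": str(verbose).lower(), "limit": limit}
    return BASE + "?" + urlencode(params)

print(audit_url())
# http://localhost:8080/alfresco/service/api/audit/query/alfresco-access?verbose=true&limit=100000
```

A helper like this makes it easy to generate one row per Alfresco installation when more than one source is audited.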
5.3.2.4 PDI Repository Settings
The third task is to set the Pentaho Data Integration Jobs properly.
1. Open a terminal
2. For the PostgreSQL platform use: cd <PostgreSQL bin>
psql -U postgres -f "<AAAR folder>/AAAR_Kettle.sql"
(use 'psql.exe' on Windows and './psql' on Linux-based platforms)
3. Exit
4. To set the Pentaho Data Integration repository:
i. Open a new terminal. cd <data-integration>
ii. Launch ‘Spoon.bat’ if you are on Windows platform or ‘./Spoon.sh’ if you are
on Linux based platforms.
iii. Click on the green plus to add a new repository and define a new repository
connection in the database.
Figure 5.5 Step 1
iv. Add a new database connection to the repository.
Figure 5.6 Step 2
v. If you chose the PostgreSQL platform, set the parameters shown in the image below.
At the end, push the Test button to check the database connection.
Figure 5.7 Step 3
vi. Set the ID and Name fields and press the 'OK' button. Be careful not to push the
'create or upgrade' button, otherwise the E.T.L. will be damaged.
Figure 5.8 Step 4
vii. Connect with the login 'admin' and password 'admin' to test the connection.
Figure 5.9 Step 5
viii. If everything succeeds, you see the Pentaho Data Integration (Kettle) panel.
ix. From the Pentaho Data Integration panel, click on Tools -> Repository -> Explore.
Figure 5.10 Step 6
x. Click on the 'Connections' tab and edit (with the pencil at the top right) the
AAAR_DataMart connection. The image below shows the PostgreSQL case, but with MySQL it
is exactly the same.
Figure 5.11 Step 7
xi. Modify the parameters and click on the Test button to check. If everything succeeds,
you can close everything. The image below shows the PostgreSQL case, but with MySQL it
is exactly the same.
Figure 5.12 Step 8
5.3.2.5 First Import
Now you are ready to get the audit data into the Data Mart, create the reports, and
publish them to Alfresco.
Open a terminal:
cd <data-integration>
kitchen.bat /rep:"AAAR_Kettle" /job:"Get all" /dir:/Alfresco /user:admin /pass:admin /level:Basic
kitchen.bat /rep:"AAAR_Kettle" /job:"Report all" /dir:/Alfresco /user:admin /pass:admin /level:Basic
Finally, you can access Alfresco and look in the repository root, where the reports are
uploaded by default.
5.3.3 Audit Data Mart
On the other side of the represented flow, there is a database storing the
extracted audit data, organized in a specific Audit Data Mart. A Data Mart is a
structure that is usually oriented to a specific business line or team and, in this
case, represents the audited actions in the Alfresco E.C.M.
Figure 5.13 Audit Data Mart
5.3.4 Dimension Tables
The implemented Data Mart develops a single star schema with only one measure (the
number of audited actions) and the dimensions listed below:
1. Alfresco instances to manage multiple sources of auditing data.
2. Alfresco users with a complete name.
3. Alfresco contents complete with the repository path.
4. Alfresco actions (login, failedLogin, read, addAspect, etc.).
5. Date of the action. Groupable in day, month and year.
6. Time of the action. Groupable in minute and hour.
Figure 5.14 Dimension Tables
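The star schema above can be queried as in the following sketch, which computes the single measure (the number of audited actions) grouped by two of the dimensions. The table and column names are simplified stand-ins, not the actual Data Mart DDL.

```python
import sqlite3

# Simplified star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_user   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_action (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_audit (user_id INTEGER, action_id INTEGER, day TEXT);
INSERT INTO dim_user   VALUES (1, 'admin');
INSERT INTO dim_action VALUES (1, 'login'), (2, 'read');
INSERT INTO fact_audit VALUES (1, 1, '2014-04-01'),
                              (1, 1, '2014-04-01'),
                              (1, 2, '2014-04-01');
""")

# The single measure is COUNT(*): the number of audited actions,
# grouped here by the user and action dimensions.
cur.execute("""
SELECT u.name, a.name, COUNT(*) AS actions
FROM fact_audit f
JOIN dim_user u   ON u.id = f.user_id
JOIN dim_action a ON a.id = f.action_id
GROUP BY u.name, a.name
ORDER BY actions DESC
""")
rows = cur.fetchall()
print(rows)  # [('admin', 'login', 2), ('admin', 'read', 1)]
```

Grouping by the date or time dimensions instead gives the per-day and per-hour activity views used by the reports.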
5.4 TRANSFORMATIONS USING SPOON
Spoon is the DI design tool component. The DI Server is a core component that executes
data integration jobs and transformations using the Pentaho Data Integration engine. It
also provides services that allow you to schedule and monitor scheduled activities.
Drag elements onto the Spoon canvas, or choose from a rich library of more than 200
pre-built steps to create a series of data integration processing instructions.
5.5 EXAMPLE TRANSFORMATIONS
A few of the transformations we created using Spoon are listed below:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 5.15 Document Information Transformation
Figure 5.16 Document Permission Transformation
Figure 5.17 Folder Information Transformation
Figure 5.18 Folder Permission Transformation
Figure 5.19 User Information Transformation
REPORTING PHASE
6.1 WHAT IS A REPORT?
In its most basic form, a report is a document that contains information for the reader.
When speaking of computer-generated reports, these documents refine data from various
sources into a human-readable form. Report documents make it easy to distribute specific
fact-based information throughout the company. Reports are also used by management in
decision making.
6.2 PENTAHO REPORT DESIGNER TOOL
6.2.1 Introduction
Pentaho Reporting is a suite of tools for creating pixel-perfect reports. With Pentaho
Reporting, we are able to transform data into meaningful information. You can create
HTML, Excel, PDF, Text or printed reports. If you are a developer, you can also produce
CSV and XML reports to feed other systems.
Figure 6.1 Pentaho Reporting Tool Icon
It helps transform data into meaningful information tailored to your audience, with a
suite of open source tools that let you create pixel-perfect reports in PDF, Excel,
HTML, Text, Rich Text File, XML and CSV formats.
6.2.2 Working of the Pentaho Report Designer Tool
Once the transformations are completed using Kettle, we can import the transformed data
from the Data Mart into the Pentaho Report Designer tool with the help of SQL. The
Pentaho Report Designer tool has a large selection of elements (text fields, labels,
etc.) and various GUI representation techniques like pie charts, tables, graphs etc.
with which we can create our reports.
6.3 EXAMPLE REPORTS
Based on the transformations done using Spoon, we created reports for the following
requirements using Pentaho Report Designer:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 6.2 Document Information Report
Figure 6.3 Document Permission Report
Figure 6.4 Folder Information Report
Figure 6.5 Folder Permission Report
Figure 6.6 User Information Report
PUBLISHING PHASE
7.1 INTRODUCTION
After the reports are made using the design tool, we need to publish them on the server.
The Pentaho BI Server, or BA Platform, allows you to access business data in the form of
dashboards, reports or OLAP cubes via a convenient web interface. Additionally, it
provides an interface to administer your BI setup and schedule processes. Different
output types are also available, such as PDF, HTML and CSV.
7.2PENTAHO BI SERVER
7.2.1 Introduction
It is commonly referred to as the BI Platform, and was recently renamed the Business
Analytics Platform (BA Platform). It makes up the core software piece that hosts content
created both in the server itself, through plug-ins, and in files published to the
server from the desktop applications. It includes features for managing security,
running reports, displaying dashboards, report bursting, scripted business rules, OLAP
analysis and scheduling out of the box.
Figure 7.1 Pentaho BI Server Icon
The commercial plug-ins from Pentaho expand the out-of-the-box features. A few open
source plug-in projects also expand the capabilities of the server. The Pentaho BA
Platform runs in the Apache Tomcat Java application server. It can be embedded into
other Java application servers.
7.2.2 Example Published Reports
From the reports we have created, the following can be deployed on the web:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 7.2 Document Information Published Report
Figure 7.3 Document Permission Published Report
Figure 7.4 Folder Information Published Report
Figure 7.5 Folder Permission Published Report
Figure 7.6 User Information Published Report
7.3 SCHEDULING OF TRANSFORMATIONS
Once the project has been completed, for real-time usage the data warehouse needs to be
updated at regular intervals. For that purpose, we have to schedule our project so that
it is updated every day, reflecting the changes made in the last 24 hours.
There are three ways to perform scheduling:
1. Using the schedule option from the action menu in Spoon.
2. Using the start element in job (.kjb, Kettle job) files.
3. Using the task scheduler.
Usually the first method is preferred in industry, but as we are working on the
Community edition, the scheduling option is not provided. The second method applies only
to jobs and does not update transformations, so it was not suitable either. We therefore
scheduled the project using the task scheduler. All the transformations have been
scheduled to run daily at 11:00 am.
The project has been deployed on the web and submitted to our external guide. It will be
used further by IPR on a web server for real-time usage.
Figure 7.7 Scheduling of Transformations
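For illustration, the command line that such a scheduled task runs can be assembled as in the sketch below. The repository name, job names and credentials mirror the earlier examples and are placeholders for the real installation.

```python
# Sketch of the kitchen.bat invocation that the daily 11:00 task runs.
# All values below are the example defaults from the installation
# steps, not production credentials.

def kitchen_command(job, repo="AAAR_Kettle", user="admin",
                    password="admin", level="Basic"):
    """Build the kitchen.bat argument list for one scheduled job."""
    return ["kitchen.bat",
            f'/rep:"{repo}"',
            f'/job:"{job}"',
            "/dir:/Alfresco",
            f"/user:{user}",
            f"/pass:{password}",
            f"/level:{level}"]

cmd = kitchen_command("Get all")
print(" ".join(cmd))
```

The task scheduler simply runs this command (and the corresponding "Report all" job) once a day; no logic lives in the scheduler itself.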
TESTING
8.1 TESTING STRATEGY
Data completeness: Ensures that all expected data is loaded into the target table.
1. Compare record counts between source and target and check for any rejected records.
2. Check that data is not truncated in the columns of the target table.
3. Check that only unique values are loaded into the target; no duplicate records
should exist.
4. Perform boundary value analysis.
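The record-count and duplicate checks above can be sketched as follows; the 'id' key and the sample rows are hypothetical.

```python
# Sketch of two data-completeness checks: record counts must match
# between source and target, and the target must hold no duplicates.

def check_completeness(source_rows, target_rows, key):
    """Return (counts_match, no_duplicates) for the given row sets."""
    counts_match = len(source_rows) == len(target_rows)
    target_keys = [row[key] for row in target_rows]
    no_duplicates = len(target_keys) == len(set(target_keys))
    return counts_match, no_duplicates

source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}, {"id": 2}, {"id": 3}]
print(check_completeness(source, target, "id"))  # (True, True)
```

In practice the same comparison is done with COUNT queries against the source and target databases rather than in memory.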
Data quality: Ensures that the ETL application correctly rejects, substitutes default
values, corrects or ignores and reports invalid data.
Data cleanliness: Unnecessary columns should be deleted before loading into the staging
area.
1. Example: if a column value contains extra spaces, the spaces must be trimmed with the
help of an expression transformation before loading into the staging area.
2. Example: suppose the telephone number and STD code are in different columns and the
requirement says they should be in one column; with the help of an expression
transformation we concatenate the values into one column.
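The two examples above can be sketched as a single cleaning step; the column names and the sample row are hypothetical.

```python
# Sketch of the cleanliness examples: trim stray spaces from a name
# column, and concatenate STD code and telephone number into one
# column, as an expression transformation would.

def clean_row(row):
    return {
        "name": row["name"].strip(),
        "phone": f'{row["std_code"]}-{row["telephone"]}',
    }

row = {"name": "  Farida  ", "std_code": "02692", "telephone": "123456"}
print(clean_row(row))  # {'name': 'Farida', 'phone': '02692-123456'}
```

In Kettle, the equivalent logic lives in a Calculator or User Defined Java Expression step inside the transformation.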
Data transformation: All the business logic implemented using ETL transformations should
be correctly reflected in the target data.
Integration testing: Ensures that the ETL process functions well with other upstream and
downstream processes.
User-acceptance testing: Ensures the solution meets users’ current expectations and
anticipates their future expectations.
Regression testing: Ensures existing functionality remains intact each time a new release
of code is completed.
8.2 TESTING METHODS
• Functional test: it verifies that the item is compliant with its specified business
requirements.
• Usability test: it evaluates the item by letting users interact with it, in order to verify that
the item is easy to use and comprehensible.
• Performance test: it checks that the item performance is satisfactory under typical
workload conditions.
• Stress test: it shows how well the item performs with peak loads of data and very heavy
workloads.
• Recovery test: it checks how well an item is able to recover from crashes, hardware
failures and other similar problems.
• Security test: it checks that the item protects data and maintains functionality as
intended.
• Regression test: It checks that the item still functions correctly after a change has
occurred.
8.3 TEST CASES
8.3.1 USER LOGIN AND USING THE REPORT FUNCTIONALITY
Description: This test validates the user name and password and checks that the user is
able to select the desired report format with the desired selection options.
Table 8.1 Test Case 1
Sr. No | Test Case | Expected Output | Actual Output | Test Case Status
1 | User logs in to his/her page | BA server should open | BA server page opens | Pass
2 | User views a report | Report should be displayed | User is able to view report | Pass
3 | User selects the output format while viewing | User must see the output in the desired format | Desired format of the report is displayed | Pass
4 | User filters the report view | User should see the filtered report | User is able to view the desired report | Pass
8.3.2 VIEWING DOCUMENTS, FOLDERS, PERMISSIONS, AUDITS
Description: This test case checks whether the user is able to view the data of folders
and documents, their permissions, and the audit data.
Table 8.2 Test Case 2
Sr. No | Test Case | Expected Output | Actual Output | Test Case Status
1 | User views the documents | Document details should be displayed | Document is seen | Pass
2 | User views the folders | Folder should be displayed | Folder is seen | Pass
3 | User views the permissions of folders and documents | Permissions must be seen by user | Permissions displayed | Pass
4 | User views the auditing data | Audit data must be displayed | Audit data is seen by user | Pass
USER MANUAL
9.1 DESCRIPTION
This manual describes the working and use of the project, so as to help end users get
familiar with its features.
Our project is divided into three levels:
1. Source Level
2. DWH Level
3. View Level
The source level is the back-end of our project, i.e. the Alfresco database. The DWH
level is PostgreSQL, used to create our Data Mart. And the view level is the Pentaho
tools.
Users see the view level of the project, specifically the Pentaho Business Analytics
tool, where the published reports are deployed. Once in the BA dashboard, the user can
use many of its functionalities, listed below:
1. Login Page
2. View Reports
3. Scheduling
4. Administration
9.2 LOGIN PAGE
Before using the BA server, a user has to log in to the server using his assigned user
name and password, so that the system knows which user has accessed the server and at
what time. This helps for security purposes.
To log in, we follow the steps below:
1. Go to the BI server folder using the command prompt. After changing the directory to
the BI server, start Pentaho.
Figure 9.1 Login Step 1
2. Once Pentaho starts, the system automatically runs Apache Tomcat.
Figure 9.2 Login Step 2
3. If Tomcat does not encounter any error, it opens the user console of the Pentaho BA
server. The user can now log in to the server using their own user name and password.
Figure 9.3 Login Step 3
9.3 VIEW REPORTS
The main requirement of the user is to view the reports in the web browser, so as to
make decisions, among various other uses. To do that, the user has to follow these
steps:
1. Once we log in, the Home screen opens up as shown in the figure. To view reports, we
select 'Browse Files' (1) from the drop-down list.
Figure 9.4 View Reports Step 1
2. Once we select the 'Browse Files' option, the console shows the 'Folders' (2) in
Home and the associated 'Files' (3) of the selected folder in the file box. There is
also a 'Folder Actions' (4) option in the console, which provides various functions
like creating a new folder, deleting a folder, etc.
3. To view a report, we have to select the report from the file box. For example, to see
the Document Permissions report, we click the docpermission-rep (5) file in the file
box. It opens the Document Permissions report (6) in the browser.
Figure 9.5 View Reports Step 2
4. We can apply filters in the report. For example, in this report, we can filter and
list the document according to the permissions by selecting the appropriate
permission (7). Here, we have selected ‘Read’ permission from the ‘select
permissions’ filter.
Also, we can view reports in different styles by selecting the appropriate style from
'Output Type' (8). Here, we have selected the HTML (Single Page) type.
9.4 SCHEDULING
You can schedule reports to run automatically. All of your active scheduled reports
appear in the list of schedules, which you can reach by clicking the Home drop-down
menu, then the Schedules link, in the upper-left corner of the User Console page. You
can also access the list of schedules from the Browse Files page if you have a report
selected.
The list of schedules shows which reports are scheduled to run, the recurrence pattern for
the schedule, when it was last run, when it is set to run again, and the current state of the
schedule.
Figure 9.6 Scheduling Page
Table 9.1 Scheduling options
Item Name | Function
Schedules indicator | Indicates the current User Console perspective that you are using. Schedules displays a list of schedules that you create, a toolbar to work with your schedules, and a list of times that your schedules are blocked from running.
Schedule Name | Lists your schedules by the name you assign to them. Click the arrow next to Schedule Name to sort schedules alphabetically in ascending or descending order.
Repeats | Describes how often the schedule is set to run.
Source File | Displays the name of the file associated with the schedule.
Output Location | Shows the location where the scheduled report is saved.
Last Run | Shows the last time and date when the schedule was run.
Next Run | Shows the next time and date when the schedule will run again.
Status | Indicates the current status of the schedule. The state can be either Normal or Paused.
Blockout Times | Lists the times at which all schedules are blocked from running.
You can edit and maintain each of your schedules by using the controls above the
schedules list, on the right end of the toolbar.
Table 9.2 Scheduling Controls
Icon Name | Function
Refresh | Refreshes the list of schedules.
Run Now | Runs the selected schedule(s) at will.
Stop Scheduled Task | Pauses a specified schedule. Use Start Scheduled Task to resume paused jobs.
Start Scheduled Task | Resumes a previously stopped schedule.
Edit Scheduled Task | Edits the details of an existing schedule.
Remove Scheduled Task | Deletes a specified schedule. If the schedule is currently running, it continues to run, but it will not run again.
9.5 ADMINISTRATION
The User Console has one unified place, called the Administration page, where people
logged in with a role that has permission to administer security can perform system
configuration and maintenance tasks. If you see Administration in the left drop-down
menu on the User Console Home page, you can click it to reveal menu items for
administering the BA Server. If you do not have administration privileges,
Administration does not appear on the home page.
Figure 9.7 Administration Page
Table 9.3 Administration Options
Item Control Name Function
1 Administration Open the Administration perspective of
the User Console. The Administration
perspective enables you to set up users,
configure the mail server, change
authentication settings on the BA
Server, and install software licenses for
Pentaho.
2 Users & Roles Manage the Pentaho users or roles for
the BA Server.
3 Authentication Set the security provider for the BA
Server to either the default Pentaho
Security or LDAP/Active Directory.
4 Mail Server Set up the outgoing email server and
the account used to send reports
through email.
5 Licenses Manage Pentaho software licenses.
6 Settings Manage settings for deleting older
generated files, either manually or by
creating a schedule for deletion.
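The Mail Server settings correspond to the pieces needed to send a scheduled report by email: an outgoing SMTP server and a sender account. A minimal Python sketch of what such a message looks like follows; the host, port, and sender address are placeholder assumptions, not values from our installation.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical values standing in for the Mail Server settings;
# the SMTP host, port, and sender account are site-specific.
SMTP_HOST, SMTP_PORT = "smtp.example.org", 587
SENDER = "reports@example.org"

def build_report_mail(recipient, report_name):
    """Assemble the email that carries a scheduled report."""
    msg = MIMEMultipart()
    msg["From"] = SENDER
    msg["To"] = recipient
    msg["Subject"] = f"Scheduled report: {report_name}"
    msg.attach(MIMEText("Please find the generated report attached."))
    return msg

# Sending would use smtplib against the configured server:
# import smtplib
# with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as s:
#     s.starttls()
#     s.send_message(build_report_mail("user@example.org", "Sales"))
```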
LIMITATIONS AND FUTURE ENHANCEMENTS
10.1 LIMITATIONS
All the data is stored in a single repository in Alfresco. If data backups are not
managed properly, there is a risk of data loss.
Since the Community Edition of the Pentaho Data Integration tool offers only a limited
set of features, scheduling had to be done manually.
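In practice, the Community Edition's missing scheduler is commonly worked around by running PDI jobs from the operating system's own scheduler. The sketch below builds the Kitchen command line for a job and shows the matching crontab entry; the install and job paths are illustrative assumptions, not our actual deployment.

```python
# Paths below are illustrative assumptions; adjust for the local PDI install.
KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"

def kitchen_command(job_file, log_level="Basic"):
    """Build the Kitchen command line that runs a PDI job (.kjb).
    Registering this command with cron (or Windows Task Scheduler)
    supplies the scheduling that the Community Edition lacks."""
    return [KITCHEN, f"-file={job_file}", f"-level={log_level}"]

cmd = kitchen_command("/home/etl/jobs/load_warehouse.kjb")
# On a machine with PDI installed, the job would be launched with:
#   import subprocess; subprocess.run(cmd, check=True)
# Equivalent crontab entry (nightly at 02:00):
#   0 2 * * * /opt/pentaho/data-integration/kitchen.sh -file=/home/etl/jobs/load_warehouse.kjb -level=Basic
```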
10.2 FUTURE ENHANCEMENTS
We were able to consolidate the 97 Alfresco tables into 29 tables in the data
warehouse. This number could be reduced further in the future to increase efficiency.
More sophisticated requirements, such as hyperlink functions and ticket generation for
employees, can be implemented.
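The table consolidation mentioned above amounts to denormalization: several narrow source tables are collapsed into one wide dimension table in the warehouse. A toy Python sketch of the idea, with entirely hypothetical table names and attributes:

```python
# Toy illustration (hypothetical names) of consolidating normalized
# source tables into one denormalized dimension table.
def build_user_dim(source_props, source_emails):
    """Merge per-attribute source tables, keyed by the same entity id,
    into a single row per entity."""
    return [
        {"user_id": uid, **props, "email": source_emails.get(uid)}
        for uid, props in source_props.items()
    ]

source_props = {1: {"first": "A", "last": "B"}}
source_emails = {1: "a.b@example.org"}
dim_user = build_user_dim(source_props, source_emails)
```

Fewer, wider tables mean fewer joins at report time, which is why reducing the table count further would improve query efficiency.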
CONCLUSION AND DISCUSSION
11.1 SELF ANALYSIS OF PROJECT VIABILITIES
11.1.1 Self Analysis
We have created an information repository, i.e. a data warehouse, over the already
existing Alfresco database system. We have successfully installed the application,
tested its performance on several fronts, and completed validation testing. The
project has been accomplished in such a way that it incorporates several features
demanded by present report-generation and decision-making requirements.
11.1.2 Project Viabilities
This project has been completed successfully and is viable for use at the Institute
for Plasma Research as a tool for generating reports from the data stored in their
database, Alfresco. These reports are user-friendly, with strong GUI support through
a host of graphical options such as pie charts, line graphs, and bar charts. These
reports make decision making easier for the management department.
11.2 PROBLEMS ENCOUNTERED AND POSSIBLE SOLUTIONS
Alfresco was a new system that we had never used before, so for the first three to
four weeks it was difficult to understand all of its functionality and workings.
The Alfresco GUI was not accessible on either of our computers, because of which we
had to install the PostgreSQL database and the SQuirreL SQL client.
It took time to finalize the ETL and reporting tools. We eventually narrowed the
choice down to Pentaho over JasperSoft and BIRT.
Pentaho is essentially a collection of tools, with each stage of our project
handled by a particular tool. Thus, we had to familiarize ourselves with a host
of Pentaho tools.
Alfresco Audit Analysis and Reporting (A.A.A.R.) did not convert many of our
tables while transforming them into the data warehouse, so we had to migrate
them manually.
11.3 SUMMARY OF PROJECT WORK
PROJECT TITLE
DATA AND BUSINESS PROCESS INTELLIGENCE
It is a project based on data mining: a data warehouse is created, from which
data is used to generate user-friendly reports.
PROJECT PLATFORM
PENTAHO
It is an open-source provider of reporting, analysis, dashboard, data mining and
workflow capabilities.
SOFTWARE USED
Windows/Linux based system
PostgreSQL Database
SQuirreL Database
Alfresco ECM
Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)
Alfresco Audit Analysis and Reporting (A.A.A.R.) tool
Notepad++
DOCUMENTATION TOOLS
VISIO 2013
WORD 2007
EXCEL 2007
INTERNAL PROJECT GUIDE
PROF. R.S. CHHAJED
EXTERNAL PROJECT GUIDE
MR. VIJAY PATEL
COMPANY
INSTITUTE FOR PLASMA RESEARCH
SUBMITTED BY
BHAGAT FARIDA H.
SINGH SWATI
SUBMITTED TO
DHARMSINH DESAI UNIVERSITY
PROJECT DURATION
8TH DEC 2014 TO 28TH MARCH 2015
REFERENCES
http://wiki.pentaho.com/display/Reporting/01.+Creating+Your+First+Report
http://infocenter.pentaho.com/help/index.jsp?topic=%2Freport_designer_user_guide%2Ftask_adding_hyperlinks.html
http://www.robertomarchetto.com/pentaho_report_parameter_example
http://docs.alfresco.com/4.2/concepts/alfresco-arch-about.html
http://fcorti.com/alfresco-audit-analysis-reporting/aaar-description-of-the-solution/aaar-pentaho-data-integration/
http://en.wikipedia.org/wiki/Pentaho
http://www.joyofdata.de/blog/getting-started-with-pentaho-bi-server-5-mondrian-and-saiku/
https://technet.microsoft.com/en-us/library/aa933151(v=sql.80).aspx
http://datawarehouse4u.info/OLTP-vs-OLAP.html