lds assignment mpaani

m.Paani Lead Data Science Role Assignment Question 1

For the first question, you will dive into and analyse daily coal stock reports of Indian thermal power

plants for the years 2009-2012. The data is available as part of the Republic of India's Open

Government Data (OGD) Initiative and can be accessed at http://bit.ly/1yfkXZN or by searching for

"Coal Statement of Thermal Power Stations" at data.gov.in. You will:

1) Arrange the 48 CSV files into a continuous time series database or data frame. We are

especially interested in how you deal with missing, erroneous, and non-uniform data. Briefly

explain how you handled missing and erroneous data in the time series.

2) Choose one of the following tasks and WOW us with your modelling skills and data-driven

insights:

- Segment the coal thermal plants into distinguishable groups based on a clustering

strategy of your choice. You are free to define the number of groups and methodology

yourself but make sure you have a method to assess the separability of your resultant

groups.

OR

- Forecast whether coal stocks will reach the Super-Critical state the next day for

any given power plant. How accurate is your model? What are its limitations and

strengths?
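As a starting point for task 1, the sketch below shows one possible way to merge several CSV files into a continuous daily series while keeping gaps and bad values visible. It is only an illustration under stated assumptions: the column names `date`, `plant`, and `stock_days`, the `%Y-%m-%d` date format, and the helper names `parse_stock` and `build_daily_series` are hypothetical and are not taken from the actual data.gov.in files, whose layout you will need to inspect yourself.

```python
import csv
import io
from datetime import datetime, timedelta


def parse_stock(raw):
    """Return the stock as a float, or None when the cell is blank, non-numeric, or negative."""
    try:
        value = float(raw.strip())
    except (AttributeError, ValueError):
        return None
    # NaN fails this comparison too, so it is also treated as missing.
    return value if value >= 0 else None


def build_daily_series(csv_texts):
    """Merge CSV texts into {plant: [(date, stock_or_None), ...]} with no date gaps."""
    readings = {}  # (plant, date) -> stock value or None; a later file overwrites an earlier one
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            # Hypothetical column names and date format (assumptions, not the real schema).
            day = datetime.strptime(row["date"].strip(), "%Y-%m-%d").date()
            plant = row["plant"].strip().upper()  # normalise non-uniform plant names
            readings[(plant, day)] = parse_stock(row["stock_days"])
    if not readings:
        return {}
    days = [d for _, d in readings]
    first, last = min(days), max(days)
    calendar = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    plants = sorted({p for p, _ in readings})
    # Days without a report stay None so that gaps are explicit rather than silently dropped.
    return {p: [(d, readings.get((p, d))) for d in calendar] for p in plants}


# Two tiny made-up files: one unparseable value, one negative value, one missing day.
jan = "date,plant,stock_days\n2009-01-01,Plant A,12\n2009-01-02,Plant A,n/a\n"
feb = "date,plant,stock_days\n2009-01-04,plant a,-3\n2009-01-05,Plant A,7.5\n"
series = build_daily_series([jan, feb])
```

Keeping missing and erroneous readings as `None` defers the choice of imputation (forward fill, interpolation, or exclusion) to the modelling step, where it can be explained and justified.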

Instructions: Document and present your answers and results in a document, presentation,

web application, or medium of your choice. Make sure to attach your code (if any) and

heavily comment it so we can really dive into your thought process. Have fun!

m.Paani Lead Data Science Role Assignment Question 2

For the second question, we would like you to imagine that you are the Lead Data Scientist for a

loyalty company which handles millions of shopping transactions a day. These transactions are

carried out both online and through physical retail outlets. Management's main objective is to

make the transactions simple and easy so members can earn points towards redeeming the

reward of their choice. However, it has come to your attention that there is a small but steady

increase in the number of fraud and identity theft incidents among the company's customer base.

You and your team of 5 Data Scientists (1 Senior Data Scientist, 2 Junior Data Scientists, 1 Data

Visualization Expert, and 1 GIS Expert) are entrusted with the task of detecting fraudulent behaviour

efficiently and in real time. As you are well aware, any false positives on your end will result in

erosion of customer trust whereas false negatives cost the company serious money.

Please:

1) Develop a high-level workflow of how you propose to tackle the challenge of developing

and implementing a real-time fraud detection engine that can sort through millions of

transactions a day. We would like you to focus on, but by no means limit yourself to, the

following points:

- Allocation of tasks and responsibilities among each member of your team

- What technology and platform you would be using to implement the fraud

detection engine

- Model structure, execution, and validation

- How to handle the transactions that do get flagged as fraudulent?

- The use of data visualization in enhancing the fraud discovery process

2) Present your system in a format that can effectively communicate your thought process.

Be ready to discuss your proposed solution with a member of the mPaani data science

team once you have submitted the assignment.
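The trade-off described above, where false positives erode customer trust and false negatives cost money, can be made concrete when validating a model. The sketch below is one possible illustration, not a prescribed method: it assumes a model that outputs a fraud score per transaction, and the cost values, sample scores, and function names `total_cost` and `best_threshold` are all hypothetical.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Total cost of flagging every transaction whose score is >= threshold."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += fp_cost  # genuine customer wrongly flagged
        elif not flagged and is_fraud:
            cost += fn_cost  # fraud that slipped through
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total cost on labelled data."""
    # Every observed score is a candidate; infinity means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Made-up validation data: model scores and whether each transaction was fraud.
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [False, False, True, False, True]

# When missed fraud is 10x as costly as a false alarm, a low threshold wins.
lenient = best_threshold(scores, labels, fp_cost=1, fn_cost=10)
# When a false alarm is 10x as costly as missed fraud, a high threshold wins.
strict = best_threshold(scores, labels, fp_cost=10, fn_cost=1)
```

With these made-up numbers, `lenient` is 0.35 and `strict` is 0.9, which shows how the relative cost assigned to each error type moves the operating point of the same model.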
