lds assignment mpaani


Upload: ramesh158

Post on 15-Sep-2015


  • m.Paani Lead Data Science Role Assignment Question 1

    For the first question, you will dive into and analyse daily coal stock reports of Indian thermal power
    plants for the years 2009-2012. The data is available as part of the Republic of India’s Open
    Government Data (OGD) Initiative and can be accessed at http://bit.ly/1yfkXZN or by searching for
    “Coal Statement of Thermal Power Stations” at data.gov.in. You will:

    1) Arrange the 48 CSV files into a continuous time series database or data frame. We are
    especially interested in how you deal with missing, erroneous, and non-uniform data. Briefly
    explain how you handled missing and erroneous data in the time series.

    2) Choose one of the following tasks and WOW us with your modelling skills and data-driven
    insights:
    - Segment the coal thermal plants into distinguishable groups based on a clustering
      strategy of your choice. You are free to define the number of groups and methodology
      yourself, but make sure you have a method to assess the separability of your resultant
      groups.
    OR
    - Forecast whether coal stocks will reach “Super-Critical” state the following (next) day for
      any given power plant. How accurate is your model? What are its limitations and
      strengths?
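
As an illustration of step 1, the sketch below shows one possible way to merge several CSV documents into a continuous daily series per plant. It is only a sketch under stated assumptions: the column names (`date`, `plant`, `stock_days`), the date layouts, and the `max_fill` limit for forward-filling are hypothetical and are not taken from the actual OGD files. Negative or unparseable stock values are treated as missing, and every imputed value is flagged so that it can be excluded or audited later.

```python
import csv
import io
import math
from datetime import datetime, timedelta

# Hypothetical column names; the real OGD files may use different headers.
DATE_COL, PLANT_COL, STOCK_COL = "date", "plant", "stock_days"


def parse_date(text):
    """Try a few common date layouts; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_stock(text):
    """Return a finite, non-negative float, or None for blanks and bad values."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) and value >= 0 else None


def load_rows(csv_texts):
    """Merge several CSV documents into {(plant, day): stock or None}."""
    table = {}
    for text in csv_texts:
        for raw in csv.DictReader(io.StringIO(text)):
            # Normalise header spelling so non-uniform files line up.
            row = {(k or "").strip().lower(): v for k, v in raw.items()}
            day = parse_date(row.get(DATE_COL) or "")
            plant = (row.get(PLANT_COL) or "").strip().upper()
            if day is None or not plant:
                continue  # unusable row: no key to place it on the timeline
            table[(plant, day)] = parse_stock(row.get(STOCK_COL))
    return table


def continuous_series(table, plant, max_fill=3):
    """Daily (day, value, imputed) tuples for one plant, with no date gaps.

    Missing or invalid readings are forward-filled for at most `max_fill`
    consecutive days; longer gaps stay None.
    """
    days = sorted(d for (p, d) in table if p == plant)
    if not days:
        return []
    series, last, gap = [], None, 0
    day = days[0]
    while day <= days[-1]:
        value = table.get((plant, day))
        if value is not None:
            last, gap, imputed = value, 0, False
        else:
            gap += 1
            # Carry the last reading forward only across short gaps.
            value = last if (last is not None and gap <= max_fill) else None
            imputed = value is not None
        series.append((day, value, imputed))
        day += timedelta(days=1)
    return series
```

Forward-filling is only one option; interpolation or leaving gaps untouched may suit the data better, and the choice should be stated in the write-up either way.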

    Instructions: Document and present your answers and results in a document, presentation,
    web application, or medium of your choice. Make sure to attach your code (if any) and
    heavily comment it so we can really dive into your thought process. Have fun!
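
For the clustering option, one common way to assess how separable the resulting groups are is the mean silhouette coefficient. The brief does not prescribe a metric, so this is just one candidate. The sketch below computes it in plain Python with Euclidean distance, assuming each plant has already been reduced to a numeric feature vector; which features to use is left open here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1).

    Values near 1 indicate compact, well-separated clusters; values near 0
    or below indicate overlapping or misassigned clusters.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: its silhouette is taken as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Comparing this score across different numbers of groups gives a simple, reportable basis for choosing how many groups to keep.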

  • m.Paani Lead Data Science Role Assignment Question 2

    For the second question, we would like you to imagine that you are the Lead Data Scientist for a
    loyalty company which handles millions of shopping transactions a day. These transactions are
    carried out both online and through physical retail outlets. Management’s main objective is to
    make the transactions simple and easy so members can earn points towards redeeming the
    reward of their choice. However, it has come to your attention that there is a small but steady
    increase in the number of fraud and identity theft incidents among the company’s customer base.

    You and your team of 5 Data Scientists (1 Senior Data Scientist, 2 Junior Data Scientists, 1 Data
    Visualization Expert, and 1 GIS Expert) are entrusted with the task of detecting fraudulent behaviour
    efficiently and in real time. As you are well aware, any false positives on your end will erode
    customer trust, whereas false negatives cost the company serious money.
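
The trade-off between false positives and false negatives can be made explicit by assigning each error type a cost and choosing the decision threshold that minimises the total cost on labelled validation data. The sketch below is one hypothetical way to do that. The parameters `cost_fp` and `cost_fn` are assumed inputs that the business would have to estimate; they are not given in the brief.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of flagging every transaction whose fraud score is >= threshold."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp  # legitimate customer wrongly challenged
        elif not flagged and is_fraud:
            cost += cost_fn  # fraud that slipped through
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Scan candidate thresholds; return the (threshold, cost) pair with lowest cost."""
    # Each observed score is a candidate; infinity means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        ((t, total_cost(scores, labels, t, cost_fp, cost_fn)) for t in candidates),
        key=lambda pair: pair[1],
    )
```

When missed fraud is priced much higher than a false alarm, the chosen threshold moves down and more transactions are flagged; reversing the costs moves it up.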

    Please:

    1) Develop a high-level workflow of how you propose to tackle the challenge of developing
    and implementing a real-time fraud detection engine that can sort through millions of
    transactions a day. We would like you to focus on, but by no means limit yourself to, the
    following points:
    - Allocation of tasks and responsibilities among each member of your team
    - What technology and platform you would be using to implement the fraud
      detection engine
    - Model structure, execution, and validation
    - How to handle the transactions that do get flagged as fraudulent
    - The use of data visualization in enhancing the fraud discovery process

    2) Present your system in a format that can effectively communicate your thought process.
    Be ready to discuss your proposed solution with a member of the m.Paani data science
    team once you have submitted the assignment.