
DEEP-HybridDataCloud

PLANT CLASSIFICATION WITH DEEP LEARNING

DELIVERABLE: D2.1 (ANNEX 3)

Document identifier: DEEP-NA2-D2.1-Annex3-V8.0.odt

Date: 16/05/2018

Activity: WP2

Lead partner: HMGU

Status: FINAL

Dissemination level: PUBLIC

Permalink: http://hdl.handle.net/10261/164311

DEEP-HybridDataCloud – 777435


Copyright Notice

Copyright © Members of the DEEP-HybridDataCloud Collaboration, 2017-2020.

Delivery Slip

             Name                  Partner/Activity   Date
From         Wolfgang zu Castell   HMGU / WP2         16/05/2018
Reviewed by  Ignacio Blanquer      UPV                23/04/2018
             Fernando Aguilar      CSIC
             Álvaro López          CSIC
Approved by  Steering Committee                       27/04/2018

Document Log

Issue  Date        Comment                        Author/Partner
V1.0   26/02/2018  First template version         Álvaro López / CSIC
V2.0   26/03/2018  TOC                            Wolfgang zu Castell / HMGU, Marcus Hardt / KIT, Lara Lloret / CSIC
V3.0   29/03/2018  First version                  Daniel García / CSIC, Ignacio Heredia / CSIC
V4.0   20/04/2018  Use Case Description Updated   Daniel García / CSIC, Ignacio Heredia / CSIC
V5.0   20/04/2018  Use Case Requirements Updated  Daniel García / CSIC, Ignacio Heredia / CSIC
V6.0   23/04/2018  External Review                Ignacio Blanquer / UPV, Fernando Aguilar / CSIC
V7.0   24/04/2018  Internal Review                Álvaro López / CSIC
V8.0   25/05/2018  Final Version                  Daniel García / CSIC, Ignacio Heredia / CSIC


Table of Contents

1. Executive Summary
    1.1. Identification
    1.2. Brief description of the Use Case
    1.3. Expectations in the framework of the Deep Hybrid Datacloud Project
    1.4. Expected results and derived impact
    1.5. References useful to understand the Case Study
2. Introduction and Use Case
    2.1. Presentation on the Use Case
    2.2. Description of the research community
    2.3. Current Status and Plan for this Use Case
    2.4. Identification of the KEY scientific goals
    2.5. Description of potential development
3. Technical description of the use case
    3.1. User categories and roles
    3.2. General description of datasets/formats/software used
    3.3. Technological (S/T) requirements
    3.4. Identification of required services
    3.5. Description of the use case in terms of workflows
4. Data Requirements
    4.1. Access Control
        4.1.1. Privacy
        4.1.2. Location
        4.1.3. Sharing
    4.2. Capacity (Data Volume)
        4.2.1. Test Data / Production Data
        4.2.2. Transfer rate requirements
        4.2.3. Preservation requirements
5. Infrastructure and technical requirements
    5.1. Expectation regarding the advantage through the use of technology
    5.2. Expectations regarding e-Infrastructure use
        5.2.1. Networking
        5.2.2. Computing: Clusters, Grid, Cloud, Supercomputing resources
        5.2.3. Storage
    5.3. On (user-facing) monitoring (and Accounting)
    5.4. On authentication and authorization infrastructure (AAI)
6. Formal list of requirements
7. Use case summary table
8. References


1. Executive Summary

In this case study we will develop a tool that automatically classifies images of plants using state-of-the-art convolutional neural networks. The tool will be trained on images made available by communities of naturalists. We will then deploy it so that users can access the machine's expert-level predictions, and we will try to integrate this machine knowledge with communities of naturalists.

We will also let users retrain the tool on their own dataset (of animals, for example), thus creating an on-demand image classification tool.

1.1. Identification

Name: Plant Classification with Deep Learning

Institution/Partner: CSIC

Contacts: Ignacio Heredia (CSIC), [email protected]

1.2. Brief description of the Use Case

In this era of big data, where everybody carries a smartphone, collaborative citizen science platforms have sprung up that let users easily share their observations. We intend to collect these freely available observations and build deep learning tools around them.

This Use Case describes a tool that automatically identifies plant species from images using Deep Learning. This can be very helpful for automatically monitoring biodiversity at a large scale, thereby relieving scientists of the tedious task of hand-labelling images.

In addition, a user without expert (machine learning) knowledge can easily retrain this tool to perform image classification on a different dataset.

1.3. Expectations in the framework of the Deep Hybrid Datacloud Project

• Development of tools to automatically download and save large datasets, as well as an environment to process these large datasets and save the derived products.

• Development of deep learning tools to analyse those data and extract meaningful insights from them.

1.4. Expected results and derived impact

Expected results

• We expect to deliver a fully functioning tool that automatically classifies plant images into several species categories.

• We expect to integrate this tool with biodiversity communities to help them monitor biodiversity.

Derived impact

• We expect that wide use of this tool will relieve scientists of the tedious job of classifying images and let them focus on more creative tasks instead.

• We expect that the adoption of this tool will make it possible to scale up, by several orders of magnitude, the amount of data that is currently hand-processed by experts.

• We expect that this tool will help democratize biodiversity knowledge by giving anyone access to expert predictions (made by the machine).

1.5. References useful to understand the Case Study

• Large-Scale Plant Classification with Deep Neural Networks, Ignacio Heredia, Proceedings of the Computing Frontiers Conference (2017), 259-262; arXiv:1706.03736.

2. Introduction and Use Case

2.1. Presentation on the Use Case

In this case study we will develop a tool that automatically classifies images of plants using state-of-the-art convolutional neural networks. The tool will be trained on images made available by communities of naturalists. We will then deploy it so that users can access the machine's expert-level predictions, and we will try to integrate this machine knowledge with communities of naturalists.

We will also let users retrain the tool on their own dataset (of animals, for example), thus creating an on-demand image classification tool.

2.2. Description of the research community

The community is composed of developers and users:

• Manager/Developer: Developers will be in charge of dataset preprocessing and data ingestion. They will also develop the deep learning tools that perform the classification.

• Researcher/User: Users will be able to use the tool as it is (Level 3 User) or retrain the tool to perform image classification on their own dataset (Level 2 User). Level 1 Users (users with machine learning knowledge) could tweak the network architecture and hyperparameters to modify the training pipeline, but this is out of the scope of this Use Case.



2.3. Current Status and Plan for this Use Case

We currently have a web service that performs image classification of plants with deep learning, as well as another service to identify Conus. An early integration tool has been developed to connect the classifications with biodiversity communities like Natusfera or iNaturalist. All these tools can be found at http://deep.ifca.es. The plan is to further develop the existing tools (by including more species or modifying the deep learning architecture to make the classification more reliable) and to develop new tools (such as one covering animals).

All portals are hosted at IFCA servers.

2.4. Identification of the KEY scientific goals

• Produce a tool that is able to classify plant species from images.

• Have the results produced by the developed tools validated by biodiversity experts.

• Deploy this tool for automated monitoring of biodiversity. This means that we will make the tool available to be queried through an API so that communities and scientists can automatically retrieve predictions for each new observation they have.

2.5. Description of potential development

There are multiple possible beneficiaries of this tool at different levels:

• Citizens could use these tools to gain insight in their daily life.

• Communities of naturalists like iNaturalist or Natusfera could use it as a second opinion when experts disagree, to suggest a possible species to experts and so make hand labelling faster, or even to take full charge of the labelling process if experts cannot cope with the huge amounts of data provided by users/sensors.

• Government agencies could use these tools to better monitor ecosystems and therefore design better environmental policies.

• European platforms for biodiversity data like LifeWatch could use it for the same reasons.

3. Technical description of the use case

3.1. User categories and roles

• Level 3 user: This user will typically use the plant classification tool to identify their own plant photos.

• Level 2 user: This user will be able to easily retrain the tool on their own dataset and create a new tool (an animal classification tool, for example).



3.2. General description of datasets/formats/software used

Datasets

• Biodiversity observations: These are photos of species in the wild taken by citizen scientists. They are usually in a standard photo format (.JPEG, .PNG). This dataset will be divided into training, validation and testing subsets.
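One way to produce the three subsets is a random split performed separately for each species, so that every species is represented in each subset. The sketch below is purely illustrative: the 80/10/10 proportions, the fixed seed, the file names and the `split_dataset` helper are assumptions, not choices stated in this document.

```python
import random
from collections import defaultdict


def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (image_path, species) pairs into train/val/test per species.

    The fractions and the seed are illustrative assumptions.
    """
    by_species = defaultdict(list)
    for path, species in samples:
        by_species[species].append(path)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for species, paths in sorted(by_species.items()):
        rng.shuffle(paths)
        n_val = int(len(paths) * val_frac)
        n_test = int(len(paths) * test_frac)
        splits["val"] += [(p, species) for p in paths[:n_val]]
        splits["test"] += [(p, species) for p in paths[n_val:n_val + n_test]]
        splits["train"] += [(p, species) for p in paths[n_val + n_test:]]
    return splits


# Toy example with hypothetical file names and two species.
samples = [(f"img_{i}.jpg", "Quercus ilex" if i % 2 else "Pinus pinea")
           for i in range(40)]
splits = split_dataset(samples)
print({name: len(items) for name, items in splits.items()})
```

Splitting per species rather than over the whole pool keeps rare species from ending up only in the training subset, although species with very few photos will still contribute nothing to validation or testing under these fractions.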

Software

Most analysis and development has been carried out in Python, using:

• Modules for scientific data analysis like NumPy and SciPy.

• Modules for image processing like OpenCV and PIL.

• Deep learning frameworks like TensorFlow and PyTorch and wrappers around them (Keras, TensorLayer).

3.3. Technological (S/T) requirements

• Storage in the order of TB to store biodiversity images.

• Powerful GPUs to efficiently train the deep learning applications.


3.4. Identification of required services

3.5. Description of the use case in terms of workflows

Figure 1. Typical workflow for training an image classification application

4. Data Requirements

Biodiversity data are collected from portals like iNaturalist (https://www.inaturalist.org/) or Natusfera (http://natusfera.gbif.es/), which enable users to easily upload their observation photos. Observations are then classified by the community. Additionally, one could merge data from other (minor) sources like PlantNet (https://identify.plantnet-project.org/) to extend the species database.

Observations from iNaturalist and Natusfera can be downloaded through an official API using a simple Python script. Observations from PlantNet have to be downloaded through web scraping. Along with the images we will save the metadata of the images (author, date, original URL, species, ID, etc.) in JSON format. Updates to the database can be launched periodically, using the same script to query the API for observations made in a new time range.
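The retrieval step could look roughly like the sketch below. It is not the script used by the authors: it assumes the public iNaturalist v1 `observations` endpoint with its `d1`/`d2` date filters, and the JSON field names read in `to_metadata` (`user.login`, `observed_on`, `taxon.name`, `photos[].url`, `photos[].license_code`) should be checked against the current API documentation. The helper names are hypothetical, and Natusfera and PlantNet are not covered.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.inaturalist.org/v1/observations"  # assumed endpoint


def build_query(d1, d2, page=1, per_page=200):
    """URL for plant observations with photos observed between d1 and d2."""
    params = {
        "iconic_taxa": "Plantae",
        "photos": "true",
        "d1": d1,  # start date, YYYY-MM-DD
        "d2": d2,  # end date, YYYY-MM-DD
        "page": page,
        "per_page": per_page,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(url):
    """Download one page of results (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["results"]


def to_metadata(obs):
    """One record per photo with the fields listed in the text, plus license."""
    taxon = obs.get("taxon") or {}
    user = obs.get("user") or {}
    return [
        {
            "id": obs.get("id"),
            "author": user.get("login"),
            "date": obs.get("observed_on"),
            "species": taxon.get("name"),
            "original_url": photo.get("url"),
            "license": photo.get("license_code"),
        }
        for photo in obs.get("photos", [])
    ]


# Offline example with a hand-written record (not real data).
example = {
    "id": 1,
    "observed_on": "2018-03-01",
    "user": {"login": "some_user"},
    "taxon": {"name": "Quercus ilex"},
    "photos": [{"url": "https://example.org/photo.jpg",
                "license_code": "cc-by-nc"}],
}
records = to_metadata(example)
print(json.dumps(records, indent=2))
print(build_query("2018-03-01", "2018-03-31"))
```

A periodic update would then call `build_query` with the next date range, page through the results with `fetch_page`, and append the new records to the existing JSON file.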



The workflow is therefore to first launch a dataset retrieval and train the application on this dataset. Each time one wants to improve the application, one launches a dataset update to add new images to the training dataset and then launches the network retraining (fine-tuning).

4.1. Access Control

4.1.1. Privacy

Data (images) are publicly available on the web and thus there are no privacy concerns.

4.1.2. Location

We need a local copy of the data to have fast access to them during training. This copy can be made automatically by accessing the data through the iNaturalist/Natusfera API.

4.1.3. Sharing

Image sharing

iNaturalist/Natusfera images are released by default under a Creative Commons Attribution-NonCommercial license [iNaturalistPol]. Some people have chosen to revoke the Creative Commons license to retain complete legal control over copies of their photos, while others have chosen different versions of the Creative Commons license (or have chosen to waive their copyright entirely in the form of the CC0 declaration). PlantNet images are released under a Creative Commons Attribution-ShareAlike 2.0 license [PlantNetPol].

It is important to note that, once trained, the application (the neural network) does not make further use of the data.
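Because licenses differ from photo to photo, any redistribution of the images themselves (as opposed to the trained model) would have to look at the license stored with each record. A minimal sketch, assuming each metadata record carries a `license` field holding a Creative Commons short code and that a missing value means all rights reserved; both conventions, the helper names and the `allowed` set are assumptions made for illustration, and which licenses are acceptable depends on the intended use:

```python
from collections import Counter


def license_summary(records):
    """Count photos per license; a missing license counts as reserved."""
    return Counter(r.get("license") or "all-rights-reserved" for r in records)


def redistributable(records, allowed=("cc0", "cc-by", "cc-by-nc", "cc-by-sa")):
    """Keep records whose license is in the (illustrative) allowed set."""
    return [r for r in records if r.get("license") in allowed]


# Hand-written records, not real data.
records = [
    {"id": 1, "license": "cc-by-nc"},
    {"id": 2, "license": None},
    {"id": 3, "license": "cc0"},
]
summary = license_summary(records)
shareable = redistributable(records)
print(summary, [r["id"] for r in shareable])
```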

Model sharing

We plan to make the trained model available for everyone wanting to run the image classification service on their local resources.

4.2. Capacity (Data Volume)

4.2.1. Test Data / Production Data

The data capacity will be in the order of TBs. For example, the 1.5 million plant images that will be used to train the tools will take around 2 TB of storage.
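These two figures imply an average size per image, which is convenient for scaling the estimate to other dataset sizes. The short calculation below uses decimal units (1 TB = 10^12 bytes); this is an assumption, since the text does not say which convention is meant, and the 5 million image case is only an example.

```python
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6

n_images = 1_500_000
dataset_bytes = 2 * TB

# Average size of one image implied by the figures in the text.
avg_image_mb = dataset_bytes / n_images / MB
print(f"average image size: {avg_image_mb:.2f} MB")


def storage_tb(num_images, avg_mb=avg_image_mb):
    """Estimated storage in TB for a dataset of num_images images."""
    return num_images * avg_mb * MB / TB


print(f"5 million images: {storage_tb(5_000_000):.1f} TB")
```

So each image takes roughly 1.3 MB on average, and storage grows linearly with the number of images at that rate.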

4.2.2. Transfer rate requirements

Dataset transfer from user local storage to cloud storage unit (in the retraining case)

The data transfer rate should be high enough for users who want to retrain the tool with their own dataset.

Transfer during training from cloud storage unit to GPU

Data transfer to the GPU is the main speed bottleneck when training deep learning tools, so this rate should be as high as possible.
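To put rough numbers on both transfers: the storage has to deliver at least (images consumed per second) × (average image size), and an upload takes (dataset size) / (bandwidth). The sketch below takes the average image size implied by section 4.2.1 (2 TB for 1.5 million images, decimal units assumed); the GPU rate of 500 images per second, the 50 GB user dataset and the 100 Mbit/s link are purely hypothetical values chosen for illustration.

```python
def required_read_rate_mb_s(images_per_second, avg_image_mb):
    """Sustained read rate (MB/s) needed so the GPU never waits for data."""
    return images_per_second * avg_image_mb


def upload_time_hours(dataset_gb, bandwidth_mbit_s):
    """Time to upload a dataset over a link of the given bandwidth."""
    seconds = dataset_gb * 8_000 / bandwidth_mbit_s  # 1 GB = 8000 Mbit
    return seconds / 3600


avg_image_mb = 2e12 / 1.5e6 / 1e6  # from 2 TB for 1.5 million images
rate = required_read_rate_mb_s(500, avg_image_mb)  # 500 img/s is hypothetical
hours = upload_time_hours(50, 100)  # 50 GB over 100 Mbit/s, hypothetical
print(f"read rate: {rate:.0f} MB/s, upload: {hours:.1f} h")
```

Real values depend heavily on the model, the image preprocessing and the hardware, so these functions are meant for order-of-magnitude planning only.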

4.2.3. Preservation requirements

After training is complete, the plant images no longer need to be stored. However, keeping them could be useful for future retraining.

5. Infrastructure and technical requirements

5.1. Expectation regarding the advantage through the use of technology

5.2. Expectations regarding e-Infrastructure use

5.2.1. Networking

The data must be accessible from both CPU and GPU infrastructure, with low-latency interconnections among cores and nodes.

5.2.2. Computing: Clusters, Grid, Cloud, Supercomputing resources

Infrastructure for development

We need servers with one or several GPU units for fast training of deep learning applications.

Infrastructure for deployment

The deployment needs servers for hosting the web services that will enable users to access the tools. Servers that serve the deep learning tools should preferably have GPUs to make fast predictions (although they can also function with CPUs).

5.2.3. Storage

Infrastructure for development

The development of the tools needs large amounts of storage for the images.

Infrastructure for deployment

The deployment doesn't have any particular requirement in terms of storage.


5.3. On (user-facing) monitoring (and Accounting)

5.4. On authentication and authorization infrastructure (AAI)

6. Formal list of requirements

See table below.


7. Use case summary table

Use Case: Plant classification with Deep Learning

Software and services used: Python, web services that will enable users to access the tools

Machine / Deep Learning tools:

• Python: NumPy, SciPy.

• Image processing: OpenCV, PIL.

• DL: start with TensorFlow + Keras (TensorLayer?); may move to PyTorch in the future.

Computing: Multi-GPU nodes (GPUs suitable for deep learning)

Memory requirements: At least 8 GB (GPU)

Networking:

• From user local storage to cloud storage: high enough to transfer the user’s own dataset.

• Between cloud storage and GPU: as fast as possible, as it is a typical bottleneck.

Storage requirements (permanent, temporary):

• In the order of 10 TB (dataset for training)

External data access requirements:

• Portals like iNaturalist (https://www.inaturalist.org) or Natusfera (http://natusfera.gbif.es) through an official API.

• PlantNet (https://identify.plantnet-project.org) via web scraping.

Privacy: No privacy concerns as data are publicly available

Other requirements: Authentication and Authorization must be compatible with the architecture designed in the context of the AARC2 project

Other comments:

• This use case will involve millions of images, so training will need to be distributed.

• The use case will first be run/developed on a local machine; a containerized solution will be needed later.

Relevant references or URLs: Large-Scale Plant Classification with Deep Neural Networks, Ignacio Heredia, Proceedings of the Computing Frontiers Conference (2017), 259-262; arXiv:1706.03736

8. References

[iNaturalistPol] Image copyright policy from iNaturalist: https://www.inaturalist.org/pages/help#photo-use

[PlantNetPol] Image copyright policy from PlantNet: https://identify.plantnet-project.org/api/about/terms
