D5.1: EUBra-BIGSEA Software Architecture

www.eubra-bigsea.eu | @bigsea_eubr

Author(s): Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)
Status: Final
Version: V1.0
Date: 05.10.2016
Dissemination Level:
[X] PU: Public
[ ] PP: Restricted to other programme participants (including the Commission)
[ ] RE: Restricted to a group specified by the consortium (including the Commission)
[ ] CO: Confidential, only for members of the consortium (including the Commission)

Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT.

The purpose of this report is the design of the software architecture of the EUBra-BIGSEA platform. The document describes the overall functioning and interactions between the platform components, and serves as a development roadmap for the developers of the project.

EUBra-BIGSEA is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116.

This project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI).




Document identifier: EUBRABIGSEA–WP5-D5.1
Deliverable lead: BSC
Related work package: WP5
Author(s): Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)
Contributor(s): Ignacio Blanquer (UPV), Dorgival Guedes (UFMG), Sandro Fiore (CMCC), Danilo Ardagna (POLIMI)
Due date: 30/09/2016
Actual submission date: 05/10/2016
Reviewed by: Germán Moltó (UPV), Dorgival Guedes (UFMG)
Approved by: PMB
Start date of Project: 01/01/2016
Duration: 24 months
Keywords: Big Data, programming models, architecture design, analytics

Versioning and contribution history

Version | Date | Authors | Notes
0.1 | 31/08/2016 | Daniele Lezzi (BSC) | Table of Contents
0.2 | | Daniele Lezzi (BSC) | Sections about COMPSs
0.3 | 15/09/2016 | Walter Santos (UFMG) | Section about Lemonade
0.4 | 21/09/2016 | Daniele Lezzi (BSC) | Requirements and technology evaluation sections
0.5 | 21/09/2016 | Nuno Antunes (UC) | Security section
0.6 | 26/09/2016 | Daniele Lezzi (BSC) | General edits
0.7 | 27/09/2016 | Daniele Lezzi (BSC) | Edits and formatting
0.8 | 29/09/2016 | Danilo Ardagna (POLIMI), Dorgival Guedes (UFMG) | Revisions
0.9 | 30/09/2016 | Ignacio Blanquer (UPV) | Revision
1.0 | 03/10/2016 | Daniele Lezzi | Final version reviewed

Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0.

Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of merchantability and fitness for a particular purpose. Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any information advice or inaccuracy or omission herein.


TABLE OF CONTENTS

EXECUTIVE SUMMARY
1 Introduction
  1.1 Scope of the Document
  1.2 Target Audience
  1.3 Structure
2 EUBra-BIGSEA Architectural Overview
3 Programming Abstractions Layer Requirements
  3.1 Use Case and technical requirements
  3.2 Types of users
4 Programming Abstractions Layer Design
  4.1 Architecture Design
    4.1.1 Programming frameworks
    4.1.2 Support to QoS specification
    4.1.3 Code generation and composition
  4.2 Application lifecycle
  4.3 Security aspects
  4.4 Technology Analysis
5 Conclusions
6 References
7 GLOSSARY

LIST OF TABLES

Table 1 - QoS constraints specification
Table 1 - Lemonade supported operations
Table 2 - Requirements and technologies

LIST OF FIGURES

Figure 1 - High-level view of the EUBra-BIGSEA architecture
Figure 2 - Detailed view of the software architecture
Figure 3 - Architecture of the Abstraction Layer
Figure 4 - Detailed view of WP5 components
Figure 5 - JSON specification of the QoS
Figure 6 - Lemonade components and supported frameworks (in progress)
Figure 7 - Lemonade Citron user interface
Figure 8 - Application lifecycle diagram
Figure 9 - Relation between WP5 and WP6 (adapted from D6.1)


EXECUTIVE SUMMARY

The EUBra-BIGSEA project aims at developing a set of cloud services empowering Big Data analytics to ease the development of massive data processing applications. EUBra-BIGSEA will develop models, predictive and reactive cloud infrastructure QoS techniques, efficient and scalable Big Data operators and a privacy and quality analysis framework, exposed to several programming environments. EUBra-BIGSEA aims at covering general requirements of multiple application areas, although it will be showcased in the treatment of massive connected society information, and particularly in route recommendation.

The abstractions layer provides functionalities that allow applications composed of data operators mapped to different Big Data frameworks to be built transparently. In addition, the integration with the infrastructure layer makes applications scale effectively across the infrastructure, and provides developers with appropriate abstractions to specify QoS constraints and with a unified programming interface that includes computing, data analytics, and security APIs.

The programming layer will be based on COMPSs and Spark, which provide complementary capabilities to satisfy the use case requirements. Spark is an open source data processing framework. COMPSs is a programming framework that aims to facilitate the parallelisation of existing applications through a simple programming model based on sequential development. The COMPSs runtime is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources; it also provides scalability and elasticity features that allow the dynamic provision of resources. The COMPSs programming interface will be enhanced in the project through the integration of QoS constraints and security hints; the COMPSs runtime will be extended with support for Mesos [R04] in order to benefit from the proactive elasticity of the EUBra-BIGSEA infrastructure.

The final goal is to have an integrated layer that provides building blocks, developed with COMPSs and Spark, that can be imported into user applications.


1 INTRODUCTION

1.1 Scope of the Document

This document summarizes the architectural design to be implemented in the course of the EUBra-BIGSEA project. We highlight the interrelations among the different work packages involved in the decisions adopted, and also outline the reasoning behind the choices made.

This document is intended for general reference but mostly focuses on the design of the Programming Model Abstraction Layer and its integration with the other layers for applications (WP7), Big Data Ecosystem (WP4), QoS infrastructure (WP3) and security (WP6), whose architectures have been detailed in D7.1, D4.1, D3.1 and D6.1 respectively.

1.2 Target Audience

The document is mainly intended for internal use, although it is publicly released. The main target of this document is the global team of technical experts of EUBra-BIGSEA, including WP3, WP4, WP5 and WP6.

1.3 Structure

The rest of the document is structured as follows. Section 2 contains a high-level summary of the BIGSEA architecture. Section 3 analyses the use case requirements that are also related to the programming frameworks and proposes a list of requirements specific to the design of the abstractions layer. Section 4 addresses the main objective of the document, the design of the software architecture, with an analysis of the components selected to implement the layer. Section 5 concludes the document, providing a timeline for the remaining implementation activities.


2 EUBRA-BIGSEA ARCHITECTURAL OVERVIEW

The EUBra-BIGSEA general architecture, as described in deliverable D3.1, comprises four main blocks:

● QoS Cloud Infrastructure services, which integrate the modelling of the workload, the monitoring of the resources, the implementation of vertical and horizontal elasticity and the contextualization.

● Big Data Analytics services, which provide operators to process huge datasets and which can be integrated in the programming models. Analytics services are characterized in the QoS cloud infrastructure models of the underlying layer, which will automatically (or explicitly driven by the analytics services) adjust resources to the expected workload, taking its specificities into account.

● Programming Models, which provide a higher-level programmatic framework and are also characterized by the models of the infrastructure. The programming models will ease the parallelization of the applications developed on top of them.

● Privacy and Security framework, which provides the means to annotate data and processing and ensures the proper protection of privacy and security.

On top of those four blocks, applications are developed using the programming models and the data analytics extensions. Application developers are expected to use the programming models and may use other features of underlying layers, such as the user-level QoS metrics.

Figure 1 shows the high-level view of the EUBra-BIGSEA architecture, depicting the interactions among the main blocks.

Figure 1 - High-level view of the EUBra-BIGSEA architecture

Figure 2 highlights the separation between the infrastructure components for the management of the resources (described in D3.1) and the software components that are targeted in this document. In particular, WP5 focuses on how applications are composed using the abstractions provided by the programming models and how those applications are deployed, benefitting from the high-availability and reliability features of the infrastructure.


Figure 2 - Detailed view of the software architecture

As shown in Figure 2, resources of the infrastructure are managed by a Cloud Management Framework (OpenNebula or OpenStack) that deploys or undeploys Virtual Machines (VMs) when requested. These VMs build up the Mesos cluster, providing the agents on demand. The cluster is configured by the Infrastructure Manager (IM), and the elasticity at the level of the resources is managed through Elastic Computing Clusters in the Cloud (EC3), which monitors the Mesos cluster to detect the need for resources and the opportunity to power them off. The Mesos cluster is accessed through a scheduler.
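The resource-level elasticity described in this paragraph can be pictured as a simple control decision. The Python sketch below is only an illustration under stated assumptions: the function name, its parameters and the scaling rule (one agent per pending task, idle agents powered off when nothing is pending) are hypothetical and are not taken from EC3, the IM or Mesos.

```python
def elasticity_decision(pending_tasks, idle_agents, total_agents,
                        min_agents=1, max_agents=10):
    """Return how many agents to add (> 0), power off (< 0), or 0.

    Hypothetical rule, for illustration only:
    - unmet demand  -> request more VMs, bounded by max_agents;
    - no demand     -> power off idle agents, keeping min_agents alive.
    """
    if pending_tasks > idle_agents:
        # More pending work than free agents: grow the cluster.
        wanted = pending_tasks - idle_agents
        return max(0, min(wanted, max_agents - total_agents))
    if pending_tasks == 0 and idle_agents > 0:
        # Nothing to run: shrink the cluster, but keep a minimum pool.
        removable = min(idle_agents, total_agents - min_agents)
        return -max(0, removable)
    # Demand is covered by the agents already available.
    return 0
```

For example, with 5 pending tasks, 2 idle agents and 4 agents in total, the rule asks for 3 more agents; with no pending tasks and 3 idle agents out of 4, it powers off the 3 idle ones.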

On the logical components side, users write their applications using the Lemonade IDE, which transforms them into Spark and COMPSs code. Users can also write their programs directly in Spark or COMPSs, or port existing applications written in Java or Python. Proactive policies give an estimation of the resources needed by the execution; this estimation is readjusted by the Monitoring system, which monitors the QoS compliance.


3 PROGRAMMING ABSTRACTIONS LAYER REQUIREMENTS

3.1 Use Case and technical requirements

This section provides a summary of the requirements, described in D7.1, relevant for the definition of the abstraction layer.

● RE.1. Batch jobs. The infrastructure must support unrestricted batch execution of data analytic jobs. "Unrestricted" means batch jobs with no QoS bounds, where latency is not a key issue. Single jobs will be those that normally could fit in memory.

● RE.2. Bag of tasks. The infrastructure must support unrestricted batch execution of a bag of data analytics jobs. A bag of jobs is a model that fits the High-Throughput Computing (HTC) paradigm.

● RE.3. QoS Batch jobs. The infrastructure must execute batch jobs with associated QoS. Executions should be characterized in terms of time and could also require a bounded budget, expressed as the maximum resource time to be spent. The scheduler should adjust the resources to meet the expected QoS.

● RE.4. Deadline-based jobs. The execution service of the infrastructure should provide deadline-based execution requests, which will have to finish at a given time, and are characterized in terms of resources and expected execution time. If the deadline is not feasible at submission time, the service will notify the user and run the job as soon as resources are available. Otherwise, it will schedule the execution for the future. If the execution takes place closer to the deadline, the data will be more up-to-date.

● RE.5. Self-adapting elasticity. The algorithms will be described in a way that the infrastructure can dedicate more resources to fulfill the QoS. The infrastructure must be reactive in both allocated computing resources and allocated memory. The infrastructure must be self-adapting in order to accommodate the workload peaks that can appear in HTC applications.

● RE.6. Short jobs. The infrastructure must support the execution of short jobs, finishing in interactive time, which could arrive massively (hundreds per minute).

● RE.7. Workflows management. The infrastructure must support the execution of big data workflows, where the input data and the products can be large (in the order of tens of GBs).

● RA.1. Authentication. The infrastructure must support end-user authentication for access control and accounting purposes.

● RA.2. Authorization. The infrastructure must support end-user authorization for accessing the data and the applications deployed with the infrastructure.

● R1.5. Data Access API. An API must be exposed to deal with the storage resources: to authenticate, populate data, retrieve and filter data, and update data. The same operations must be available for metadata. Data access should have a short latency (near real-time access).
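The behaviour requested by RE.4 can be sketched as a small planning function. This is a minimal illustration, not the project's scheduler: the `JobRequest` type, its field names and `plan_deadline_job` are hypothetical, times are plain seconds on an arbitrary clock, and resource availability is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    expected_runtime_s: float  # estimated execution time of the job
    deadline_s: float          # absolute time by which the job must finish


def plan_deadline_job(job, now_s):
    """Return (start_time_s, feasible) for a deadline-based request.

    - Infeasible deadline: the job cannot finish in time even if it
      starts now, so it is flagged (the user would be notified) and
      started as soon as possible.
    - Feasible deadline: the job is scheduled as late as possible, so
      that it processes the most up-to-date data and still finishes
      on time.
    """
    latest_start = job.deadline_s - job.expected_runtime_s
    if latest_start < now_s:
        return now_s, False
    return latest_start, True
```

A job with an expected runtime of 100 s and a deadline at t = 500 s, submitted at t = 0, would be planned to start at t = 400 s; the same job with a deadline at t = 50 s would be flagged as infeasible and started immediately.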

The requirements analysis has already been performed in other technical WPs from different points of view, leading to technical choices related to the definition of the QoS infrastructure, of the Big Data ecosystem and of the security strategy. The findings of those activities are relevant for the definition and implementation of the programming abstraction layer, and this document analyses the technological choices described in those deliverables, focusing on how the WP5 components have to be selected and extended to be fully integrated in the BIGSEA platform.


In particular, the D3.1 document has the objective of identifying the services of the QoS cloud architecture for the Big Data analytics platform developed in EUBra-BIGSEA. Mesos has been selected for the management of distributed resources, whose availability will be monitored and elastically managed according to the QoS parameters defined in the applications.

D4.1 describes the big data systems integrated to address multifaceted use case requirements, including fast data analysis over continuous streams from external data sources, general purpose data mining and machine learning tools as well as OLAP-based systems for multidimensional data analysis. A relevant outcome of the document is the proposal of a data access API (to address R1.5) that can be used directly by the applications or through the tools developed in WP5.

D6.1 addresses security requirements related to the application development tools. In particular, the main concern is the possibility to define privacy annotations in the programming model interface and to support authentication, authorization and accounting mechanisms.

Based on this, the Abstraction Layer (AL) has the following requirements:

RAL1. Support to QoS batch jobs. The programming framework runtime must provide support for the execution of batch jobs with possible QoS constraints such as time and number of resources.

RAL2. Integration with Mesos. The programming framework runtime has to be able to schedule tasks to the Mesos middleware.

RAL3. Support to reactive elasticity. The runtime must be aware of the changes in the QoS infrastructure in order to adapt the scheduling policies.

RAL4. Support to HDFS data locations. The applications should be able to read and write data in HDFS backends. This could imply extending the runtime data manager.

RAL5. Definition of QoS constraints in the programming interface. A central topic of the project: the ability to define QoS parameters both at application definition and at execution time, depending on the type of metric.

RAL6. Support to data privacy in the definition of the algorithms. Include ways for the programmer to express the privacy characteristics of the developed algorithms.

RAL7. Definition of Big Data workflows. This maps directly to RE.7; as explained later, the idea of the Abstractions Layer is to provide building blocks for the use cases that can be composed as workflows.
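To make RAL1 and RAL5 more concrete, the sketch below shows one way QoS constraints could be attached to a building block at definition time and overridden at execution time. It is an assumption-laden illustration: the `qos` decorator, `effective_qos`, the constraint names `deadline_s` and `max_resources`, and the `cluster_trajectories` function are all hypothetical and do not reproduce the COMPSs interface or the project's QoS specification format.

```python
import json


def qos(**constraints):
    """Hypothetical decorator: attach QoS constraints at definition time."""
    def decorate(func):
        func.qos = dict(constraints)
        return func
    return decorate


def effective_qos(func, **overrides):
    """Merge definition-time constraints with execution-time overrides."""
    merged = dict(getattr(func, "qos", {}))
    merged.update(overrides)
    return merged


@qos(deadline_s=600, max_resources=8)
def cluster_trajectories(points):
    # Placeholder body standing in for a real analytics building block.
    return sorted(points)


# At execution time the deadline is tightened; max_resources is inherited.
spec = effective_qos(cluster_trajectories, deadline_s=300)
# Serialized form that could be handed to the infrastructure layer.
spec_json = json.dumps(spec, sort_keys=True)
```

Keeping the constraints as plain data next to the function leaves the function body untouched, which matches the idea that QoS annotations are optional hints rather than part of the algorithm.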

3.2 Types of users

Different classes of users could use the tools provided by the Programming Abstraction Layer. The following classes of users have been identified, given the type of requirement and privilege:

● Developers, who use the Programming Abstraction Layer to develop end-user applications, to test the underlying system and to perform the execution of ETL, data mining and analytics processing. Developers may also define restrictions regarding QoS and AAA.

● Domain experts, who use higher levels of abstraction (e.g. workflows) to compose processing tasks, generate new machine learning models and assess AAA policies.


● Students and practitioners learning about distributed algorithm processing, machine learning and data science.

4 PROGRAMMING ABSTRACTIONS LAYER DESIGN

This section provides details on the design of the programming layer of the EUBra-BIGSEA Platform. The description of each component, its role in the platform and the relations with other WPs are presented.

4.1 Architecture Design

Figure 3 depicts the architectural diagram of the programming abstraction layer. This layer provides the functionalities needed to satisfy the requirements for the implementation of the application scenarios on top of the Big Data layer.

Figure 3 - Architecture of the Abstraction Layer

The programming frameworks enable the implementation of the use cases by providing modules and libraries (building blocks in the figure) that abstract the big data technologies used to access and process the data sources, and by optimizing their execution on the QoS infrastructure.

Abstractions for specifying QoS constraints (e.g., job execution deadlines, minimum throughput rate to the storage sub-system) will be integrated with the programming model and will be translated into resource management policies (see 4.1.2).

There is a strong integration with the tools provided in WP3 in order to make use of the execution and deployment services and to adapt the runtimes to the changes in the available resources according to the QoS policies.

According to the design of the Big Data Ecosystem described in D4.1, a minimum set of modules to be implemented should provide support to the three use cases and related scenarios:


• Data access and loading: entity matching, data quality analysis, data ingestion. Applications that periodically read sources of data and execute a potentially parallel algorithm that produces a big output of data plus some indicators. Here QoS is critical to ensure that the data is obtained on time. Other applications are basically related to real-time data inspection, which will need a scalable persistent service that supports client requests. For this use case, data parallel programming models that support streaming will be adopted.

• Descriptive models: long-lasting execution tasks that run in parallel and train models with the new data. They run periodically with a deadline and produce models whose output may be classified as more sensitive to privacy protection than the input. In this case, different options for clustering algorithms should be provided by the abstraction layer.

• Predictive models: a continuously running, service-based application with scalability capabilities, running jobs that produce the result of the prediction. The scenarios included in this use case are more compute-bound than data-bound. Task parallel approaches are well suited to implement this use case, even though user stories include running near-real-time predictions.

In order to ease the composition of the programmed modules, a tool for the generation of code will be introduced and extended to support the programming frameworks.

The following sections provide a detailed description of the components that implement each functionality.

4.1.1 Programming frameworks

One of the main ambitions of the EUBra-BIGSEA project is to offer a programming layer that will make applications scale effectively across the infrastructure, also providing developers with appropriate abstractions to specify QoS constraints and with a unified interface that includes computing, data analytics, and security APIs.

The base of this layer is the COMPSs [R01] framework, which provides a simple programming model based on sequential development and a runtime system in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources. In order to guarantee interoperability with existing applications, the Spark programming ecosystem is also included. D3.1 and D4.1 have already analysed the use of Spark at the level of execution services and big data ecosystem. This report addresses the integration of Spark at the level of the programming interface, also including issues related to the definition of data locations in HDFS used in COMPSs.

Figure 4 depicts the detailed architecture of the programming frameworks. COMPSs is used to implement a set of high-level functionalities that could be workflows of Ophidia [R03] operators or modules implementing operations on Big Data backends to address fast data analysis over continuous streams from external data sources, general purpose data mining and machine learning tools as well as OLAP-based systems for multidimensional data analysis. At the level of WP7, each of these functionalities is a building block for the implementation of complex use cases. The implementation of each block is transparent to the user, who only has to import the module in the code, optionally providing constraints on the execution of that module through QoS annotations.


Figure 4 - Detailed view of WP5 components

4.1.1.1 Technologies evaluation

Identification: COMPSs

Type: Programming framework

License: Apache v.2

Current version: 1.4

Website: http://compss.bsc.es

Purpose: Programming model which aims to ease the development of applications for distributed infrastructures, such as Clusters, Grids and Clouds. COMP superscalar also features a runtime system that exploits the inherent parallelism of applications at execution time.

High level architecture: The following figure depicts the architecture of COMPSs.


[Figure: COMPSs architecture. Labels: Java App, Python App, C/C++ App; Javassist, Python Binding, C/C++ Binding, Binding-commons; COMPSs runtime; task; Grid, Cluster, Cloud, Containers.]

The COMPSs runtime is implemented using the Java language, so the most natural programming language for new COMPSs applications is Java. Nevertheless, to simplify the porting of existing applications written in other languages, COMPSs also supports C/C++ and Python applications.

A central concept in COMPSs is that of a task, which represents the model's unit of parallelism. A task is a method or a service called from the application code that is intended to be spawned asynchronously and possibly run in parallel with other tasks on a set of resources, instead of locally and sequentially. In the model, the user is mainly responsible for identifying and selecting which methods/services she wants to be tasks.

When the sequential code is executed, the COMPSs runtime intercepts the method invocations and replaces them with calls to the runtime that create new asynchronous tasks. Accesses to task data within the main code are also instrumented, so that the runtime can fetch the correct data values if necessary from the remote resource where the task was generated (synchronization).

This task selection is done by means of an annotated interface where all the methods that have to be considered as tasks are defined with annotations describing their data accesses and constraints on the execution resources. At execution time this information is used by the runtime to build a dependency graph and orchestrate the tasks on the available resources.

Dependencies: COMPSs dependencies are resolved at installation time.

Interfaces and language support: COMPSs does not provide any specific API for the development of applications. Supported languages are Java, Python and C/C++.

Security support: The interoperability with different backends is implemented through connectors. In this way, specific security policies can be configured on the connector.

Data: Primitive types (integer, long, float, boolean), strings, objects (instances of user-defined classes, dictionaries, lists, tuples, complex numbers) and files are supported in the definition of a task.

Needed improvement: Support to Mesos and reconfiguration of the resources according to proactive policies.

Identification: Apache Spark

Type: Framework for processing big data

License: Apache License, Version 2.0.

Current Version

2.0.0 - Aug/2016

Website: http://spark.apache.org

Documentation: http://spark.apache.org/docs/latest/

Download/Source code: http://spark.apache.org/downloads.html

Purpose: To provide a functional programming abstraction for implementing ETL and machine learning algorithms. Spark implements many data processing operations and supports different programming paradigms:

● Functional programming using Scala, Python or Java;
● Declarative programming using the SQL language, compatible with the SQL:2003 specification.

High Level Architecture

In the latest version, 2.0.0, Spark supports three different views of data: RDDs (Resilient Distributed Datasets), DataFrames and Datasets. All structures are stored in memory if possible; otherwise they may be written to disk.

RDDs have been in Spark since version 1.0. They provide a set of transformation methods, such as map() and filter(), for processing data. Each transformation creates a new RDD representing the transformed data. Operations on RDDs are executed lazily; transformations are not performed until an action method, for example reduce(), collect() or count(), is called.
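The lazy execution model described above can be sketched in plain Python. The class below is not the Spark API; `LazyDataset` is a hypothetical stand-in that only records transformations and applies them when an action is called, which is the behaviour attributed to RDDs in this section.

```python
class LazyDataset:
    """Tiny stand-in for a lazily evaluated data set (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any iterable holding the input records
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: return a new data set and only record the step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _evaluate(self):
        records = iter(self._source)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    # Actions: trigger the evaluation of all recorded steps.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []


def double(x):
    calls.append(x)  # record when the function really runs
    return 2 * x


numbers = LazyDataset(range(5))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
print(len(calls))               # 0: no transformation has run yet
print(evens_doubled.collect())  # [0, 4, 8]
print(len(calls))               # 3: the action triggered the work
```

Each transformation returns a new object and leaves the original untouched, mirroring the statement that each transformation creates a new RDD.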

The DataFrame API was introduced in version 1.3.0 to improve Spark performance. A new concept of schema was introduced to describe the data, enabling much more expressive code to be built using efficient network communication and off-heap memory JVM optimization. Catalyst, Spark's query optimizer, was built on top of the DataFrame API and now allows users to write SQL:2003-compatible queries or use a fluent-style API to process data.

The latest API, Dataset, was introduced in Spark 1.6.0 and aims to provide the best of the RDD and DataFrame worlds: the familiar object-oriented programming style with compile-time type checking (present in RDDs) and optimization capabilities (present in DataFrames). The Dataset API uses a specialized Encoder to serialize objects for processing or transmission over the network. Conceptually, a DataFrame is an alias for a collection of generic objects of type Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class defined in Scala or by a class in Java.

[Image source: http://spark.apache.org/docs/latest/img/cluster-overview.png]

Under the hood, Spark is an in-memory engine for large-scale data processing. Apache Spark is a fast, general-purpose cluster computing system with an optimized engine that supports general execution graphs.

As shown in the previous image, a Spark program is controlled by a driver program started by the user, which interacts with a cluster manager to start worker nodes where data processing tasks are executed. In a standalone cluster deployment, the cluster manager is a Spark master instance. When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Similarly, when using YARN, the YARN scheduler takes that role.

Spark can be used for batch jobs through spark-submit, which can be used to execute application binaries remotely. There are also spark-shell, an interactive Scala console, and PySpark, a Python shell. With these, one can write data analytics operations and execute them interactively on a remote system.

Dependencies: A bare-metal, YARN or Mesos cluster. The use of an HDFS file server architecture is optional.

Interfaces and languages supported

It provides high-level APIs in Java, Scala, Python and R.

Security support

Spark currently supports authentication via a shared secret. Spark supports SSL for the Akka and HTTP (for broadcast and file server) protocols. SASL encryption is supported for the block transfer service. Encryption is not yet supported for the Web UI. Encryption is not yet supported for data stored by Spark in temporary local storage, such as shuffle files, cached data, and other application files. If encrypting this data is desired, a workaround is to configure the cluster manager to store application data on encrypted disks.

Data: Data stored in file systems (e.g., local, NFS, HDFS). There are many other connectors that allow Spark to read/write data from/to other data sources and storage systems (e.g., cloud blob stores like Amazon S3 and Microsoft xxx).

Potential usage within BIGSEA

Spark is one of the supported programming models in EUBra-BIGSEA. It provides a library, called ML, that supports the execution of different machine learning techniques (e.g. linear regression, classification, clustering) in a distributed way, using the programming abstractions and infrastructure of Spark.

4.1.2 Support to QoS specification

QoS plays a pivotal role in EUBra-BIGSEA, whose main goal is to provide solutions for the optimal delivery and runtime management of big data applications. Since applications in private or public clouds share the same infrastructure, their demand for resources may create contention that reduces the final QoS perceived by the users. As discussed in Section 3.1, EUBra-BIGSEA will support the execution of three main classes of QoS-based jobs:

• QoS batch jobs, characterized by a total resource budget for application execution in terms of number of cores/containers/memory.
• Deadline-based jobs, characterized by a maximum execution time/deadline.
• Short jobs, possibly executed by streaming systems (e.g., for tweets analysis), characterized not only by a deadline but also by the sustained throughput that must be guaranteed.

A constraint is completely defined by the fields reported in Table 1.

Field | Description | Mandatory | Example
Name | Constraint unique identifier | Yes | my_constraint_name
Target application | The application to which the constraint is applied | Yes | my_job_name
Target metric | Metric the constraint predicates on. Multiple target metrics can be provided | Yes | CPU or container number (cpu); memory size (memory); application makespan (application_execution_time); number of successful runs per time unit (application_throughput)
Metric value | Value and inequality that must be fulfilled (one for each target metric) | Yes | <= 10 (for CPU or container number); <= 10 GB (for memory); <= 10 min (application execution time); >= 4 completions/min (application throughput)
Priority | A value defining whether a constraint cannot be violated (hard constraint, characterized by value 0) or can be violated (soft constraint; in this case the value provides a ranking among constraints, the higher the better). One priority field has to be specified for each target metric | Yes | Integer value

Table 1 - QoS constraints specification

The set of metrics that will be initially considered are the number of CPUs/containers that support the application execution, the total memory allocated in the infrastructure, application execution time and throughput. Constraints can predicate on multiple metrics (e.g., short jobs are characterised by a deadline but also by a minimum throughput). Constraints will be specified as JSON files and stored in the Mesos master. The information is coded within the application JSON description, as shown in Figure 5.

    {
      "type": "CMD",
      "name": "my_job_name",
      "periodic": "R24P60",
      "QoS": [
        { "metric": "deadline", "op": "==", "value": "2016-06-10T17:22:00Z+2", "priority": 0 },
        { "metric": "cpu", "op": "<=", "value": 10, "priority": 2 },
        { "metric": "memory", "op": "<=", "value": "10G", "priority": 1 },
        { "metric": "application_execution_time", "op": "<=", "value": "10M", "priority": 1 },
        { "metric": "application_throughput", "op": ">=", "value": "24d", "priority": 3 }
      ],
      "command": "mycommand"
    }

Figure 5 - JSON specification of the QoS
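As an illustration of how a consumer of this description might read the constraints, the following sketch parses a job description shaped like Figure 5, checks that each constraint carries the four fields used there, and separates hard constraints (priority 0) from soft ones. The function names, the set of accepted operators and the reading of "the higher the better" as "higher value ranks first" are assumptions made for this example, not part of the platform specification.

```python
import json

# Operators that appear in the example; the full set accepted by the
# platform is not specified in this document (assumption).
ALLOWED_OPS = {"==", "<=", ">="}
REQUIRED_FIELDS = ("metric", "op", "value", "priority")


def load_constraints(job_json):
    """Parse a job description and return its list of QoS constraints."""
    job = json.loads(job_json)
    constraints = job.get("QoS", [])
    for c in constraints:
        missing = [f for f in REQUIRED_FIELDS if f not in c]
        if missing:
            raise ValueError(f"constraint {c} is missing fields: {missing}")
        if c["op"] not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {c['op']}")
        if not isinstance(c["priority"], int) or c["priority"] < 0:
            raise ValueError(f"priority must be a non-negative integer: {c}")
    return constraints


def split_by_priority(constraints):
    """Separate hard constraints (priority 0) from soft ones.

    Soft constraints are ranked by priority, highest value first
    (assumed reading of the ranking described in Table 1).
    """
    hard = [c for c in constraints if c["priority"] == 0]
    soft = sorted((c for c in constraints if c["priority"] > 0),
                  key=lambda c: c["priority"], reverse=True)
    return hard, soft


EXAMPLE_JOB = """
{
  "type": "CMD",
  "name": "my_job_name",
  "periodic": "R24P60",
  "QoS": [
    {"metric": "deadline", "op": "==", "value": "2016-06-10T17:22:00Z+2", "priority": 0},
    {"metric": "cpu", "op": "<=", "value": 10, "priority": 2},
    {"metric": "memory", "op": "<=", "value": "10G", "priority": 1},
    {"metric": "application_execution_time", "op": "<=", "value": "10M", "priority": 1},
    {"metric": "application_throughput", "op": ">=", "value": "24d", "priority": 3}
  ],
  "command": "mycommand"
}
"""

hard, soft = split_by_priority(load_constraints(EXAMPLE_JOB))
print([c["metric"] for c in hard])  # ['deadline']
print([c["metric"] for c in soft])
```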

QoS constraints will be specified through Lemonade (see Section 4.1.3.1).

Moreover, since EUBra-BIGSEA envisions the definition and runtime support of applications encompassing multiple runtime environments (e.g., applications including part of the data analysis workflow implemented in Spark and part in COMPSs, possibly accessing Ophidia operators), task T5.3 will develop solutions to optimally split a global application constraint (i.e., a constraint predicating, e.g., on the whole application execution time) into local constraints that have to be enacted on the underlying runtime environments. In this way the definition of WP3 proactive runtime management policies can be simplified by specifying sets of adaptation rules predicating on metrics and actuating mechanisms provided by the individual runtime frameworks that support the application execution. According to the EUBra-BIGSEA DoW, this latter activity will start at M13.

4.1.3 Code generation and composition

4.1.3.1 Lemonade

Lemonade (Live Environment for Mining Of Non-trivial Amount of Data Everywhere) is a web application tool in which users can drag and drop operations and data sources to compose different ETL and machine learning workflows. Lemonade targets users who do not want to learn a programming language or who need to develop workflows using the existing toolset. In terms of audience, Lemonade fits well users from areas such as Mathematics, Statistics and Business Administration, and those learning about Data Science. Lemonade components are shown in Figure 6.

Figure 6 - Lemonade components and supported frameworks (in progress)


The first component, Citron (Figure 7), is a web-based user interface to create workflows. Users can choose among a set of predefined operations, which will compose the workflow, by dragging and dropping them into the design area. Input data is specified by choosing one or more data sets from the toolbox. Workflows are stored in Citron's relational database and, when a user triggers the execution of a workflow, a JSON file describing it is generated and sent to Juicer for processing.

Figure 7 - Lemonade Citron user interface

Only the data sets accessible by each logged-in user are made available through his/her interface. Operations are the smallest unit of processing and represent a coarse-grained task executed on one of the supported backends. Currently, Lemonade supports ETL and some machine learning operations, as listed in Table 2.

Operation | Purpose
Add columns | Adds columns from one data source to another
Aggregation | Performs aggregation of data grouped by a set of fields
Apply math | Apply math
Classification model | Trains and applies a classification model
Clean missing | Cleans or replaces missing values from fields
Clustering model | Trains and applies a clustering model
Comment | Comment
Correlation | Identifies correlations between records
Data reader | Reads data from a data set
Data writer | Writes a new data set
Filter (selection) | Filters data according to some criteria
Join | Joins two data sets using a set of fields (keys)
K-Means Clustering | Uses the K-Means algorithm for clustering
Linear regression | Applies a linear regression algorithm
Logistic regression | Performs logistic regression
Naive Bayes Classifier | Uses a Naive Bayes classifier
Outlier detection | Performs outlier detection
Projection/Select columns | Selects a subset of the fields from a data set
Publish as a visualization | Publishes result as a visualization
Publish as web service | Publishes a workflow as a web service
Sample | Generates a sample of data
Score model | Scores a machine learning model
Set intersection | Performs set intersection
Sort | Sorts data from a data set according to a set of fields and directions
Split | Splits a data set into 2 different data sets using weights
SVM Classification | Uses an SVM classifier
Time series | Time series
Topic discovery | Performs topic discovery in text
Transformation | Performs a data transformation
Union/set union | Performs set union

Table 2 - Lemonade supported operations

New operations can be implemented if the underlying processing framework supports them.

The second component is called Tahiti and is responsible for keeping all the operations' metadata needed to run the workflows. Metadata include operation name, description, parameters and ports. Ports are communication points that have a direction (input or output), a multiplicity (how many connections are supported) and should "implement" interfaces in order to guarantee compatibility between operations. For example, if an operation has only one output port that implements the interface "Algorithm", it can only connect to an input port that implements the same interface; it is not possible to connect it to an operation whose input port only implements the interface "Data".

Each operation has a set of parameters grouped as forms. Forms are organized in three classes: execution parameters, AAA parameters and QoS parameters. Execution parameters allow users to configure the algorithms' run-time arguments and behaviour. AAA parameters are related to security and privacy aspects and will be aligned with WP6 guidelines. QoS parameters define infrastructure requirements to execute the workflow and are related to WP3.
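The port compatibility rule described for Tahiti can be sketched as a small check. The `Port` class and `can_connect` function are hypothetical and not taken from the Lemonade code base; the sketch assumes that two ports are compatible when the source is an output, the target is an input with spare multiplicity, and both implement at least one common interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A communication point of an operation (hypothetical model)."""
    name: str
    direction: str         # "input" or "output"
    interfaces: frozenset  # names of the interfaces the port implements
    multiplicity: int = 1  # how many connections the port supports


def can_connect(source: Port, target: Port, used: int = 0) -> bool:
    """Return True if `source` may be linked to `target`.

    `used` is the number of connections already attached to `target`.
    """
    if source.direction != "output" or target.direction != "input":
        return False
    if used >= target.multiplicity:
        return False
    # Assumption: sharing at least one interface is enough.
    return bool(source.interfaces & target.interfaces)


model_out = Port("model", "output", frozenset({"Algorithm"}))
algo_in = Port("algorithm", "input", frozenset({"Algorithm"}))
data_in = Port("input data", "input", frozenset({"Data"}))

print(can_connect(model_out, algo_in))  # True: both implement "Algorithm"
print(can_connect(model_out, data_in))  # False: no common interface
```

A user interface could run such a check while the user drags a connection, rejecting incompatible links before the workflow is saved.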


The third component, Limonero, is similar to Tahiti, but instead of keeping metadata about operations, it keeps metadata about data sources. Data sources can be input to workflows and can also be created by them as output. Data source metadata includes:

● Location: where data are located and in which storage technology (for instance, HDFS).
● Data format and structure: for example, if the data are in JSON format, what the columns and their data types are, whether any given column is optional, and whether it is a feature or a label.
● Access restrictions: ownership of data sets, authorization and privacy concerns.
● Statistics about the data: number of records, size in MB, and column-specific information such as total of missing records, min/max/average/median values, deciles distribution, etc.

Metadata are used by the web interface to enable or disable data visualisations and operations, according to data/visualisation and data/operation compatibility. For example, a pie chart would require at least two fields, one for the label and another for the value, and the value should be numeric. If the data set attributes do not match the visualisation requirements, the visualisation will not be available. In another example, a classification operation would be disabled in the interface if the input data set does not have a column specified as a label, because without a label the operation would not be able to learn how to classify the data.

Metadata enable Lemonade to load data in optimized formats. Instead of having to parse CSV or JSON files into records, Lemonade can load data in binary formats, such as Parquet [R02].

Finally, the last component, Juicer, has four main responsibilities:

1. Receive a workflow specification in JSON format from Citron and convert it into executable code.
2. Execute the generated code, controlling the execution flow.
3. Report execution status to the user interface (Citron).
4. Interact with the Limonero API in order to create new intermediate data sets. Such data sets cannot be used as input to other workflows, unless explicitly specified. They are used to enable Citron to show intermediate processed data to the user.

Under the hood, Lemonade will generate code targeting a distributed processing platform, such as COMPSs or Spark. The current version supports only Spark, and the generated code is executed in batch mode. Future versions may implement support for interactive execution. This kind of execution has advantages because keeping the Spark context loaded avoids the overhead of starting the processing environment and loading data for each step. This approach (keeping the context) is used in many implementations of data analytics notebooks, such as Jupyter, Cloudera Hue and Databricks notebooks.
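The conversion of a workflow description into executable code can be illustrated with a toy generator. The JSON layout (`tasks`, `flows`), the operation names and the code templates below are invented for this sketch, since the actual Citron/Juicer schema is not given in this document; plain Python lists stand in for the distributed data sets that a real backend such as Spark would provide.

```python
import json
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow description (not the real Citron/Juicer schema).
WORKFLOW = json.loads("""
{
  "tasks": [
    {"id": "read",   "operation": "data-reader", "params": {"values": [5, 3, 8, 1]}},
    {"id": "filter", "operation": "filter",      "params": {"min": 2}},
    {"id": "sort",   "operation": "sort",        "params": {}}
  ],
  "flows": [
    {"source": "read", "target": "filter"},
    {"source": "filter", "target": "sort"}
  ]
}
""")

# One code template per operation (illustrative only).
TEMPLATES = {
    "data-reader": "{out} = list({values})",
    "filter": "{out} = [x for x in {inp} if x >= {min}]",
    "sort": "{out} = sorted({inp})",
}


def generate_code(workflow):
    """Emit one line of code per task, in dependency order."""
    tasks = {t["id"]: t for t in workflow["tasks"]}
    parents = {tid: [] for tid in tasks}
    for flow in workflow["flows"]:
        parents[flow["target"]].append(flow["source"])
    lines = []
    # static_order() yields every task after all of its predecessors.
    for tid in TopologicalSorter(parents).static_order():
        task = tasks[tid]
        inp = f"out_{parents[tid][0]}" if parents[tid] else None
        lines.append(TEMPLATES[task["operation"]].format(
            out=f"out_{tid}", inp=inp, **task["params"]))
    return "\n".join(lines)


code = generate_code(WORKFLOW)
print(code)

# Run the generated code in an isolated namespace (batch-style execution).
namespace = {}
exec(code, namespace)
print(namespace["out_sort"])  # [3, 5, 8]
```

The sketch handles only single-input operations; operations such as joins would need templates with several input variables.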

4.2 Application lifecycle

In this section, a sequence diagram of the interactions between the components of the software architecture is presented. In Figure 8, a developer interacts with the Lemonade interface to compose an application using existing modules stored in the internal operation metadata database. As a result, the code of a COMPSs application is generated by Lemonade; the COMPSs runtime takes care of the deployment of the application on the BIGSEA QoS infrastructure and of the calls to the required Analytics Services.


Figure 8 - Application lifecycle diagram

4.3 Security aspects

The deliverable D6.1 (Requirements and Coordinated Security Strategy) defines the security scope of the project and proposes a global security solution to deal with the security objectives of the project: the provisioning of Authentication, Authorization and Accounting (AAA), the assurance of the security properties of the cloud and Big Data services, and the protection of data privacy.

Figure 9 presents a high-level view of the WP5 components together with the specific concerns that should be handled by WP6. Details on the represented roles, the colour code used and the corresponding components can be found in D6.1, while details on Spark, COMPSs and Lemonade can be found in Section 4.1.1.


Figure 9 - Relation between WP5 and WP6 (adapted from D6.1)

According to the analysis performed in D6.1, the security concerns related to the architecture to be proposed in WP5 together with WP4 and WP3 can be summarized in five main points, identified in red and green in Figure 9, which are aligned with the security concerns of the project. WP5 will propose programming abstractions that build on the underlying layers, and therefore the security concerns of WP5 are also integrally connected with those of these layers.

Regarding security, WP5 requires services for authentication, authorization and accounting for the infrastructure and the applications, and it is also important to assure the privacy and access control of the data for the operators that work with WP4. Another concern is the security of the API provided for application development, which should not allow developers to perform tasks that interfere with other running applications and should not allow malicious users to take advantage of its inputs to subvert the functionalities of the applications.

To address these objectives, requirements were defined in D6.1. Below we summarize the key requirements; details can be found in Section 4 of D6.1.

WP6 AAA corresponds to the AAA provisioning. It will be necessary to develop two distinct AAA blocks with distinct functionality, as follows:

1. EUBra-BIGSEA Infrastructure AAA, which provides the AAA functionalities required for managing the EUBra-BIGSEA framework (access to cloud resources), from both the Infrastructure and Platform perspectives (focusing on infrastructure managers and application developers/providers). The scope of this service is the whole EUBra-BIGSEA framework, as it matches the nature of the services focused on cloud infrastructure management.

2. EUBra-BIGSEA Applications AAAaaS, which provides AAA-as-a-Service for applications developed and hosted in the EUBra-BIGSEA framework and in need of services for authenticating and authorizing their end users. The scope of an AAAaaS instance is limited to the application making use of it, and AAAaaS directly matches the nature of the set of services focused on end users and enterprise/consumer applications.

WP6 Assurances corresponds to the security assurance requirements R6.3.1 and R6.3.2 (see D6.1), which can be detailed as follows:


1. The End Users of the applications that will run inside the EUBra-BIGSEA infrastructure should not be able to subvert, through the inputs of such applications, the functionalities implemented. Therefore, it is necessary to perform a detailed assessment of the APIs to be used in the development of applications.

2. The Data App Developers should not be able to develop applications that interfere with the remaining applications running inside the framework. For this, besides the assessment of the APIs made available, it will also be necessary to propose a set of recommendations for development best practices.

WP6 Privacy corresponds to the concerns regarding the security of the data, i.e. the protection of the confidentiality, privacy and anonymity of the data. It is necessary to include in the WP5 programming abstractions ways for the programmer to express the privacy characteristics of the developed algorithms. These abstractions will be enforced in the underlying layers, through mechanisms to be implemented in WP4. Lemonade (see 4.1.3.1) is the best candidate to let users express these preferences. In practice, a set of operators is to be developed to extend the Lemonade syntax in order to include information that allows the characterization of the algorithms according to the way their internals influence the data being processed in terms of privacy.


4.4 Technology Analysis

This section analyses how the previously described components satisfy the requirements to implement the use cases on the QoS BIGSEA platform.

Requirement | Description | COMPSs | Spark | Lemonade
RAL1. Support to QoS batch jobs | Support to QoS-bounded jobs | COMPSs supports different backends through connectors that implement specific functionalities; QoS provided by the user can be translated to resource constraints | QoS constraints will be supported through command line options | Provided by underlying technology. Will provide mechanisms to annotate QoS requirements for applications and their components.
RAL2. Integration with Mesos | The programming runtime must submit tasks to the Mesos middleware and to available frameworks such as YARN and Myriad. | COMPSs can easily be extended through connectors | Yes | Provided by underlying technology
RAL3. Support to reactive elasticity | The runtime must adapt its resources according to the changes in the QoS infrastructure | COMPSs adapts the usage of resources according to the computational load. Reconfiguration of the pool of resources will be evaluated in the project | Spark provides basic mechanisms for adding and removing worker nodes. WP3 will implement solutions and advanced policies for runtime cluster reconfiguration | Provided by underlying technology
RAL4. Support to HDFS data locations | The applications should be able to reference data in HDFS storage | No. Support to HDFS is an objective of the project | Yes | Provided by underlying technology
RAL5. Definition of QoS constraints in the programming interface | The programming interface must provide ways to express QoS that will be translated to resource constraints by the runtime | It is part of the WP5 activities | No. It is part of WP5 objectives | Yes
RAL6. Support to privacy in the definition of the algorithms | Include ways for the programmer to express the privacy characteristics of the developed algorithms | No. It is part of WP5 objectives | No. It is part of WP5 objectives | Yes
RAL7. Definition of Big Data workflows | Support the execution of big data workflows, where the input data and the products can be large (in the order of tens of GBs). | Yes. Workflows can be programmed in COMPSs without any API. Data dependencies are automatically discovered and managed by the runtime | Yes | Yes

Table 3 - Requirements and technologies

5 CONCLUSIONS

This document has provided the description of the programming abstraction layer of the EUBra-BIGSEA Platform. The objective of this document is to complement the deliverables D3.1, D4.1 and D6.1, which focus on the definition of the QoS infrastructure, the Big Data ecosystem and the security strategy. Here the aim is to identify the technologies that can be adopted by the users of the platform to transparently implement Big Data applications.

The analysis of the user requirements has led to the definition of a set of specifications for the implementation of the components. The basis of the abstraction layer is the COMPSs framework, which provides a programming model to define applications whose execution, where possible, is automatically parallelized. COMPSs will be extended to be interoperable with the Mesos middleware, thus benefitting from the capability of automatically increasing resources based on the QoS-driven mechanisms. COMPSs will be used to implement new workflows on top of the data analytics layer that will be used as building blocks for the end user applications. In order to complement COMPSs and to support existing ML algorithms, the Spark programming model will be used.

The Lemonade tool will be adopted as graphical interface to compose applications and to generate code for COMPSs and Spark.

6 REFERENCES

[R01] Badia RM, Conejero J, Diaz C, Ejarque J, Lezzi D, Lordan F, Ramon-Cortes C, Sirvent R. COMP Superscalar, an interoperable programming framework. SoftwareX. 2015; 3-4: 32-36. Available from: http://www.sciencedirect.com/science/article/pii/S2352711015000151.
[R02] Apache Parquet. Available from: https://parquet.apache.org/
[R03] Fiore S, Palazzo C, D'Anca A, Foster IT, Williams DN, Aloisio G. A big data analytics framework for scientific data management. IEEE Big Data Conference 2013: 1-8.
[R04] Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11). USENIX Association, Berkeley, CA, USA, 2011: 295-308.


7 GLOSSARY

Acronym | Explanation | Usage Scope
AAA | Authentication, Authorization and Accounting | Security
API | Application Programming Interface | Interfacing
CSV | Comma Separated Value | Data type
EC3 | Elastic Compute Cluster in the Cloud | Elasticity
ETL | Extraction, Transformation and Load | Data Integration
HDFS | Apache Hadoop Distributed File System | Storage
JSON | JavaScript Object Notation | Data type
JVM | Java Virtual Machine | Processing
MESOS | A resource management platform that abstracts CPU, memory, storage, and other compute resources away from machines | Resource Management
OLAP | Online Analytical Processing | Processing
QoS | Quality of Service | Scheduler