Business applications integration in the cloud
DESCRIPTION
Filip Rogaczewski, Atlassian Connect Team Lead. Presentation from Gdansk University of Technology about integrating business applications in the cloud, i.e. how to integrate 50 000+ servers together.
TRANSCRIPT
Filip Rogaczewski • Spartez/Atlassian
ETI graduate
Team leader in Spartez
Previously worked in Lufthansa, NASA, Intel
Running, biking, paragliding
Travelling
Photography
BUSINESS APPLICATIONS INTEGRATION IN THE CLOUD: HOW TO INTEGRATE 50 000+ SERVERS TOGETHER
Agenda
WHY
• Case studies
• Opportunity
HOW
• UI integration
• REST API
• Messaging
• Multi-tenancy
• Deployment
W H Y
Case study: Facebook
Recommended
Chat
Friends feed
Activity Stream
Applications
Many distinct services integrated into a single application
W H Y
Service Oriented Architecture
SOAP (Simple Object Access Protocol)
XML-RPC (remote procedure call)
RMI (remote method invocation)
CORBA
SOA: loosely coupled and independently working services
W H Y
Service Oriented Architecture
Scales the application
• Loosely coupled services
• Fewer resource restrictions for services
• Communication through a well-defined API
• Allows a better choice of technology for each service
• Distinct deployment models
[Diagram: Service, Service, Integration (HTTP), Container]
W H Y
Service Oriented Architecture
Type              CPU                 Memory   Disk
I    Web          (2) Xeon E5-2670    16 GB    (1) 500 GB SATA
III  DB           (2) Xeon E5-2660    144 GB   3.2 TB PCI Flash
IV   Hadoop       (2) Xeon E5-2660    64 GB    (15) 4 TB SAS
V    Haystack     (2) Xeon E5-2660    96 GB    (30) 4 TB SAS
VI   Cache        (2) Xeon E5-2660    144 GB   (1) 2 TB SATA
VII  Cold storage (2) Xeon E5-2660    144 GB   (240) 4 TB SATA
Different hardware stacks for different services at Facebook.
The problems Facebook faces today will be our problems in a few years.
W H Y
Service Oriented Architecture
More effective organisation
• Each team runs a single service.
• Each team is cross-functional (designers, product managers, testers, developers, ops engineers).
• Decisions about the roadmap happen locally.
• Geographically collocated teams: one service in the USA, a second in Australia, a third in Poland.
• Easier to scale work: multiple teams work at the same time.
What is the alternative?
W H Y
In Process Integration
In process
• Resources are shared
• Access to all data
• Doesn't scale
Tied to the stack
• Language
• Frameworks
No clear API boundaries
[Diagram: add-ons inside the application container]
Who else does integration?
W H Y
Spotify
Each item is a distinct service
Music stream
Friends feed
Browse music service
W H Y
Atlassian: JIRA
Attachments
HipChat
Confluence
Bitbucket
JIRA Agile
Internal application composition. Why else?
W H Y
Integrations of multiple applications
You can sell all your products instead of one.
W H Y
Extending with marketplace
Customers always want more features.
If you can't give it to them, let someone else do it: a marketplace.
Take 25% of what external vendors sell through your marketplace.
30 000 000 $/year
W H Y
Enterprise customers
Customers who want to integrate your product with their existing applications:
• HR
• Communication
• Environment
• CRM
• Asset management
• Supply chain
• GRC
• Finance
W H Y
Acquisitions
You buy the next fantastic company.
???
You want to quickly integrate this feature.
It can take a couple of months if you have an integration layer ready.
It might never be done if you don't.
Agenda
WHY
• Case studies
• Opportunity
HOW
• UI integration
• REST API
• Messaging
• Multi-tenancy
• Deployment
UI integration
H O W
How to embed external HTML here?
H O W
Iframe
Never embed HTML from external sites directly in the page. When you use iframes, the browser provides the security:
• Don't set sandboxing to allow-forms, allow-scripts, allow-same-origin, allow-top-navigation. This security model is very difficult to manage.
Sign the URL so the server rendering the content can authenticate the request.
Optionally pass context parameters.
Use CORS or postMessage for communication.
Watch for performance issues.
Security
H O W
Security: How to verify this request?
https://whoslooking-stg.herokuapp.com/poller?issue_key=ACJIRA-157
&tz=Australia%2FSydney
&loc=en-US
&user_id=frogaczewski
&user_key=frogaczewski
&xdm_e=https%3A%2F%2Fecosystem.atlassian.net
&xdm_c=channel-whoslooking-connect-stg__whos-looking
&cp=
&lic=none
&jwt=
eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJmcm9nYWN6ZXdza2kiLCJxc2giOiJiZjA1NmU5MjEzYjBkODIyNDAwNzg4YmQ4MThhNDk4YmM0NGQ0OTMyYTM2MWU1Mjk1ZjcwMTczOGRiMGRjOTA2IiwiaXNzIjoiamlyYTo1OTk3NWQ2Ny00Y2EwLTRlOWUtOTk2MC1kMWFhYWU3NmJiMzkiLCJleHAiOjE0MTMxMzI2NTksImlhdCI6MTQxMzEzMjQ3OX0.Da8VXjL_9z5xyzErtaJohHKH-xx-0Rp-9MF_xtIvcaY
H O W
Security: URL signing requirements
1. Signature, to validate who created the request.
2. Issuer: identifies the application instance that issued the request. Is this jiraForEti or jiraForGdanskUniversity?
3. Expiration time of the token: the time in UTC after which you should no longer accept the token.
4. Query hash: prevents URL tampering.
5. Id of the user, for authorisation.
6. Algorithm used to sign the URL.
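The six requirements map onto the parts of the jwt parameter in the request URL. A minimal Python sketch, assuming the usual three-part token layout and the claim names iss, exp, qsh and sub; the token below is built from made-up values purely for illustration, and nothing is verified at this stage:

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that tokens strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def read_token(token: str) -> dict:
    """Split header.claims.signature and decode the two JSON parts (no verification)."""
    encoded_header, encoded_claims, signature = token.split(".")
    header = json.loads(b64url_decode(encoded_header))
    claims = json.loads(b64url_decode(encoded_claims))
    return {
        "signature": signature,       # 1. proves who created the request
        "issuer": claims["iss"],      # 2. which application instance issued it
        "expires_at": claims["exp"],  # 3. expiration time, seconds since epoch (UTC)
        "query_hash": claims["qsh"],  # 4. hash of the request, prevents tampering
        "user": claims["sub"],        # 5. user id for authorisation
        "algorithm": header["alg"],   # 6. algorithm used to sign
    }


# Hypothetical token, built only to exercise read_token.
example_token = ".".join([
    b64url_encode(json.dumps({"typ": "JWT", "alg": "HS256"}).encode()),
    b64url_encode(json.dumps({"iss": "jiraForEti", "exp": 1413132659,
                              "qsh": "abc123", "sub": "frogaczewski"}).encode()),
    "not-a-real-signature",
])
parts = read_token(example_token)
```

Decoding only tells the service what the token claims; the signature still has to be checked before any of it is trusted.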
H O W
Security: Signature validation
1. The token has the following form: encodedHeader.encodedClaims.signature
2. Upon installation, the host and the service exchange a shared secret.
3. The service receives the public key of the host and has to verify it. Each service exposes a REST API for public key retrieval.
4. During a request, the service extracts the issuer and the signature algorithm from the URL and retrieves the sharedSecret for the issuer.
5. The service signs encodedHeader.encodedClaims with the algorithm from the header and verifies that the signatures match. If yes, it returns the content. If no, it returns 403 (forbidden).
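The signature validation steps on this slide can be sketched in Python for the shared-secret case. This is an illustration under assumptions, not the Atlassian implementation: it supports HMAC-SHA256 only, and the function names, the secret and the issuer value are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign(header: dict, claims: dict, secret: bytes) -> str:
    """Host side: build encodedHeader.encodedClaims.signature."""
    signing_input = (b64url_encode(json.dumps(header).encode()) + "." +
                     b64url_encode(json.dumps(claims).encode()))
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(mac)


def verify(token: str, secrets_by_issuer: dict, now: float) -> int:
    """Service side: 200 if the token is valid, 403 (forbidden) otherwise."""
    try:
        encoded_header, encoded_claims, signature = token.split(".")
        header = json.loads(b64url_decode(encoded_header))
        claims = json.loads(b64url_decode(encoded_claims))
    except ValueError:
        return 403  # malformed token
    # Step 4: look up the shared secret exchanged with this issuer at installation.
    secret = secrets_by_issuer.get(claims.get("iss"))
    if secret is None or header.get("alg") != "HS256":
        return 403
    # Step 5: sign encodedHeader.encodedClaims ourselves and compare signatures.
    signing_input = (encoded_header + "." + encoded_claims).encode("utf-8")
    expected = b64url_encode(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 403
    if claims.get("exp", 0) < now:
        return 403  # token expired
    return 200


secrets = {"jiraForEti": b"hypothetical-shared-secret"}
token = sign({"typ": "JWT", "alg": "HS256"},
             {"iss": "jiraForEti", "exp": 2000}, secrets["jiraForEti"])
tampered = token[:-1] + ("A" if token[-1] != "A" else "B")
```

The comparison uses hmac.compare_digest rather than == so that the time it takes does not reveal how many leading characters matched.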
IFRAME AND PARENT COMMUNICATION
H O W
Sandboxing
An iframe whose parent and child reside on different domains or hostnames constitutes a sandboxed environment. The contained page has no access to its parent. These restrictions are imposed by the browser's same-origin policy.
There are a few limitations applicable to iframes:
• Stylesheet properties from the parent do not cascade to the child page.
• The child page has no access to its parent's DOM and JavaScript properties.
• Likewise, the parent has no access to its child's DOM or JavaScript properties.
H O W
Cross origin resource sharing (CORS)
1. Keep a whitelist of URLs of the services allowed to access server resources.
2. When executing a cross-origin request, the browser sends the header:
   Origin: http://service.atlassian.net
3. If the service is whitelisted, the server should return:
   Access-Control-Allow-Origin: http://service.atlassian.net
4. There are further headers for:
   • choosing a subset of allowed headers (Access-Control-Allow-Headers)
   • choosing a subset of allowed HTTP methods (Access-Control-Allow-Methods)
DO NOT USE JSONP
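A small Python sketch of the server side of this exchange. The whitelist holds the origin from the slide; the allowed methods and headers are assumed values, and cors_headers is a hypothetical helper, not part of any framework:

```python
ALLOWED_ORIGINS = {"http://service.atlassian.net"}  # whitelist of trusted services


def cors_headers(origin):
    """Return the CORS response headers for the Origin header of a request."""
    if origin not in ALLOWED_ORIGINS:
        return {}  # no CORS headers: the browser refuses the cross-origin read
    return {
        "Access-Control-Allow-Origin": origin,           # echo the whitelisted origin
        "Access-Control-Allow-Methods": "GET, POST",     # assumed subset of methods
        "Access-Control-Allow-Headers": "Content-Type",  # assumed subset of headers
        "Vary": "Origin",  # caches must not reuse the response for another origin
    }


allowed = cors_headers("http://service.atlassian.net")
denied = cors_headers("http://evil.example")
```

Echoing a single whitelisted origin, instead of answering with a wildcard, keeps the decision about who may read the response on the server.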
H O W
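window.postMessage
1. Create a clear JS API between the parent and the iframe.
2. The parent creates an event listener for messages:
   window.addEventListener("message", executeXHR, false);
3. The client (iframe) executes:
   // Functions cannot be serialised, so the message carries data only;
   // the iframe listens for the reply instead of passing a success callback.
   window.parent.postMessage(
     JSON.stringify({type: "request", url: "/rest/api/2/dashboard"}),
     "https://ecosystem.atlassian.net"  // origin of the parent, here the host from the example URL
   );
4. The parent executes the request on behalf of the child and posts the result back with postMessage.
5. This is difficult to implement. The host should provide a library that abstracts the JS functions it can handle.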
Performance
H O W
Performance: Apdex
New Relic: measuring user satisfaction
• At Atlassian:
  • Satisfied: 1 s
  • Tolerating: 3 s
• Our Apdex goal is 0.9.
• An Apdex between 0.85 and 0.93 is considered a good score.
• Users of business applications are more tolerant than users of customer applications.
• Financial services are out of scope.
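For reference, the Apdex score is the number of satisfied samples plus half the tolerating samples, divided by all samples. A minimal Python sketch using the two thresholds from this slide (1 s and 3 s); the sample response times are made up:

```python
def apdex(response_times_s, satisfied_s=1.0, tolerating_s=3.0):
    """Apdex = (satisfied + tolerating / 2) / total number of samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= satisfied_s)
    tolerating = sum(1 for t in response_times_s
                     if satisfied_s < t <= tolerating_s)
    # Anything slower than tolerating_s counts as frustrated and scores zero.
    return (satisfied + tolerating / 2) / len(response_times_s)


samples = [0.4, 0.7, 0.9, 2.5, 6.0]  # 3 satisfied, 1 tolerating, 1 frustrated
score = apdex(samples)
```

With these samples the score is (3 + 0.5) / 5 = 0.7, well below the 0.9 goal.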
H O W
Performance: Latency
1. Latency:
   Within California? 30 ms
   Within Europe? 30 ms
   Across the Atlantic? 90 ms
   US to Australia? 210 ms
   EMEA to Asia Pacific? 250 ms
2. Response times of the application differ between geographical regions. A customer in the US will usually see much better performance than one in Europe.
3. Use a CDN to cache static resources (Akamai, CloudFront, EdgeCast).
4. There are enterprise-class solutions for reducing latency (Verizon Enterprise Solutions).
H O W
Performance: iframe request
Page containing an iframe
H O W
Performance: iframe request
Page containing non-iframe embedded content
REST API
H O W
How do I change this data?
W H Y
REST API
Representational State Transfer.
API stands for Application Programming Interface.
For an API to make sense, it needs to be stable. Each service needs an API policy.
Unless a REST API creates a security risk, it can't change without prior notice: a deprecation period during which services can move to a valid replacement, or an announced end of life for the feature.
Unfortunately, errors are also part of the API. Even bad return codes can't change, for instance.
The API should be versioned. Don't change the current API; release a new one.
"Be liberal with what you accept, be consistent with what you return."
Be precise about the accepted and returned content type.
W H Y
GET method
rest/api/issue/ should return all issues?
NO. Collections should always be paginated. Returning everything is never realistic in large systems.
rest/api/issue/ACJIRA-1 should return the details of a particular issue?
NOT all of them. Let the user define, as a query parameter, which fields should be returned. You are losing precious CPU cycles and network bandwidth by returning everything.
rest/api/issue/ACJIRA-1 should return an ETag.
ETag header in the response to GET: "ETag: xyz". Second request with the header: "If-None-Match: xyz". 304 when not modified, 200 OK with a new ETag when changed, or 404 when not found.
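The three GET rules on this slide (paginate collections, let the caller pick fields, support ETag) can be sketched in Python as plain functions. The in-memory store, the parameter names start_at, max_results and fields, and the page-size cap of 50 are all assumptions made for the example:

```python
import hashlib
import json

# Hypothetical in-memory store standing in for the issue database.
ISSUES = {f"ACJIRA-{n}": {"key": f"ACJIRA-{n}", "summary": f"Issue {n}", "status": "Open"}
          for n in range(1, 101)}


def list_issues(start_at=0, max_results=25):
    """GET rest/api/issue/: always returns one bounded page, never everything."""
    max_results = min(max_results, 50)  # server-side cap (assumed value)
    keys = sorted(ISSUES, key=lambda k: int(k.split("-")[1]))
    return {"startAt": start_at, "maxResults": max_results,
            "total": len(keys), "issues": keys[start_at:start_at + max_results]}


def get_issue(key, fields=None, if_none_match=None):
    """GET rest/api/issue/<key>: returns (status, headers, body)."""
    issue = ISSUES.get(key)
    if issue is None:
        return 404, {}, None
    if fields:
        # Return only the fields the caller asked for.
        issue = {name: issue[name] for name in fields if name in issue}
    body = json.dumps(issue, sort_keys=True)
    etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:16] + '"'
    if if_none_match == etag:
        return 304, {"ETag": etag}, None  # not modified: no body is sent
    return 200, {"ETag": etag}, body


status, headers, body = get_issue("ACJIRA-1", fields=["summary"])
second_status, _, second_body = get_issue("ACJIRA-1", fields=["summary"],
                                          if_none_match=headers["ETag"])
last_page = list_issues(start_at=95, max_results=200)
```

Here the ETag is derived from the response body, so a client that repeats the request with If-None-Match gets a 304 without a body until the issue changes.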
W H Y
HATEOAS
rest/api/issue/ACJIRA-1/delete is not a valid GET usage.
Use HATEOAS (Hypertext As The Engine Of Application State):
{"href": "rest/api/issue/ACJIRA-1", "rel": "self", "method": "GET"}  idempotent
{"href": "rest/api/issue", "rel": "all-paginated", "method": "GET"}  idempotent
{"href": "rest/api/issue", "rel": "create", "method": "POST"}  not idempotent
{"href": "rest/api/issue/ACJIRA-1", "rel": "update", "method": "PUT"}  idempotent
{"href": "rest/api/issue/ACJIRA-1", "rel": "delete", "method": "DELETE"}  idempotent
{"href": "rest/api/issue/ACJIRA-1", "rel": "partial-update", "method": "PATCH"}  not idempotent
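A server can generate these links for every issue it returns, and a client can use the method of a link to decide whether a failed call is safe to repeat. A short Python sketch; issue_links and safe_to_retry are hypothetical helper names:

```python
# GET, PUT and DELETE can be repeated without changing the outcome;
# POST and PATCH are not guaranteed to be idempotent.
IDEMPOTENT_METHODS = {"GET", "PUT", "DELETE"}


def issue_links(issue_key):
    """Build the HATEOAS links returned alongside one issue."""
    collection = "rest/api/issue"
    item = f"{collection}/{issue_key}"
    return [
        {"href": item, "rel": "self", "method": "GET"},
        {"href": collection, "rel": "all-paginated", "method": "GET"},
        {"href": collection, "rel": "create", "method": "POST"},
        {"href": item, "rel": "update", "method": "PUT"},
        {"href": item, "rel": "delete", "method": "DELETE"},
        {"href": item, "rel": "partial-update", "method": "PATCH"},
    ]


def safe_to_retry(link):
    """A client may blindly repeat a call only if its method is idempotent."""
    return link["method"] in IDEMPOTENT_METHODS


links = issue_links("ACJIRA-1")
not_retryable = [link["rel"] for link in links if not safe_to_retry(link)]
```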
W H Y
REST API security
Prefer the same mechanism as for UI authentication.
It is possible to use BasicAuth or OAuth, but only over SSL/TLS.
Always check the permissions of the user.
An interesting problem: we have a project ACJIRA and a user, Filip, who can't access the project. What return code should he get?
403 (forbidden) reveals that the project exists. Projects are often named after the company for which the service is provided, and companies may not want to publicly acknowledge a relationship with another company.
It should be 404 (not found).
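In code, this means the "does not exist" and "not allowed" branches must be indistinguishable to the caller. A minimal Python sketch with hypothetical permission data:

```python
# Hypothetical permission data: project key -> users allowed to see it.
PROJECT_MEMBERS = {"ACJIRA": {"anna"}}


def project_status(project_key, user):
    """Status code for a GET on a project, hiding projects the user cannot see."""
    members = PROJECT_MEMBERS.get(project_key)
    if members is None or user not in members:
        return 404  # same answer for "does not exist" and "not allowed"
    return 200


for_filip = project_status("ACJIRA", "filip")
for_missing = project_status("NOSUCH", "filip")
for_member = project_status("ACJIRA", "anna")
```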
W H Y
AaaS (API as a Service)
You don't need to write all APIs yourself. You can integrate with existing APIs.
There are API directories and marketplaces where you can buy APIs.
Be careful when passing user data to external services.
Messaging
H O W
How do I know about data change?
The CI server doesn't execute a PUT request to /issue/ACJIRA-27 saying that the build completed. How would it know who is interested?
Instead, it publishes the information that the build was completed, and jira-build-monitor-service registers a listener for this information.
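The publish and listen relationship can be sketched with a tiny in-process registry. In a real deployment the bus would be a network service; the EventBus class, the topic name build.completed and the message fields are assumptions for illustration:

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process publish/subscribe registry."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, topic, listener):
        self._listeners[topic].append(listener)

    def publish(self, topic, message):
        # The publisher does not know who listens; it only names the topic.
        for listener in self._listeners[topic]:
            listener(message)


bus = EventBus()
issue_builds = {}  # what jira-build-monitor-service would keep per issue


def on_build_completed(message):
    issue_builds[message["issue"]] = message["status"]


# The monitor service registers its interest; the CI server only publishes.
bus.subscribe("build.completed", on_build_completed)
bus.publish("build.completed", {"issue": "ACJIRA-27", "status": "passed"})
```

New consumers can be added with another subscribe call, without touching the CI server.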
H O W
Messaging
There are many approaches and concepts around messaging. The key differentiator is the message delivery guarantee.
It is easy to guarantee 90% or 95% message delivery. Assuring 100% delivery is almost impossible and may require a complete service rewrite.
It is very important to understand the use case in order to decide what delivery guarantee is expected.
Send messages asynchronously. Connections are precious resources for your service.
Messages are API as well. They should have a clear contract and a deprecation policy. Make them granular.
Specify the content type. Be careful with content-length: a message that is too long may DoS the receiver.
Sign the request.
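The last three rules (content type, length limit, signature) can be checked by the receiver before it does any work. A minimal Python sketch; the 64 KB limit, the X-Signature header name and the HMAC-SHA256 scheme are assumptions, not something the slides prescribe:

```python
import hashlib
import hmac

MAX_BODY_BYTES = 64 * 1024         # assumed upper bound for one message
ACCEPTED_TYPE = "application/json"  # the only content type in this contract


def sign_body(secret: bytes, body: bytes) -> str:
    """Sender side: hex HMAC-SHA256 of the body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def accept_message(headers: dict, body: bytes, secret: bytes) -> int:
    """Receiver side: return an HTTP status for an incoming message."""
    if headers.get("Content-Type") != ACCEPTED_TYPE:
        return 415  # unsupported media type
    if len(body) > MAX_BODY_BYTES:
        return 413  # payload too large
    expected = sign_body(secret, body)
    received = headers.get("X-Signature", "")
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return 401  # signature missing or wrong
    return 202  # accepted, to be processed asynchronously


secret = b"hypothetical-shared-secret"
body = b'{"issue": "ACJIRA-27", "status": "passed"}'
good_headers = {"Content-Type": "application/json",
                "X-Signature": sign_body(secret, body)}
```

Answering 202 and processing later keeps the connection short, in line with sending messages asynchronously.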
H O W
What can go wrong?
Server dies during a change.
  Event sourcing: record each change in a database. If the server died, there is no change to send a message about. Each change has a sequence number.
Server died after the change, before sending the message.
  Database trigger: move the message to a queue. What if the database server dies?
What if the message was not delivered?
  Resend with a possible-duplicate flag. Is the order preserved? Who is controlling this? What if the controlling node of the publisher dies?
Server died while processing the message?
  Pull the message again with a REST request to the publisher. Parametrise the request with the last successfully processed message.
  Use a queue service implementation acting as a proxy, Amazon SQS for instance.
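Sequence numbers make both duplicates and gaps detectable on the consumer side. A Python sketch of the "pull again from the last successfully processed message" idea; Publisher, Consumer and events_since are hypothetical names, and the in-memory list stands in for the change database:

```python
class Publisher:
    """Records every change with a sequence number (event sourcing)."""

    def __init__(self):
        self._log = []  # index i holds the change with sequence number i + 1

    def record(self, change):
        self._log.append(change)
        return len(self._log)  # sequence number of this change

    def events_since(self, last_seq):
        """What a REST endpoint parametrised with the last processed message returns."""
        return [(seq, change)
                for seq, change in enumerate(self._log, start=1)
                if seq > last_seq]


class Consumer:
    def __init__(self):
        self.last_seq = 0
        self.processed = []

    def handle(self, seq, change):
        """Apply a pushed message; return False if it was skipped."""
        if seq != self.last_seq + 1:
            return False  # duplicate (seq too low) or gap (seq too high)
        self.processed.append(change)
        self.last_seq = seq
        return True

    def catch_up(self, publisher):
        """Pull everything after the last successfully processed message."""
        for seq, change in publisher.events_since(self.last_seq):
            self.handle(seq, change)


publisher = Publisher()
for change in ["created", "assigned", "resolved"]:
    publisher.record(change)

consumer = Consumer()
consumer.handle(1, "created")
consumer.handle(1, "created")    # resent duplicate: ignored
consumer.handle(3, "resolved")   # message 2 was lost: gap detected, skipped
consumer.catch_up(publisher)     # pull fills the gap in the right order
```

Because the consumer only ever applies the next expected sequence number, order is preserved no matter how messages arrive.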
H O W
Eventually consistent
It costs a lot of money to guarantee message delivery (implementing all the steps from the previous slide).
Most business applications can live without reliable messaging for a while.
When running 52 000 servers or more (it will always be more), you need to acknowledge that things are going to fail and messages are not going to be delivered.
Apply a resilient architecture that polls for data changes (event sourcing again) if the messages are not delivered.
MULTI-TENANCY
H O W
How do I ensure I display proper data?
I want to display information about related pages owned only by this customer.
I want to display information only about source code changes made by the organisation of my current customer.
H O W
Multi-tenancy
The ability of a single application to serve requests from multiple customers at the same time.
When the application is written for on-premises clients, it doesn't make sense to support multiple organisations.
When the application is written for the cloud, it doesn't make sense to host each customer separately. Customers with a single office use JIRA 8 h a day; the server can serve other customers for the remaining 16 h.
A single server can process 500 concurrent users. It can host 10 small companies.
The application should be written to run with 0 tenants as well as with 1000 tenants.
H O W
Multi-tenancy is difficult
We have data of Nike, NASA and Twitter. We can't leak this data.
Encrypted information about the tenant needs to be propagated with each request. When passing this information, it must be encrypted along with a timestamp.
The tenant id must be unique and strong. DON'T put the hostname, organisation name or any other data into the tenant id. This data will change. We had an error: https://ecosystem.atlassian.net/browse/AC-811
Tenant id is public.
OpenID provider for all services.
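Two of these rules lend themselves to a short Python sketch: an opaque tenant id that contains no data that could change, and a data access layer in which every read is scoped by tenant. The use of a random UUID and the PageStore class are assumptions for illustration; encrypting the tenant information in transit is not shown:

```python
import uuid
from dataclasses import dataclass


def new_tenant_id():
    """Opaque random id: no hostname or organisation name that could change later."""
    return str(uuid.uuid4())


@dataclass(frozen=True)
class Page:
    tenant_id: str
    title: str


class PageStore:
    def __init__(self):
        self._pages = []

    def add(self, page):
        self._pages.append(page)

    def related_pages(self, tenant_id):
        """Every query takes the tenant id; there is no unscoped read."""
        return [p.title for p in self._pages if p.tenant_id == tenant_id]


tenant_a, tenant_b = new_tenant_id(), new_tenant_id()
store = PageStore()
store.add(Page(tenant_a, "Launch plan"))
store.add(Page(tenant_b, "Quarterly roadmap"))
store.add(Page(tenant_a, "Retrospective"))
```

Renaming a customer or moving it to another hostname then changes nothing about the id under which its data is stored.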
DEPLOYMENT
H O W
How do I deploy this?
52 000 servers in multiple data centers.
Differences in:
 - OS version (good if the OS is the same)
 - hardware
 - database version
 - schema version
You can't update everything at the same time:
 - no expected downtime
 - data centers not optimised for 100% energy utilisation
 - data centers not optimised for the heat.
Services are updated independently:
 - each team owns its own deployment schedule
 - each team may maintain a couple of versions of its services
 - experimental features may be enabled/disabled on some services
H O W
Fast Five - Quality at speed
Stage | Behaviour            | Code                 | Data schema | Activation                           | Comment
1     | Old                  | Old                  | Old         | Deployment                           | Code is running as is.
2     | Old                  | New and old together | Old         | Deployment                           | New code deployment.
3     | Old                  | New and old together | New         | Deployment or Configuration          | Database migration.
4     | New and old together | New and old together | New         | Deployment, Configuration or Context | Slowly enable the feature on all racks. Features might be enabled in various configurations.
5     | New                  | New                  | New         | Deployment                           | Delete the obsolete code.
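Stage 4, where old and new behaviour run side by side, is typically driven by a feature flag. A Python sketch of a percentage rollout per rack; hashing the rack id into a bucket is one possible way to do this, and all names and values are hypothetical:

```python
import hashlib


def rack_bucket(rack_id):
    """Stable bucket 0..99 per rack, so a rack keeps its decision between requests."""
    digest = hashlib.sha256(rack_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_behaviour(rack_id, rollout_percent):
    """True for roughly rollout_percent of racks; raising the value only adds racks."""
    return rack_bucket(rack_id) < rollout_percent


def handle_request(rack_id, rollout_percent):
    # Old and new code are deployed together (stages 2 to 4); the flag picks one.
    if use_new_behaviour(rack_id, rollout_percent):
        return "new"
    return "old"


racks = [f"rack-{n}" for n in range(20)]
enabled_at_25 = {r for r in racks if use_new_behaviour(r, 25)}
enabled_at_50 = {r for r in racks if use_new_behaviour(r, 50)}
```

Setting the percentage to 0 gives stage 2 or 3 behaviour, 100 gives the state just before stage 5, and the value can be changed through configuration without a new deployment.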
H O W
DEV/DOG/PROD
Deployments never go to the client first.
First versions are deployed to the development environment, where they are tested against the production versions of the remaining services.
Good development versions are promoted to the dogfood environment, where they are used internally against the production versions of other services.
Good dogfooding versions are promoted to the production environment. Features are slowly enabled in production.
Possible issues:
 - A new service was not tested against all versions running in production.
 - A couple of new services were deployed at the same time and were never tested together.
The release manager should resolve these issues and schedule the feature release.
Thank you!