TRANSCRIPT
A data science and machine learning
consulting and development agency.
Data science projects are complex and require a range of technical
skills to ensure successful outcomes. Blue Orange offers full-service
development to accelerate the delivery of cutting-edge data insights.
Think of it as data science as a service.
Predictive Analytics
Make Data Work for You
Predictive analytics uses techniques from data mining, statistics, modeling, and machine learning to analyze current data and make predictions about the future.
Data Warehousing
A unified data warehouse is a federated repository that stores all data types. This unification simplifies access and expands analytics capabilities beyond transaction-based data storage.

Data Visualization
Data visualization provides a quick, clear understanding of the information. Advanced analytics outputs require accurate and transparent representation to drive adoption and support consistent decision making.
Our Capabilities
AWS Registered Partner
What we do:
Provide full-stack data science support to scale the capabilities of internal data teams. From cloud infrastructure engineering and custom ML development to integrated dashboards and data strategy, Blue Orange helps companies make better decisions with their data.
● Dynamic Pricing
● Lead Segmentation
● Customer Churn
● Recommenders
● Trend Analysis
● Customer Lifetime Value
● Natural Language Processing
● Talent Analytics
Experience
Our Approach
Case Study
Since we had a production predictive model, our focus was to identify and improve future results based on existing results. We began by aggregating historical data to estimate the value of each keyword as a benchmark. The bank had stored historical bid data from previous campaigns, from which we determined correlations for keyword attribution.
1. Estimate Keyword Value
2. Model Selection
3. Solutions
Sector: Banking
Vertical: Marketing Optimization
The Problem
Technologies
The bank used a third-party 'black box' ad bidding system and wanted to verify its efficacy while improving price-per-click (PPC) bids on search terms in the Google Ads marketplace. The company lacked an accurate way to estimate its marketing attribution per keyword and relied on qualitative tracking to assess the third-party bidding tool.
We identified this as a Multi-Armed Bandit problem (or a contextual bandit, given the keyword estimation data), in which a fixed set of resources must be allocated among alternative options. In this case, the Estimated Keyword Values were applied against competing campaigns.

This approach shifted preference toward campaigns performing well within target estimates, while deranking variations likely to underperform.
Model: Upper Confidence Bound / Epsilon-Greedy
A few approaches were applied to optimize different campaigns. Other models were tested, but these produced the most immediate improvement:

Upper Confidence Bound: This strategy is based on the Optimism in the Face of Uncertainty principle and assumes that the unknown mean payoff of each arm is as high as the observable data plausibly allows.

Epsilon-Greedy: A randomly chosen campaign was selected a fraction ε of the time; the rest of the time, the arm with the highest known payout was pulled. Outcomes were compared and the estimates reinforced accordingly.
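A minimal sketch of the two selection strategies in Python; the campaign payoffs simulated below are hypothetical, not the bank's production system:

```python
import math
import random

class Bandit:
    """Tracks pull counts and mean payoff per campaign (arm)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running-mean update
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def select_ucb(self, c=2.0):
        # Pull each arm once before applying the UCB formula
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        # Optimism in the Face of Uncertainty: mean + exploration bonus
        return max(range(len(self.means)),
                   key=lambda a: self.means[a]
                   + math.sqrt(c * math.log(total) / self.counts[a]))

    def select_epsilon_greedy(self, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(len(self.means))  # explore
        return max(range(len(self.means)), key=lambda a: self.means[a])  # exploit

# Usage: pick an arm, observe a (simulated) reward, update the estimates
bandit = Bandit(n_arms=3)
for _ in range(100):
    arm = bandit.select_ucb()
    reward = random.gauss([0.2, 0.5, 0.3][arm], 0.1)  # hypothetical payoffs
    bandit.update(arm, reward)
print(bandit.means)
```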
Extermax was looking to optimize its customer acquisition process and wanted a data-driven methodology. The company aimed to predict the highest-value customer engagement touchpoints and to model Customer Lifetime Value (CLTV). The focus was on sales and marketing optimization, but other departments ultimately used CLTV to calculate benefits.

They aimed to measure and determine optimal solutions for the following:
● How much should I spend to acquire a customer?
● What types of customers should sales reps spend the most time trying to acquire?
● What is the most effective marketing touchpoint, and at what frequency?
We developed a user acquisition tracking platform for a mobile gaming client. We built a predictive tracking tool to customize each marketing touchpoint for potential customers, and we calculated the Lifetime Value of each individual user across several cohorts segmented by network.
● Designed and implemented customer-based prediction models (linear regression, NBD/Pareto) to calculate Lifetime Value per user (see the sketch after this list).
● Applied user segmentation for enhanced user acquisition.
● Determined the specific efficiency (profit/loss) of individual ad networks.
● Performed several user segmentation techniques: K-NN + PCA and customized RFM analysis.
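A minimal sketch of that Lifetime Value step using the open-source lifetimes library's Pareto/NBD implementation; the transaction log and 90-day horizon are illustrative assumptions, not the client's actual data:

```python
import pandas as pd
from lifetimes import ParetoNBDFitter
from lifetimes.utils import summary_data_from_transaction_data

# Hypothetical transaction log: one row per user purchase
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-08",
        "2023-01-20", "2023-03-01", "2023-02-15",
    ]),
})

# Collapse transactions into frequency/recency/T per user
summary = summary_data_from_transaction_data(
    transactions, "user_id", "date", observation_period_end="2023-03-31"
)

# Fit the Pareto/NBD model and project purchases over the next 90 days
model = ParetoNBDFitter(penalizer_coef=0.01)
model.fit(summary["frequency"], summary["recency"], summary["T"])
summary["predicted_90d"] = model.conditional_expected_number_of_purchases_up_to_time(
    90, summary["frequency"], summary["recency"], summary["T"]
)
print(summary)
```

The per-user projection can then be combined with an average margin to yield a dollar-denominated Lifetime Value per cohort.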
The Problem
Case Study
Sector: Gaming
Vertical: CLV / Market Segmentation
Technologies
Model: K-NN/PCA
Model: NBD/Pareto
We developed a custom data platform to track Exelon's corporate reputation. The project included ingesting, correlating, and aggregating all publicly available media sources on the open web related to the company or its corresponding reputation drivers.

All data was scraped, ingested, and processed using a series of NLP techniques, including sentiment analysis, topic modeling, classification, and grouping. For advanced analysis, data was persisted in a graph database for in-pipeline analytics. The underlying data volume was very large, requiring in-memory processing for ongoing analysis.
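As a minimal sketch of the topic modeling stage of such a pipeline, using scikit-learn; the sample documents are placeholders, not Exelon's actual sources:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents standing in for scraped media articles
docs = [
    "utility announces new renewable energy investment",
    "regulators review energy pricing and grid reliability",
    "community praises company sustainability program",
]

# Bag-of-words representation, dropping English stop words
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a small LDA model and print the top terms per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```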
We delivered a series of custom dashboards focused on topic modeling, time series, anomaly detection, and aggregation summaries in an interactive application. Due to the data volume, we used both large precomputation jobs and near-real-time aggregation indexing to keep the dashboards interactive.

Exelon had identified a range of brand reputation drivers for the company, which now inform all communications and marketing activities. Exelon required a natural language analysis engine to measure, correlate, and visualize those drivers across online and earned media and their potential effect on the company.
The Problem
Case Study
Sector: Energy
Vertical: Marketing Optimization
Technologies
Model: Natural Language Processing
In an effort to systematically improve data standardization and quantify the hiring pipeline, we applied numerous data science techniques to two foundational aspects of that pipeline.

Unstructured to Structured Data Processing
● We used pLSA/LDA for resume topic modeling, applied to extract structured attributes from unstructured associated text.
● We applied SVM, random forest, and other models to classify and clean the extracted content based on weighted factors provided by the SMEs.
Candidate Scoring
● We first implemented a weighted heuristic model to establish a benchmark.
● For improved and standardized candidate ranking, we then used a logistic regression model trained on a heavily engineered feature set (see the sketch after this list).
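A minimal sketch of the two scoring stages; the feature names, weights, and labels are hypothetical stand-ins for the SME-provided factors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical structured candidate features extracted upstream:
# [years_experience, skill_match, education_score]
X = np.array([[5, 0.8, 0.9], [1, 0.3, 0.5], [8, 0.9, 0.7], [2, 0.4, 0.6]])
y = np.array([1, 0, 1, 0])  # historical hire / no-hire labels

# Stage 1: weighted heuristic benchmark (illustrative SME weights)
weights = np.array([0.4, 0.4, 0.2])
heuristic_score = X @ weights
print("heuristic ranking:", np.argsort(-heuristic_score))

# Stage 2: logistic regression trained on the engineered features
model = LogisticRegression().fit(X, y)
print("model ranking:", np.argsort(-model.predict_proba(X)[:, 1]))
```

Comparing the two rankings against the heuristic benchmark is what allows the learned model's lift to be measured before rollout.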
Point72 Asset Management was looking to quantify beneficial hiring characteristics and to develop predictive hiring indicators to filter candidate applications. They had 10 years of unstructured free text across resumes, third-party data, and interview notes, containing large amounts of unstructured data (free text, scans, emails). They were looking to standardize this data for improved analysis and to reveal non-standard correlative success factors.
The Problem
Case Study
Sector: Finance
Vertical: Talent Analytics
Model: SVM/Random Forest
Technologies
To save time and money, we opted to verify our model on a derived dirty dataset with characteristics similar to our target data. The benefit of generated data is an accurately labeled dataset that isolates model accuracy from data accuracy. Since we would otherwise have required manually generated training data, this allowed us to test and train multiple models before investing effort in manual data curation.
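A minimal sketch of deriving such a labeled dirty dataset from clean records; the corruption operations are illustrative assumptions about the kinds of noise involved:

```python
import random

random.seed(0)

def corrupt(record, p=0.3):
    """Randomly inject typos and missing values, keeping the clean label."""
    dirty = {}
    for field, value in record.items():
        if random.random() < p:
            if random.random() < 0.5 and value:
                i = random.randrange(len(value))
                # typo: swap one character
                dirty[field] = value[:i] + random.choice("abcdefg") + value[i + 1:]
            else:
                dirty[field] = None  # missing value
        else:
            dirty[field] = value
    return dirty

clean = {"first": "jane", "last": "doe", "email": "jane@example.com"}
# Each (dirty, clean) pair is an accurately labeled training example
pairs = [(corrupt(clean), clean) for _ in range(3)]
for dirty, truth in pairs:
    print(dirty, "->", truth)
```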
Point72 was looking to enrich highly variable resume data with many additional data sources. Each data source was riddled with inconsistencies, misspellings, and missing data. This made record linkage highly inconsistent and prevented accurate merging, deduplication, and association, impacting nearly 60%-70% of the applicable candidate data.
The Problem
Case Study
Sector: Finance
Vertical: Talent Analytics
Using field names, data types, and distributional characteristics, we predicted related column values for linkage. This automated schema mapping, speeding up ingestion of new data. For highly regular fields (like names, dates of birth, and emails), data matching accuracy was 96%, even with field name variation (e.g., first <> name).
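A minimal sketch of that idea, scoring column matches by name similarity plus value overlap; the fields, data, and 50/50 weighting are illustrative assumptions:

```python
from difflib import SequenceMatcher

def column_similarity(name_a, values_a, name_b, values_b):
    """Blend field-name similarity with value overlap as a linkage score."""
    name_score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    union = set(values_a) | set(values_b)
    overlap = len(set(values_a) & set(values_b)) / max(len(union), 1)
    return 0.5 * name_score + 0.5 * overlap

# Two hypothetical source schemas with differently named columns
source = {"first": ["jane", "john"], "dob": ["1990-01-01", "1985-05-05"]}
target = {"name": ["jane", "alice"], "date_of_birth": ["1990-01-01", "1970-07-07"]}

for s_col, s_vals in source.items():
    best = max(target, key=lambda t: column_similarity(s_col, s_vals, t, target[t]))
    print(f"{s_col} -> {best}")
```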
1. Model Verification
2. Inferring Semantic Schema Relations
3. LSTM using Semantic Representation
We found the best accuracy using a recurrent neural network that applied a semantic representation of each entity to determine potential linkages. We used language-level transfer learning, leveraging FastText to identify the semantic meaning of potentially related node values. Initial implementations of our model achieved up to 93% accuracy, ahead of industry-standard linkage approaches.
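A minimal sketch of the semantic-similarity idea using gensim's FastText, trained on a toy corpus purely for illustration; the production system used pretrained language-level embeddings feeding the LSTM:

```python
from gensim.models import FastText

# Toy corpus standing in for field values; production would use
# pretrained language-level FastText vectors instead.
sentences = [
    ["first", "name", "given", "forename"],
    ["email", "address", "contact"],
    ["birth", "date", "dob", "born"],
]
model = FastText(sentences, vector_size=32, min_count=1, epochs=50, seed=0)

# Subword embeddings yield vectors even for unseen variants like "firstname"
print(model.wv.similarity("first", "forename"))
print(model.wv.similarity("firstname", "name"))
```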
Technologies
Model: Mirrored LSTM
After iterating through industry-standard heuristic and statistical methods, we opted for a deep learning solution to meet the complexity of the dirty, inconsistent data set. We based our solution on MassMutual's industry-leading approach and were able to achieve higher accuracy.
Case Study
Sector: CRM/ERP
Vertical: Sales Optimization
The Problem
Three: Calculate a CPC that Promotes Your Goals
Stage three combines the estimated dollar value determined from stage one's artificial-intelligence-powered PPC calculations with the cost ecosystem analyzed in stage two. The decision engine applies the advertiser's targets, bidding strategies, and goals, then selects the best bids to maximize performance given the data, calculations, and goals. Often, a portfolio approach is used to bid against a target while maintaining an efficiency metric. This is the modern approach to PPC bid optimization that most bid management tools utilize, if they're designed for medium to large SEM programs. QuanticMind differs in some ways from legacy tools, discussed further in the Guide.
Four: Calculate Bid Adjustments
Stage four repeats nearly the same process completed in the first three steps, but on a different set of data and with a different purpose: calculating and automatically applying bid adjustments. QuanticMind’s model shines at this point, using machine learning to optimize bid adjustments at scale. Device Bid Modifiers, Geo Location Bid Modifiers, and Audience Bid Modifiers can all be automatically calculated and applied, based on their relative successes in the SEM program. The data science algorithms used here are another advantage when attempting to calculate optimized bids at scale.
Five: Anomaly Detection
Stage five moves into the often understated, but highly important, anomaly detection. This is one of several areas where the infrastructure discussed at the top can "flex" its strength. When designed for effective capturing, cleaning, and piping of data from any source, the system provides better data for better execution. The opposite has negative effects: when data is missing, or differs from what forecasts would suggest is reasonable, performance can take a hit. Fully optimized bidding platforms prevent these problems by using multiple anomaly detection and issue-prevention steps, ensuring bids aren't pushed based on bad data.
Technologies
A middle-market PE firm needed help integrating four acquired CRM/ERP companies. The firm introduced Blue Orange to the CEO of the merged company to provide architectural guidance on its data infrastructure to support unified data and sales optimization. Due to disparate data sets, the company had no insight into the efficacy of its upper-funnel engagement or attribution across its sales cycle.
Stage: Due Diligence Audit

Business Challenge
1. Siloed data systems hinder coordination, planning, and tracking.
2. Low conversion on sales efforts.
3. Lack of visibility into sales processes.
4. The development team has no resources for an internally focused, standalone project.

Blue Orange Design Considerations
1. Scalable architecture creates confidence that data-driven operations will not be outpaced by growth.
2. Increase top-of-funnel conversion using ML prediction to improve lead segmentation.
3. Improve sales modeling and oversight with real-time, full-funnel dashboards.
4. Getting quick results was crucial: solve the problem quickly, then add complexity later.
Blue Orange helped build the first production prototype of PingThings' PredictiveGrid. The PredictiveGrid is an Advanced Sensor Analytics Platform (ASAP) architected to ingest, store, access, visualize, and analyze sensor data measuring the grid at nanosecond temporal resolution, and to train machine learning and deep learning algorithms on that data.
Initial predictive problems addressed:
● Rapid post-event analysis and reporting
● Sensor data cleaning and management
● Fault detection, prediction, and localization
● Anomaly identification, classification, and prediction
● Failure signature identification
Case Study
Sector: Energy
Vertical: Analytics
The Problem
Technologies
PingThings was a startup looking to build a real-time platform that leverages machine learning for physical systems on the electric utility grid and for high-value industrial assets such as GSU transformers and step-down transformers. They wanted an analytics platform to track sensor data, focused on storing and manipulating time-series data and modeling complex relationships between synchrophasors' high-resolution signals.
Model: LightGBM/XGBoost
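A minimal sketch of the fault detection idea, training gradient boosting (here via xgboost's scikit-learn API) on windowed sensor statistics; the signals and features are synthetic stand-ins, not PingThings' actual data:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

def window_features(windows):
    """Summarize each sensor window with simple statistics."""
    return np.column_stack([
        windows.mean(axis=1),
        windows.std(axis=1),
        np.ptp(windows, axis=1),  # peak-to-peak range
    ])

# Synthetic synchrophasor-like windows: mostly nominal, ~10% faulty
n_windows, size = 200, 50
labels = (rng.random(n_windows) < 0.1).astype(int)
windows = rng.normal(60.0, 0.01, (n_windows, size))
windows[labels == 1] += 0.5  # fault windows drift off nominal

X = window_features(windows)
model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, labels)
print("training accuracy:", (model.predict(X) == labels).mean())
```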
Data Science: Adopt AWS SageMaker and complementary services
DATA SCIENCE INFRASTRUCTURE
• Use AWS SageMaker to get access to the standard Python Data Science stack
  • Jupyter Notebooks
  • NumPy / Pandas / SciPy / etc.
  • Scikit-learn for initial ML efforts
• Benefits:
  • Serverless, on-demand infrastructure
  • Huge ecosystem of libraries
  • Defined workflow for deployment and continuous improvement
MACHINE LEARNING INFRASTRUCTURE
• Use AWS SageMaker to instantly get access to all cutting-edge ML stacks
  • Jupyter Notebooks
  • TensorFlow / PyTorch / XGBoost / etc.
• Use built-in AWS SageMaker features for labeling, training, and deploying models to live endpoints
• Benefits:
  • Serverless, on-demand infrastructure
  • Huge ecosystem of libraries
  • Defined workflow for deployment and continuous improvement
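As a minimal sketch of that training-to-endpoint workflow with the SageMaker Python SDK; the role ARN and S3 paths are placeholders, and the instance types and container version are illustrative:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Built-in XGBoost container; the version shown is illustrative
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.5-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",          # on-demand training hardware
    output_path="s3://my-bucket/models/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Launch training on data staged in S3, then deploy a live endpoint
estimator.fit({"train": TrainingInput("s3://my-bucket/train.csv",
                                      content_type="text/csv")})
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```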
• Use the new Data Science / ML infrastructure to improve automation
• Keyword Tagging: apply modern topic modeling and clustering techniques (see the sketch after this list)
• OCR: a custom solution trained on the available data corpus to achieve higher accuracy and recall
• Observation Extraction: apply deep-learning-based information extraction
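A minimal sketch of keyword tagging via clustering, using TF-IDF and k-means in scikit-learn; the documents and cluster count are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for the corpus to be tagged
docs = [
    "invoice payment overdue account",
    "server outage network latency",
    "payment refund account billing",
    "network router firmware outage",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cluster documents, then tag each cluster with its top TF-IDF terms
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for c in range(km.n_clusters):
    top = km.cluster_centers_[c].argsort()[-3:][::-1]
    print(f"cluster {c} keywords:", [terms[i] for i in top])
```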
REPORTING & METRICS
• Establish Metrics / KPIs with key stakeholders
• Expose them to stakeholders via a BI reporting tool (e.g., PowerBI, AWS QuickSight, Tableau)
• Benefits:
  • Serverless, on-demand infrastructure
  • Data democratization: give stakeholders a view into data science efforts
THE ADVANTAGES OF WORKING WITH BLUE ORANGE
Industry-Leading Data Architecture
Data science starts with data cleaning and preparation. We implement modern, scalable data architecture from the start to expedite ongoing analytics.

Clear Project Insight
We take the mystery out of data science implementation with clear project insight and non-technical project communication.
Access to Specialized Talent
Bring on a Ph.D. data scientist for two weeks to work on a predictive customer segmentation problem, without the hiring challenges or ongoing commitment.
Work-for-Hire
Leverage our decades of experience while building something lasting in-house. We work with existing technical teams or as a standalone resource.
Uiba offers Machine Learning for
Organizational Management to medium and
large-sized organizations. This platform
enables organizations to hire, allocate, and
develop their workforce in a manner designed
to maximize productivity, minimize cost, and
achieve optimal efficiency. Blue Orange
developed and designed the first version of
their platform.
Blue Orange was instrumental in helping our company achieve the early breakthroughs necessary to get Uiba where we are today. Josh and his team provided a great deal of insight beyond the blocking and tackling of development work, which helped us avoid unnecessary and costly mistakes as we responded to customer needs. Their work was always top notch and delivered on time.
- Jason Cowell, CEO of Uiba
Blue Orange Digital designed a cutting-edge hiring and recruiting platform using machine learning to optimize sourcing. We also used the data analysis tools to identify high-value applicants and optimize the candidate funnel.
Over the course of my career, I've worked with at least a dozen technology teams, and it is without question that the Blue Orange team stands above them all. It's not just that they believe in using frontier technologies or that the expectation is constant learning and improvement; their sense of product and the insights they provide enable a product to be truly usable and sticky. As a manager, perhaps most valuable is the level of transparency the team provides about progress and deadlines. With other tech teams, it can be excruciating to extract plans, in-depth updates, or explanations for issues as they arise. The Blue Orange team is a partner, collaborator, and leader.
- Lauren B., Executive at Point72 Asset Management
Trusted by Fortune 500s and Innovative AI Companies
Our Leadership
Josh Miramant, Chief Executive Officer
Colin Van Dyke, Chief Technology Officer
Dr. Uri Schonfeld, Lead Data Scientist
Our Employees Come From:
Trusted By Leading Brands:
A Data Science Agency
79 Madison Ave, New York, NY 10017
(530) 454-5830
blueorange.digital