Azure OpenAI Document Extracts and Notes

Posted on July 27, 2023 by yung chou

OVERVIEW

Azure OpenAI is a service provided by Microsoft Azure that allows users to access OpenAI’s powerful language models, including the GPT-3, Codex, and Embeddings model series. Users can access the service through REST APIs, Python SDK, or a web-based interface in the Azure OpenAI Studio.
Azure OpenAI Service gives customers advanced language AI with OpenAI GPT-4, GPT-3, Codex, and DALL-E models, backed by the enterprise security and privacy of Azure.
Azure OpenAI co-develops the APIs with OpenAI, ensuring compatibility and a smooth transition from one to the other.
Azure OpenAI Infographic

Comparing Azure OpenAI and OpenAI

Azure OpenAI adds enterprise-grade security with role-based access control (RBAC) and private networks.
Essentially, the difference comes down to security, privacy, and trust.
Microsoft values a customer’s privacy and security of data. When using Azure AI services, Microsoft may collect and store data to improve the session experience and supportability of models. However, customer data is anonymized and aggregated to protect individual privacy.
Microsoft does not use customer data for fine-tuning or customizing models for individual users.
Microsoft Responsible AI Standard (PDF Download)
- The Responsible AI Standard is the product of a multi-year effort to define product development requirements for responsible AI.
- For providing feedback regarding Responsible AI at Microsoft

Responsible AI

Microsoft calls for building AI systems according to six principles, grouped here under four headings:
- Fairness and Inclusiveness
  - Make the same recommendations to everyone who has similar symptoms, financial circumstances, or professional qualifications.
- Reliability and Safety
  - Operate as originally designed, respond safely to unanticipated conditions, and resist harmful manipulation.
- Privacy and Security
  - Restrict access to resources and operations by user account or group.
  - Restrict incoming and outgoing network communications.
  - Encrypt data in transit and at rest.
  - Scan for vulnerabilities.
  - Apply and audit configuration policies.
  - Microsoft has also created two open-source packages that can enable further implementation of privacy and security principles: SmartNoise and Counterfit
- Transparency and Accountability
  - The model interpretability component provides multiple views into a model’s behavior: global explanations, local explanations, and model explanations.
  - The people who design and deploy AI systems must be accountable for how their systems operate.

SECURITY AND PRIVACY

Azure OpenAI Service automatically encrypts your data when it’s persisted to the cloud, using FIPS 140-2 compliant 256-bit AES encryption.
By default, Microsoft-managed encryption keys are used, but you also have the option to use customer-managed keys (CMK) for greater control over encryption key management.
The Files API allows customers to upload their training data stored in Azure Storage, within the same region as the resource and logically isolated with their Azure subscription and API Credentials. Uploaded files can be deleted by the user via the DELETE API operation.
With Azure OpenAI, customers get the security capabilities of Microsoft Azure while running the same models as OpenAI. Azure OpenAI offers private networking, regional availability, and responsible AI content filtering.
- Azure OpenAI Service contains neural multi-class classification models aimed at detecting and filtering harmful content; the models cover
  - four categories: hate, sexual, violence, and self-harm across
  - four severity levels: safe, low, medium, and high.
- By default, content filtering is set to the medium severity threshold for all four content harm categories, for both prompts and completions. That means content detected at severity level medium or high is filtered, while content detected at severity level low is not. The configurability feature is available in preview and allows customers to adjust the settings, separately for prompts and completions, to filter content for each content category at different severity levels.
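
The threshold rule described above can be sketched in a few lines of R. This is only an illustration of the comparison logic, not the service's implementation; the helper name is_filtered is hypothetical.

```r
# Severity levels in increasing order, as listed in the text.
severity_levels <- c("safe", "low", "medium", "high")

# Hypothetical helper: TRUE when content detected at `detected` severity
# would be filtered under `threshold` (the default is medium, per the text).
is_filtered <- function(detected, threshold = "medium") {
  match(detected, severity_levels) >= match(threshold, severity_levels)
}

is_filtered("low")                     # below the default threshold
is_filtered("medium")                  # at the default threshold
is_filtered("low", threshold = "low")  # a stricter, customized setting
```

With the default threshold, only the medium and high calls return TRUE; lowering the threshold to low makes the filter stricter.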

AZURE OPENAI MODELS

Azure OpenAI provides access to models with various capabilities. The following is a list of the models and their descriptions:

GPT-4 (8k/32k): A set of models that improve on GPT-3.5 and can understand as well as generate natural language and code.
- apply for access by filling out this form.
GPT-3 (4k/16k): A series of models that can understand and generate natural language. This includes the new ChatGPT model.
DALL-E: A series of models that can generate original images from natural language.
Codex: A series of models that can understand and generate code, including translating natural language to code.
Embeddings: A set of models that can understand and use embeddings. An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Currently, we offer three families of Embeddings models for different functionalities: similarity, text search, and code search.
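
To make the similarity idea concrete, here is a minimal sketch in R that compares embedding vectors with cosine similarity. The three vectors are made up and only four dimensions long; real embeddings come from the service and are much longer.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up 4-dimensional "embeddings" for illustration only.
emb_cat     <- c(0.9, 0.1, 0.0, 0.3)
emb_kitten  <- c(0.8, 0.2, 0.1, 0.4)
emb_invoice <- c(0.0, 0.9, 0.7, 0.1)

sim_close <- cosine_similarity(emb_cat, emb_kitten)   # semantically close pair
sim_far   <- cosine_similarity(emb_cat, emb_invoice)  # unrelated pair
sim_close > sim_far
```

Texts with similar meaning should yield vectors pointing in similar directions, so the first score is the higher of the two.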

AZURE OPENAI ON YOUR DATA

Azure OpenAI on your data enables the GPT-35-Turbo and GPT-4 models to provide responses based on your data. You can access Azure OpenAI on your data using a REST API or the web-based interface in the Azure OpenAI Studio to create a solution that connects to your data to enable an enhanced chat experience.

Per the document, Azure OpenAI on your data, Azure OpenAI Service supports the following file types:

File type	Extension
Text	.txt
Markdown	.md
HTML	.html
Word	.docx
PowerPoint	.pptx
PDF	.pdf
CSV	.csv
TSV	.tsv
Excel	.xlsx
JSON	.json
JSONL	.jsonl

QUICKSTART

Previous models were text-in and text-out, meaning they accepted a prompt string and returned a completion to append to the prompt. However, the GPT-35-Turbo and GPT-4 models are conversation-in and message-out.

TRAIN MODEL

TOKEN

Azure OpenAI processes text by breaking it down into tokens. Tokens can be words or just chunks of characters. For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”.

The total number of tokens processed in a given request depends on
- the length of your input,
- output and
- request parameters.

The quantity of tokens being processed will also affect your response latency and throughput for the models.
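
As a rough sketch of the arithmetic, the R snippet below estimates a request's token count and checks it against a context limit. It assumes the common rule of thumb of about four characters per token for English text and a limit of 8192 tokens for an 8k model; both are assumptions for illustration, and an actual tokenizer will give different counts. The function names are hypothetical.

```r
# Rough estimate only: about 4 characters per token (an assumed rule of thumb).
estimate_tokens <- function(text, chars_per_token = 4) {
  ceiling(nchar(text) / chars_per_token)
}

# Input tokens plus requested output tokens must fit the context limit.
# 8192 is an assumed limit for an 8k model.
fits_context <- function(prompt, max_output_tokens, context_limit = 8192) {
  estimate_tokens(prompt) + max_output_tokens <= context_limit
}

prompt <- "Summarize the following paragraph in one sentence."
estimate_tokens(prompt)
fits_context(prompt, max_output_tokens = 500)
```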

Azure OpenAI Pricing

Pricing will be based on the pay-as-you-go consumption model with a price per unit for each model, which is similar to other Azure AI Services pricing models.

Azure Service Availability

SLA: This describes Microsoft’s commitments for uptime and connectivity for Microsoft Online Services covering Azure, Dynamics 365, Office 365, and Intune.

Quota and Limits

PLAYGROUND

The system role, also known as the system message, is included at the beginning of the array. This message provides the initial instructions to the model. You can provide various information in the system role including:

A brief description of the assistant
Personality traits of the assistant
Instructions or rules you would like the assistant to follow
Data or information needed for the model, such as relevant questions from an FAQ

You can customize the system role for your use case or just include basic instructions. The system role/message is optional, but it’s recommended to at least include a basic one to get the best results.
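
The array of messages can be pictured as a list of role and content pairs. The sketch below builds such a structure in R for illustration only; the message texts are made up, and sending the array to the service is not shown.

```r
# A conversation as an ordered list of messages; the system message comes first.
messages <- list(
  list(role = "system",
       content = "You are an assistant that answers questions about Azure services."),
  list(role = "user",
       content = "What is Azure OpenAI?")
)

# Later turns are appended, so the model sees the whole conversation each time.
messages <- c(messages, list(
  list(role = "assistant", content = "(the model's earlier reply goes here)"),
  list(role = "user", content = "How do I get access?")
))

# The order of roles in the conversation so far.
roles <- vapply(messages, function(m) m$role, character(1))
roles
```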

Azure OpenAI Is Ready, Are You?

Posted on June 8, 2023 by yung chou

Azure OpenAI can be utilized for a wide range of tasks that cater to both business and technical requirements. It offers various capabilities, including but not limited to:

Content Generation: Azure OpenAI can generate high-quality and coherent text content for a variety of purposes, such as writing articles, product descriptions, marketing materials, and more. It can help automate content creation and save time and effort.

Summarization: With Azure OpenAI, you can extract key information and generate concise summaries from large volumes of text. This can be particularly useful for processing lengthy documents, news articles, research papers, or any content that requires distilling important points.

Semantic Search: Azure OpenAI enables semantic search capabilities, allowing you to perform more advanced and accurate searches based on the meaning and context of the query. This can improve search results by understanding the intent behind the search terms, resulting in more relevant and targeted information retrieval.

Natural Language to Code Translation: Azure OpenAI can assist in translating natural language queries or instructions into executable code. This feature can be helpful for developers and non-technical users alike, allowing them to express their requirements in plain language and receive code snippets or solutions that align with their intentions.

In summary, Azure OpenAI offers a powerful suite of tools for content generation, summarization, semantic search, and translating natural language to code. It empowers businesses and individuals to leverage advanced AI capabilities to automate tasks, enhance productivity, and unlock new possibilities in various domains.

Here’s how to start using Azure OpenAI services:

Task	Description
Accessing Azure OpenAI	To access Azure OpenAI, you need to create an Azure subscription and apply for access to the Azure OpenAI service by completing the form at https://aka.ms/oai/access
Azure OpenAI Studio	Azure OpenAI provides a web-based interface in the Azure OpenAI Studio to access OpenAI’s powerful language models including the GPT-3, Codex and Embeddings model series.
Python SDK	Azure OpenAI provides a Python SDK to access the service.
Quotas and Limits	Azure OpenAI has certain quotas and limits that apply to the service, such as the number of requests per second per deployment and the total number of training jobs per resource.
Business Continuity and Disaster Recovery (BCDR)	The Azure OpenAI documentation lists considerations for implementing BCDR with the service.

My presentation on 20191113

Posted on November 14, 2019 by yung chou

Four topics I talked about in this presentation:

VM preparation for migrating from on-premises to cloud
High Availability, Disaster Recovery, and Scalability
Azure Internet-of-Things (IoT) edge computing
Machine Learning (ML) application development

R Parallel Processing for Developing Ensemble Learning with SuperLearner

Posted on January 16, 2019 by yung chou

Ensemble learning, which includes multiple learners, i.e. Machine Learning algorithms, may take much longer than expected to develop. When using a search grid for parameter optimization to train an ensemble, depending on the included algorithms, the number of variables, and the corresponding iterations based on combinations of parameter settings and cross validation, it may take a while to produce results, assuming the run does not exhaust computing resources.

For me, parallel processing has become essential for working on a Machine Learning project. Speaking from my experience, while developing ensemble learning with SuperLearner, in one scenario a total of 112 learners were generated from a test grid, and the wait time was just too long to maintain productivity. Later I ran parallel processing on a single host to speed up the process. I used 3 CPUs of my i7 laptop with 16 GB RAM for small test runs and 15 vCPUs of an Azure D16 Series VM with 64 GB RAM, as shown above, for training stable models with a large amount of data. Notice that although multiple SuperLearner sessions can run concurrently on a host with multiple CPUs, within SuperLearner the process remains sequential (Ref: page 7, ‘parallel’, SuperLearner document dated Aug. 11, 2018). So using multiple CPUs should, and in my experience did, reduce the overall elapsed time roughly linearly (i.e. 3 CPUs cut the elapsed time to about 1/3).
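
The general idea of spreading independent work across CPUs can be tried with a toy task before touching an ensemble. The following is a minimal sketch using the parallel package that ships with R; slow_square is a made-up stand-in for an expensive computation, and the two-worker cluster size is an arbitrary choice.

```r
library(parallel)

# A made-up slow task standing in for an expensive model fit.
slow_square <- function(i) {
  Sys.sleep(0.2)
  i^2
}

cl <- makeCluster(2)  # two workers, an arbitrary choice for the sketch

# Same work, run sequentially and then spread across the workers.
time_seq <- system.time(res_seq <- lapply(1:4, slow_square))
time_par <- system.time(res_par <- parLapply(cl, 1:4, slow_square))

stopCluster(cl)

identical(res_seq, res_par)                 # same results either way
c(time_seq["elapsed"], time_par["elapsed"]) # compare wall-clock times
```

The results are the same in both cases; only the elapsed time differs, and how much it differs depends on the host and on the overhead of starting the workers.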

The following is one sample configuration for running parallel processing in R in a Windows environment, in which I employed SuperLearner for training an ensemble of ranger and xgboost. Prior to this point, I had already

prepared and partitioned the data for training and testing (x.train, y.train, x.test, and y.test) where y is the label,
configured the test grids (ranger.custom and xgboost.custom) with function names resolved by SuperLearner

Upon finishing, the code also saved the run-time image as an RDS object so that subsequent tasks could read in the image and eventually make predictions. Since SuperLearner does not have a built-in function to report the time for cross validation, I wrapped the cross validation part with system.time.

With the additional operation details in preparing and training an ensemble, this code is not a plug-and-play sample. If you are new to SuperLearner, I highly recommend first reviewing the package, parallel, and taking time to practice and experiment. On the other hand, if you have already had an ensemble model developed with SuperLearner, this sample code may be a template for converting existing training/model-fitting from sequential execution into a configuration for parallel processing. And stay tuned for my upcoming post, Part 2 of Predicting Hospital Readmissions with Ensemble Learning, with additional details on developing ensemble learning.

library(parallel)      # ships with R, so no installation is needed
library(SuperLearner)  # needed here for listWrappers() and CV.SuperLearner()

# Create a cluster using all but one of the available CPUs
cl <- makeCluster(detectCores() - 1)

# Export all references to cluster nodes: wrapper functions, data, and
# settings (nfold, family, nnls, and SL.algorithm were defined earlier)
clusterExport(cl, c(listWrappers(),
  'SuperLearner', 'CV.SuperLearner', 'predict.SuperLearner',
  'nfold', 'x.train', 'y.train', 'x.test', 'y.test', 'family', 'nnls',
  'SL.algorithm', ranger.custom$names, xgboost.custom$names
))

# Set a common seed for the cluster
clusterSetRNGStream(cl, iseed = 135)

# Load libraries on workers
clusterEvalQ(cl, {
  library(SuperLearner); library(caret)
  library(ranger); library(xgboost)
})

# Run training session on the cluster.
# Note: clusterEvalQ evaluates the expression on every worker, so each
# worker fits the ensemble and writes the file 'ensem.nnls'.
clusterEvalQ(cl, {
  ensem.nnls <- SuperLearner(Y = y.train, X = x.train,
    family = family, method = nnls, SL.library = SL.algorithm)
  saveRDS(ensem.nnls, 'ensem.nnls')
})

# Do cross validation in parallel and time it with system.time
system.time({
  ensem.nnls.cv <- CV.SuperLearner(Y = y.test, X = x.test,
    cvControl = list(V = nfold), parallel = cl,
    family = family, method = nnls, SL.library = SL.algorithm)
  saveRDS(ensem.nnls.cv, 'ensem.nnls.cv')
})

# Release the workers
stopCluster(cl)

Predicting House Price with Multiple Linear Regression

Posted on December 31, 2018 by yung chou

House Price Prediction

This project was to develop a Machine Learning model for predicting a house price. Although a number of tree-based algorithms are relevant to this application, the project examined linear regression and focused specifically on four models: Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net.

Overview
Data Analysis
Feature Selection
Data Visualization
Model Development
Model Comparisons
Closing Thoughts

Overview

In this article, “variable” as a general programming term and “feature” denoting a predictor employed in a Machine Learning model are used interchangeably. The following outlines my approach and highlights the logical steps I followed for developing a Machine Learning model. The development process was highly iterative, and the presented steps were not necessarily in the exact order. Nevertheless, these steps correctly depict the thought process and overall strategies for developing a Machine Learning model.

Data Set: The data set was downloaded from Kaggle House Prices: Advanced Regression Techniques. There were two files: train.csv with 1460 observations and 81 variables, and test.csv with 1459 observations and 80 variables.
Missingness: A few variables had a considerable amount of missing values, which made them essentially unusable, and they were removed from subsequent processing. Values missing at random were later imputed.
Character Variables: Factor variables were read in as character ones. Some character variables had several unique values. They were converted into ordinal variables and reduced to two or three levels for later imputing missing values and selecting features programmatically.
Numeric Variables: There were some extreme values among numeric variables in the train data set due to the way the values were captured. For instance, measures such as deck, porch, or pool range from 0 when not applicable to hundreds in square footage. When modeling these variables as predictors, those with large values might overwhelm and skew the model. These variables were reduced to just a few levels and converted to numbers that better reflect real-world scenarios. Above all, the strategies to select what to convert and determine how to convert it have much to do with the composition and distribution of the data. Often, the values of a variable are not as significant as the variance of those values.
Extreme Values and Outliers: Not all observations flagged by the 1.5 IQR rule were removed from the train data set. Some of these extreme values appeared characteristic and influential to some model configurations. In a few test runs, removing outliers or observations whose residuals had much leverage actually decreased Rsquared values. For a data set like the Kaggle House Price, the interactions among variables can be intricate since there are many variables. Making one change at a time, documenting the changes well, and backing up the settings often are the lessons I learned from handling outliers and extreme values in this project.
Imputation of Missing Values: Used the package, Multivariate Imputation by Chained Equations (mice), for imputing values programmatically throughout the development.
Feature Selection: Used the package, Boruta: Wrapper Algorithm for All Relevant Feature Selection, to initially select features. Subsequently, removed insignificant features from the model based on the significance level in test runs. This process was iterative and carried out along with model development. As test statistics confirmed the impact or importance of a feature, it was restored or removed accordingly. An example of running Boruta is available.
Near Zero-Variance Variables: A variable with little variance behaves like, and is essentially, a constant with values distributed near its mean. A constant-like or near zero-variance variable contributes little to a Machine Learning model since it has little correlation with the outcome, namely a prediction. With the package, Classification and Regression Training (caret), one can identify and process a near zero-variance variable programmatically.
Partitioning Data: Partitioned the train data set 70/30, with 70% for training and 30% for evaluating the model.
Cross-Validation: Used 10-fold cross-validation with 5 repetitions in all training.
Linear Model: Overall, simply including all variables in a linear model without interaction between variables could achieve an Rsquared value above 80%, while the model remained unstable. Adding relevant interaction variables noticeably improved the model and its stability. However, the model seemed to reach its limit in the current configuration when Rsquared neared 91%.
Ridge, Lasso and Elastic Net: Tried various combinations and ranges of lambda and alpha values to find sets of tuning parameters. This process was in some way experimental because the results were based on the combined effect of the seed value for randomness, the starting and end points of lambda and alpha, and the step size. In Elastic Net, although various settings resulted in various sets of tuning parameters, the overall Rsquared values of the elastic model remained stable.
Model Comparisons: Comparing the four models, Linear, Ridge, Lasso, and Elastic Net, showed Lasso was influential and largely adopted by Elastic Net in the developed model.
Predictions: The main objective of the project was to examine and analyze linear regression, not necessarily to engineer for a high Kaggle score. Submissions with predictions made by the Elastic Net model scored in the .014 range.
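
To illustrate what a near zero-variance check looks at, here is a hand-rolled sketch in base R. It follows the idea behind caret's nearZeroVar (the ratio of the most common value to the second most common, and the percentage of unique values), with cutoffs of 95/5 and 10 that I believe match caret's defaults. The function near_zero_var and the toy vectors are made up for this sketch and are not the code used in the project.

```r
# Hand-rolled approximation of a near zero-variance check (not caret itself).
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # a true constant
  freq_ratio <- as.numeric(counts[1] / counts[2])    # most common vs. runner-up
  percent_unique <- 100 * length(counts) / length(x) # share of distinct values
  freq_ratio > freq_cut && percent_unique < unique_cut
}

# Toy vectors: a pool area that is almost always 0, and a well-spread lot area.
pool_area <- c(rep(0, 98), 480, 512)
lot_area  <- seq(5000, 14900, by = 100)

near_zero_var(pool_area)  # flagged: dominated by a single value
near_zero_var(lot_area)   # not flagged: every value is distinct
```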

Data Analysis

Kaggle House Price Dataset

Downloaded and imported the train data set. Here’s some information by examining the structure and the summary.

[1] “Imported train data set: 1460 obs. of 81 variables”

Missingness

Next examined the distribution of missingness and the percentage of missing values. There were a few variables with most observations missing, which made these variables not usable and they were consequently removed. Here’s a visualization of missingness of the train data set.

Percentage of Missing Values

Further examination of the percentage of missing values of each variable revealed:
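
For reference, the percentage of missing values per variable can be computed in one line of base R. The sketch below uses a tiny made-up data frame; the column names follow the Kaggle data set, but the values are invented.

```r
# Tiny made-up data frame; NA marks a missing value.
toy <- data.frame(
  LotFrontage = c(65, NA, 80, NA),
  PoolQC      = c(NA, NA, NA, "Gd"),
  SalePrice   = c(208500, 181500, 223500, 140000)
)

# Percentage of missing values per variable, largest first.
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
sort(pct_missing, decreasing = TRUE)
```

Variables at the top of such a list, with most observations missing, are the candidates for removal.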

Feature Selection

Removed a set of variables at this time based on:

a large percentage of missing values which made a variable not usable
feature importance confirmed by Boruta
a consistent insignificant level of p-value as a predictor in test runs

Boruta

After converting all variables in the train dataset to integer or numeric fields and programmatically imputing the missing values, I ran Boruta to initially analyze the importance of variables. It took about 40 minutes in this context to iterate 500 times and produced something like the following results, where features in green had confirmed importance, while those in red were rejected, i.e. not important. The yellow ones were tentative, meaning they were not yet resolved before reaching the set number of iterations.

Stored the list of features confirmed by Boruta and subsequently removed the features not included in this list from the train dataset.

Features with Insignificant P-Values

While developing, fitting, and tuning the model, I documented a list of features that consistently had insignificant p-values, i.e. greater than 0.05, in test runs. Below is a snapshot of these features to be removed from the train dataset prior to executing a test run. Notice these features were not a unique set, and various development paths and configurations could and would result in a different set of features.

Character Variables

Factor variables were read in as character ones. Rather than converting into factor variables, they were converted into integer or numeric fields for later imputing data as well as deriving feature importance programmatically.

The above, for example, showed the variable, BldgType, was a character variable with five unique levels. It was converted into an ordinal one with values between 1 and 2. Notice that the process was iterative during data preparation and feature engineering. Both converting and combining variables were considered. Domain knowledge, subjectivity, and common sense were all relevant to the what and how to convert a variable, as applicable. The technique and strategies can and will vary from person to person and model to model.
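
A conversion of this kind can be sketched with a named lookup vector in R. The grouping below is hypothetical and only shows the mechanics; the level names are meant to resemble those in the Kaggle data, and the mapping actually used in the project is not reproduced here.

```r
# Sample values of a character variable with five levels.
bldg_type <- c("1Fam", "Duplex", "TwnhsE", "1Fam", "2fmCon", "Twnhs")

# Hypothetical mapping of the five levels onto two ordinal values.
bldg_map <- c("1Fam" = 2, "TwnhsE" = 2, "2fmCon" = 1, "Duplex" = 1, "Twnhs" = 1)

# Look up each value by name to get a numeric, ordinal variable.
bldg_ord <- unname(bldg_map[bldg_type])
bldg_ord
```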

Numeric Variables

For numeric variables, their values can produce unintended effects. For instance, assume modeling a house price as having a linear relationship with the month a house is sold. In such a case, a generalization is essentially built into the model: a house sold in December, with a value of 12, would contribute 12 times more to the response variable than one sold in January, with a value of 1. This configuration fundamentally does not correctly reflect the seasonality, nor the degree of impact on a house price based on the month a house is sold.

One alternative way of modeling seasonality is, as shown above, to convert the variable values to a scale between 0 and 1, where the summer months, i.e. July to September, carry the most weight toward the market house price, the response variable, and the winter months carry the least weight to signify the slow period.
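
As a minimal sketch of such a rescaling, the R snippet below maps the month sold onto weights between 0 and 1. The weights are hypothetical, chosen only so that July to September carry the most weight and the winter months the least; MoSold is assumed to hold the month as 1 to 12.

```r
# Hypothetical seasonal weights for January through December.
month_weight <- c(0.2, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0, 1.0, 1.0, 0.7, 0.4, 0.2)
names(month_weight) <- month.abb

# Month sold as 1 to 12, as in the original variable.
mo_sold <- c(1, 7, 12, 9)

# Replace each month number with its seasonal weight.
season_score <- unname(month_weight[mo_sold])
season_score
```

With this scale, January and December map to the same low weight instead of sitting at opposite ends of a 1-to-12 range.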

Later, this feature was removed from the final model due to insignificance consistently denoted by p-values in a series of test runs. Still, it was necessary to make the effort to prepare the data and convert this variable, from a January-to-December as 1-to-12 scale to a more meaningful and realistic one for describing real-world scenarios. With a proper scale of this and other similar variables, packages like mice could calculate meaningful values for imputation and Boruta for deriving feature importance.

Above all, the strategies to determine what and how to convert a variable have much to do with an examiner’s domain knowledge, subjectivity, and common sense in addition to reviewing the composition and distribution of the data.

The values of a variable sometimes do not tell the whole story. It may not be the values themselves, but the variance of those values, that plays the more influential role in making predictions.

Data Visualization

Up to this time, I had an initial set of features to start developing a model. Throughout the development, I would make changes to the feature set and observations based on diagnostics of the test results. The presented series of plots were generated along the development process.

Along the development, I produced multiple versions and configurations of the following plots. The set presented here is just one of the many.

Prepared Train Dataset

Here’s a snapshot of the prepared data set ready for Machine Learning development.

Distribution of the Label

The label, i.e. response variable, was SalePrice, here plotted without logarithm.

Label vs. Feature

To examine a feature relevant to the label, SalePrice, plotted each pair individually. The linearity among variables was obvious.

Pairs.Panels

Here’s a pairs.panels plot with all features and the label. This plot gives an overview of the linearity between variables and the variance of individual variables.

Correlation Matrix

These three plots: correlation matrix, label vs. feature, and pairs.panels were my main references for developing an initial model.

Partitioning Data

I partitioned the train dataset into a 70-30 split, with 70% for training and 30% for testing. Here is a set of plots produced by fitting the four regression models: Linear, Ridge, Lasso, and Elastic Net.
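
A 70-30 split can be sketched in base R as follows. The data frame is made up and the seed is arbitrary; plain random sampling is used here as a stand-in, which may differ from the partitioning function used in the project.

```r
set.seed(42)  # arbitrary seed so the split is repeatable

# Made-up data frame with 100 observations.
toy <- data.frame(id = 1:100, SalePrice = seq(100000, 298000, by = 2000))

# Randomly pick 70% of the row indices for training; the rest is for testing.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

c(train = nrow(train_set), test = nrow(test_set))
```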

1. Linear Model

Here’s a summary of lm for one of the runs. The adjusted R-squared was 0.9067 with insignificant features removed.

1.1 Diagnostic Plots

The diagnostic plots played an important role in the initial development. From the Residuals vs. Fitted plot, there seemed to be some nonlinearity. Many of the changes and adjustments were based on examining and interpreting these plots. In each iteration, I reviewed the plots and changed the composition of features and interactions, removed outliers or added back observations, etc., followed by more test runs. The process was highly iterative, and productivity relied much on good documentation to facilitate the analysis and restore a configuration when needed.

1.2 Variable Importance and Distribution of Residuals

1.3 Predicted vs. Observed

2. Ridge Regression

Set alpha=0 and a sequence for tuning lambda. I started from a wide range like 0.001 to 100 and gradually reduced the range to find a good window. The size of a step sometimes had a noticeable effect on the outcome. Much experimentation and repetition happened here.
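
The wide-then-narrow search can be sketched as two tuning grids in R. The data frames below have the alpha and lambda columns that caret's train() accepts as tuneGrid for the glmnet method; that tooling is an assumption on my part, and the narrow window is a hypothetical example rather than the range found in the project.

```r
# Wide first pass: lambda from 0.001 to 100 on a log scale, alpha fixed at 0.
wide_grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 2, length.out = 26))

# Narrow second pass over a hypothetical window, with a small linear step.
narrow_grid <- expand.grid(alpha = 0, lambda = seq(0.05, 0.5, by = 0.01))

range(wide_grid$lambda)
c(wide = nrow(wide_grid), narrow = nrow(narrow_grid))
```

A log-spaced first pass covers several orders of magnitude with few candidates; the linear second pass then scans the promising window with a finer step.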

2.1 Regularization

2.2 Variable Importance and Distribution of Residuals

2.3 Predicted vs. Observed

3. Lasso Regression

Set alpha=1 and a sequence for tuning lambda. As I did with Ridge Regression, I started from a wide range and gradually narrowed it to identify a good range and step to scan.
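
To show what alpha and lambda control, here is a small sketch of the penalty term in R. It assumes the glmnet parameterization, in which alpha mixes the ridge (squared) and lasso (absolute value) penalties and lambda scales the total; the coefficient vector is made up.

```r
# Elastic net penalty, assuming the glmnet parameterization:
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2, 0.5)  # made-up coefficients

ridge <- enet_penalty(beta, lambda = 0.1, alpha = 0)    # squared penalty only
lasso <- enet_penalty(beta, lambda = 0.1, alpha = 1)    # absolute penalty only
mixed <- enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # a blend of both
c(ridge = ridge, lasso = lasso, mixed = mixed)
```

Setting alpha to 0 or 1 recovers ridge or lasso as special cases, which is why the same tuning routine can serve all three models.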

3.1 Regularization

3.2 Variable Importance and Distribution of Residuals

3.3 Predicted vs. Observed

4. Elastic Net

Initially I set one sequence for tuning both alpha and lambda. This turned out not to be productive for me. Since in a configuration the two values were far apart from each other, the range for scanning would become relatively extensive, with a small step sometimes necessary to initially locate the values. A few times my laptop ran out of resources and simply stopped responding later in a run.

Setting an individual sequence for alpha and lambda was a more productive approach for me. Nevertheless, with the increased combinations and 10-fold cross validation, each run took longer, and it took a few iterations to narrow the ranges and locate the best set of alpha and lambda.
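
A quick count shows why separate sequences still add up. The sketch below uses hypothetical sequence lengths together with 10-fold cross validation and 5 repetitions; the product is a rough count of resampled evaluations, not a measurement from the project.

```r
# Hypothetical tuning sequences.
alpha_seq  <- seq(0, 1, by = 0.1)             # 11 candidate alpha values
lambda_seq <- 10^seq(-3, 0, length.out = 20)  # 20 candidate lambda values

# Every alpha is paired with every lambda.
grid <- expand.grid(alpha = alpha_seq, lambda = lambda_seq)

folds   <- 10
repeats <- 5

# Each combination is evaluated once per fold per repetition.
n_evaluations <- nrow(grid) * folds * repeats
n_evaluations
```

Halving the length of either sequence halves the total, which is one reason narrowing the ranges in stages pays off.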

4.1 Regularization

With this many features, overfitting would be likely, as these plots revealed.

4.2 Variable Importance and Distribution of Residuals

4.3 Predicted vs. Observed

Model Comparisons

Other than Ridge Regression, the other three models performed at very much the same level.

Summary of Models

Predicted vs. Observed

Placing all four models together, Elastic Net apparently favored Lasso Regression, and the patterns are almost identical. While Linear, Lasso, and Elastic Net all have a very similar pattern, the color nevertheless shows there were subtle differences in density.

With a baseline model in place, the fun has just started. Using test.csv, the submission file provided by Kaggle, the next step is to start fine-tuning and improving the model, then submit and score.

Closing Thoughts

Considering this model employed just multiple linear regression, I was surprised that the scores turned out to be higher than expected, based on the few submissions I have made. Linear regression is conceptually simple and relevant to many activities in our daily life. We all do linear regression in our mind when making a purchase. Is this expensive or cheap? Every time we ponder that thought, we are doing linear regression in some shape or form.

We must not, however, carelessly assume linear regression is as simple as it appears, as I have learned from my own mistake. There is much to investigate and learn from linear regression. Ordinary Least Squares (OLS), which linear regression is built upon, is too fundamental to overlook. The simplicity of OLS offers a clear strategy and enables Machine Learning algorithms to describe the combined effects of a set of predictors based on distance. The concept of residuals is simple, the approach straightforward, and the objective clear. Ultimately, we want to minimize the distance between what is observed and what is predicted. This distance is our cost or error function.
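
The idea of minimizing that distance can be checked on a toy example in R. The sketch below fits a line with lm on made-up data and confirms that nudging either coefficient away from the fitted value increases the residual sum of squares.

```r
# Made-up data that roughly follows a straight line.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Cost function: residual sum of squares for a given intercept and slope.
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Ordinary least squares fit.
fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]

rss_ols       <- rss(b0, b1)        # cost at the fitted coefficients
rss_slope     <- rss(b0, b1 + 0.1)  # cost after nudging the slope
rss_intercept <- rss(b0 - 0.1, b1)  # cost after nudging the intercept
c(ols = rss_ols, slope = rss_slope, intercept = rss_intercept)
```

Any nudge away from the fitted coefficients raises the cost, which is what it means for OLS to minimize the distance between observed and predicted values.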

There are a few options to continue the development. Tree-based models, ensemble learning, further refining and optimizing the data, more feature engineering, etc. are all applicable. With this many variables, a tree-based model should have a good story to tell, and that is what I plan to try next.

Predicting Hospital Readmissions with Machine Learning (Part 1): Data Preparation

Posted on December 12, 2018 by yung chou

Data Preparation of Diabetes Dataset

Due to web page limitation, this post has been moved to https://yungchou.github.io/site/

Feature Selection with Help from Boruta

Posted on November 19, 2018 by yung chou

Why

When developing a Machine Learning model, if there is a significant number of features to inspect, an initial and manual Exploratory Data Analysis may become tedious and nonproductive. One option is to facilitate the process by testing and identifying important variables based on statistical methods to help trim down features. And that is where Boruta comes into play.

What

A forest spirit in the Slavic mythology, Boruta (also called Leśny or Lešny) was portrayed as an imposing figure, with horns over the head, surrounded by packs of wolves and bears. Fortunately, in R Boruta is a helpful package for facilitating a feature selection process. Here’s a description from the documentation:

Boruta (CRAN) is an all relevant feature selection wrapper algorithm, capable of working with any classification method that output variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing original attributes’ importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilize that test.
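
The core idea in that description, comparing each attribute against permuted copies of the attributes, can be sketched in base R. This is a one-pass toy illustration and not Boruta itself: absolute correlation with the outcome stands in for the Random Forest importance, the data are simulated, and the repeated runs and statistical test are left out.

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 200

# Simulated data: one informative feature and one pure-noise feature.
signal <- rnorm(n)
noise  <- rnorm(n)
y <- 3 * signal + rnorm(n, sd = 0.5)
features <- data.frame(signal = signal, noise = noise)

# Permuted ("shadow") copies: same values, shuffled to break any link with y.
shadows <- as.data.frame(lapply(features, sample))
names(shadows) <- paste0("shadow_", names(features))

# Stand-in importance measure (swapped in for Random Forest importance).
importance <- function(col) abs(cor(col, y))

real_imp   <- sapply(features, importance)
shadow_max <- max(sapply(shadows, importance))

# A feature looks relevant when it beats the best shadow feature.
real_imp > shadow_max
```

The informative feature clears the bar set by the shadows, while the noise feature may or may not in a single pass, which is why repeating the comparison many times matters.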

How

The following is a sample routine in R demonstrating how I used Boruta to find a starting point for feature selection. Some notable settings include:

The input data was train.imp.
doTrace=2 logs the activities and shows progress on the console.
maxRuns is the maximum number of runs Boruta performs. In some circumstances (too short Boruta run, unfortunate mixing of shadow attributes, tricky dataset...), Boruta may leave some attributes Tentative. In my particular case, the first 100 runs (a good initial value to start with) confirmed most of the features, with a few remaining tentative. I then set it to 500 to finally resolve the 80 features I was interested in, which took about half an hour.
TentativeRoughFix performs a simplified, weaker test for judging such attributes. This function should be used with discretion, since this weak test can lower the confidence of the final results.
getSelectedAttributes does what it sounds like.
attStats keeps the statistics and the result of each resolved variable.

library(Boruta)

# Fix the seed so the run is repeatable
set.seed(1-0)
train.boruta <- Boruta(SalePrice~., data=train.imp, doTrace=2, maxRuns=500)

print(train.boruta)
plot(train.boruta, las=2, cex.axis=0.7, xlab='')
#plotImpHistory(train.boruta)

# Resolve any remaining Tentative attributes with the weaker test
train.boruta.fix <- TentativeRoughFix(train.boruta)
train.boruta.selected.features <- getSelectedAttributes(train.boruta.fix, withTentative = FALSE)

# Save the selected features; the boruta/ directory must already exist
saveRDS(train.boruta.selected.features, 'boruta/train.boruta.selected.features.rds')

# Save the per-attribute statistics and decisions
train.boruta.selected.features.stats <- attStats(train.boruta.fix)
saveRDS(train.boruta.selected.features.stats, 'boruta/train.boruta.selected.features.stats.rds')

Also included are plots of the Boruta output and attStats. Features confirmed as important are in green and rejected ones in red. Unresolved variables are in yellow and classified as tentative, meaning Boruta was not able to reach a conclusion about their importance. attStats keeps and reports the statistics associated with the decisions.
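For readers without the plots at hand, the decisions can also be inspected as a table. The following is a small sketch under assumptions: the built-in iris data stands in for train.imp, and the seed and maxRuns=50 are arbitrary choices to keep the run short. It relies on attStats returning one row per attribute with importance statistics and a decision column.

```r
library(Boruta)

# Illustrative only: iris stands in for train.imp.
set.seed(1)
iris.boruta <- Boruta(Species ~ ., data = iris, doTrace = 0, maxRuns = 50)

# One row per attribute: importance statistics plus the final decision.
stats <- attStats(iris.boruta)
print(stats[order(-stats$meanImp), c("meanImp", "normHits", "decision")])

# Count how many attributes landed in each decision class.
print(table(stats$decision))

# Names of any attributes still unresolved after maxRuns.
tentative <- rownames(stats)[stats$decision == "Tentative"]
print(tentative)
```

If the last vector is not empty, raising maxRuns or applying TentativeRoughFix, as in the routine above, are the two ways forward.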

Boruta uses the Random Forest algorithm to provide educated sets of important and not-so-important features. It not only saves time but also offers a repeatable and automatic way to conduct an initial exploratory data analysis.

Closing Thoughts

Feature selection is a critical task in developing a Machine Learning model. Extraneous features can introduce multicollinearity, increase variance, and lead to overfitting. Data is everything, and feature selection is just as critical. It is a task that can consume much of the model development time. For me, turning the routine into a code snippet and getting the mechanics in place helps me become productive much more quickly. A logical next step is to programmatically consume the Boruta output to build and train a preliminary Machine Learning model, possibly establishing a baseline for a target algorithm. Stay tuned for that.
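As a rough idea of what consuming the Boruta output could look like, here is a minimal base-R sketch. Everything specific in it is an assumption made for illustration: a hard-coded character vector stands in for the saved getSelectedAttributes result, mtcars stands in for the training data, and a plain linear model plays the role of the baseline.

```r
# Illustrative only: a hard-coded vector stands in for the saved
# getSelectedAttributes() output, and mtcars for the training data.
selected <- c("wt", "hp", "cyl")   # hypothetical selection
label <- "mpg"

# Build the formula 'mpg ~ wt + hp + cyl' from the character vector.
baseline.formula <- reformulate(selected, response = label)

# Fit a plain linear model as the baseline.
baseline.fit <- lm(baseline.formula, data = mtcars)

# In-sample RMSE as a first reference point for later models.
baseline.rmse <- sqrt(mean(residuals(baseline.fit)^2))
print(baseline.formula)
print(baseline.rmse)
```

In a real run the vector would come from readRDS on the saved file. Boruta also provides getConfirmedFormula, which builds a formula directly from the result object.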

Microsoft Cortana Intelligence Suite Workshop Video Tutorial Series (5/5): Predictive Web Service

Posted on September 1, 2017 by yung chou

The last part of this video tutorial series includes three exercises. First, Exercise 6 uses Power BI Desktop to import the summary data from the Spark cluster and create a drag-and-drop report to visualize the data. Exercise 7, the exciting part, deploys a sample web app and configures it to consume the predictive web service published in Exercise 1, followed by a few simple tests. Finally, Exercise 8 shows how to clean up the deployed resources of the workshop.

Here you start.

Microsoft Cortana Intelligence Workshop encompasses a set of processes and supporting tools to architect, construct, package and deploy a predictive analytics solution. It is a friendly platform with no hardware to purchase and no software to configure. The workshop ultimately deploys a web application with a predictive analytics service. The app predicts the total number and the probability of flight delays between two cities based on date, time, carrier and real-time forecast weather information. It is a relatively simple project, yet it includes all the essential components of a modern, intelligent application.

Microsoft Cortana Intelligence Suite Workshop Video Tutorial Series by Yung Chou

The workshop is intended to be delivered as a whole-day event with presentation sessions and lab time. On the other hand, within 75 minutes the above video tutorial series can also offer you an experience and guide you through all the screens and interactions to successfully deploy the web service.

The next step is to apply what you learned from this series to your work. Good luck.

Microsoft Cortana Intelligence Suite Workshop Video Tutorial Series (4/5): Azure Spark Cluster

Posted on September 1, 2017 by yung chou

The objective of Exercise 5 is to create a table, then store and prepare summary data for later visualization. You will find out it is simple and straightforward using a Spark notebook to interactively work on an Azure Spark cluster.

This video tutorial series presents live demonstrations of all the exercises to facilitate learning Microsoft Cortana Intelligence Suite. There are 5 parts:

Microsoft Cortana Intelligence Suite Workshop Video Tutorial Series (3/5): Azure Data Factory

Posted on August 30, 2017 by yung chou

Machine Learning, predictive analytics, web services and everything else that makes them happen are really about one thing: acquiring, processing and acting on data. For the workshop, this is done with a Data Factory pipeline configured to automatically upload a dataset to the storage account of a Spark cluster, where Azure Machine Learning is integrated to score the dataset. Importantly, this addresses a fundamental requirement of data-centric applications involving cloud computing: moving data securely, automatically and on demand between an on-premises location and a designated one in the cloud. For IT today, the cloud can be a source, a destination and a broker of data, and the ability to securely move data between an on-premises facility and a cloud destination is imperative for a hybrid cloud setting and for backup-and-restore scenarios. Azure Data Factory is a vehicle for achieving that.

The workshop video tutorial series is as listed below:

Specifically, Exercises 2-4 accomplish three things:

Creating an Azure Data Factory service and pairing it with a designated on-premises (file) server
Constructing an Azure Data Factory pipeline to automatically and securely move data from the designated on-premises server to a target Azure blob storage account
Enabling the developed Azure Machine Learning model to score the data provided by the Azure Data Factory pipeline

Notice that the lab VM also serves as an on-premises file server hosting a dataset to be uploaded to Azure. At one moment you may be using the lab VM as a workstation to access Azure remotely, and the next as an on-premises file server on which you install a gateway. When following the instructions, be mindful of where a task is carried out, as the context switching is not always apparent.
