Exciting Breakthrough in Data Science!
Ever wondered if you could build a powerful classification model with a staggering 99.3% accuracy in just 10 lines of code? 🤯 Well, it's not only possible but a reality now! We are not referring to low code but to an entirely new language.
Case Study Unveiling: Dive into our latest case study, where we accomplished this seemingly unbelievable task. Witness firsthand how our groundbreaking language, DSC, has revolutionized the landscape of data science and machine learning.
We are excited to introduce DSC, a novel SQL-like language designed to enhance the productivity of data scientists and ML engineers by 5 to 6 times. DSC caters to data analytics, graph generation, report generation, and both supervised and unsupervised machine learning.
APS Failure at Scania Trucks
Introduction
The dataset is divided into two parts: training and test data. The training data, stored in "aps_failure_training_set.csv", contains 60,000 rows and 171 columns. The test data, stored in "aps_failure_test_set.csv", consists of 16,000 rows and 171 columns. In both files, every column except "class" and "aa_000" contains missing values.
You can download both training and test data from the following link: https://www.kaggle.com/datasets/uciml/aps-failure-at-scania-trucks-data-set.
The dataset's purpose is to create a classification model, and the challenge metric is a cost of misclassification:
Total_cost = Cost_1 * No_Type1_Instances + Cost_2 * No_Type2_Instances
where Cost_1 = 10 and Cost_2 = 500. The total cost of a prediction model is the sum of Cost_1 multiplied by the number of instances with a Type 1 failure and Cost_2 multiplied by the number of instances with a Type 2 failure. Here, Cost_1 is the cost of an unnecessary check by a mechanic at a workshop (a false positive), while Cost_2 is the cost of missing a faulty truck, which may cause a breakdown (a false negative).
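Under these definitions, the challenge metric can be sketched in a few lines of Python (a minimal sketch; the function and variable names are ours, not part of the challenge):

```python
# Sketch of the challenge cost metric: Cost_1 = 10 per unnecessary check
# (Type 1 failure, a false positive) and Cost_2 = 500 per missed faulty
# truck (Type 2 failure, a false negative).
COST_1 = 10   # unnecessary workshop check
COST_2 = 500  # missed faulty truck

def total_cost(false_positives: int, false_negatives: int) -> int:
    """Total_cost = Cost_1 * No_Type1_Instances + Cost_2 * No_Type2_Instances."""
    return COST_1 * false_positives + COST_2 * false_negatives

print(total_cost(3, 2))  # 3*10 + 2*500 = 1030
```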
First Task: Profile Generation
Let us conduct exploratory data analysis, or data profiling, on both the training and test data. The profile of the training data should be stored in "aps_failure_training_profile.html" with the heading "APS Failure Training Data Profile". The profile of the test data should be stored in "aps_failure_test_profile.html" with the heading "APS Failure Test Data Profile".
open: aps_failure_training_set.csv
HTML PROFILE: APS Failure Training Data Profile; aps_failure_training_profile.html
openx: aps_failure_test_set.csv
HTML PROFILE: APS Failure Test Data Profile; aps_failure_test_profile.html
Download Links:
https://dsclang.com/download/aps_failure/aps_failure_training_profile.html
https://dsclang.com/download/aps_failure/aps_failure_test_profile.html
The complete details are provided in the profile. Below, we present a summary of the profile:
1) There are no duplicate rows in the training data and test data.
2) In the training data, there are 60,000 rows and 171 columns. Except for the columns "class" and "aa_000", all columns contain missing values. The percentage of missing values per column ranges from 0.28% to 82.11%.
3) In the test data, there are 16,000 rows and 171 columns. As in the training data, except for the columns "class" and "aa_000", all columns contain missing values. The percentage of missing values per column ranges from 0.18% to 82.06%.
4) Given the presence of missing values in 169 columns in both the training and test data, our second task is to impute them. Since each row describes a different truck, we employ iterative imputation to achieve high accuracy.
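The per-column missing-value percentages in the summary above can be reproduced outside DSC with pandas; the sketch below uses a tiny synthetic frame (the "ag_001" column and all values shown are illustrative, not taken from the real dataset):

```python
# Sketch (assumed pandas workflow): compute the kind of missing-value
# summary the DSC profile reports, on a small synthetic frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class": ["neg", "pos", "neg", "neg"],
    "aa_000": [76698, 33058, 41040, 12],
    "ag_001": [np.nan, 0.0, np.nan, 16.0],  # hypothetical sensor column
})

# Fraction of NaNs per column, expressed as a percentage.
missing_pct = df.isna().mean() * 100
print(missing_pct.round(2).to_dict())  # {'class': 0.0, 'aa_000': 0.0, 'ag_001': 50.0}
```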
Second Task: Impute Missing Values
As most of the columns in both training and test data have missing values, our second task is to impute missing values. Iterative imputation is often considered better than other imputation methods in certain scenarios due to its ability to handle complex relationships between variables and capture the underlying structure of the data more effectively. In this case, all incidents are independent of each other, so iterative imputation is likely to provide better results.
Details of the methods available for iterative imputation in DSC:
(1) ICE – Iterative imputation based on regularized linear regression.
(2) HyperImpute – Iterative imputation using both regression and classification methods based on XGBoost, CatBoost, and neural nets.
(3) MissForest – Iterative imputation based on Random Forests.
(4) xHyperImpute – Iterative imputation using both regression and classification methods based on linear models.
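For readers working outside DSC, scikit-learn's IterativeImputer offers a roughly analogous scheme (closest in spirit to ICE, since its default estimator is a regularized linear model). This is a sketch on synthetic data, not DSC's implementation:

```python
# Sketch: iterative imputation with scikit-learn's IterativeImputer,
# which models each feature with missing values as a function of the
# other features, iterating until the estimates stabilize.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of the values

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).any())  # False: no missing values remain
```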
The task is to impute missing values using the xHyperImpute method, ignoring the column "class", and to save the training data to the "aps_failure_training_imputed.csv" file and the test data to the "aps_failure_test_imputed.csv" file.
open: aps_failure_training_set.csv
ITERATIVE IMPUTER: xHyperImpute; class; aps_failure_training_imputed.csv
// Time Taken: 843.072 seconds
openx: aps_failure_test_set.csv
ITERATIVE IMPUTER: xHyperImpute; class; aps_failure_test_imputed.csv
Note: We are unable to share imputed values as the data license does not permit us to do so.
Third Task: Create Classification Model
Our third task is to create a classification model using an ensemble of various methods, with "class" as the target. While creating the model, we use 80% of the data for training and 20% for testing, using all columns except "class". We store the results in the "aps_failure_training_imputed.log" file and the model in the "aps_failure_training_imputed.pkl" file.
open: aps_failure_training_imputed.csv
ENSEMBLE CLASSIFICATION: class; ; 0.2; aps_failure_training_imputed.log; aps_failure_training_imputed.pkl
Syntax – ENSEMBLE CLASSIFICATION: (1) Target Column Name; (2) Comma-separated list of columns to be ignored; (3) Test Size (generally between 0.2 and 0.3); (4) Log File Name; (5) Model File Name; (6) Metric (optional): the default is Accuracy.
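As a rough analogue of the ENSEMBLE CLASSIFICATION command, the sketch below builds a soft-voting ensemble in scikit-learn with an 80/20 split and saves the fitted model to a .pkl file. The synthetic dataset and the choice of base estimators are our assumptions, not DSC's internals:

```python
# Sketch (assumed setup): train an ensemble on 80% of the data, score it
# on the held-out 20%, and persist the model, mirroring the command above.
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
], voting="soft")
ensemble.fit(X_train, y_train)

accuracy = accuracy_score(y_test, ensemble.predict(X_test))
with open("model.pkl", "wb") as f:  # analogous to the .pkl file DSC writes
    pickle.dump(ensemble, f)
print(round(accuracy, 3))
```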
Download Links:
Class mapping:
https://dsclang.com/download/aps_failure/aps_failure_training_imputed_mapping.pkl
log file:
https://dsclang.com/download/aps_failure/aps_failure_training_imputed.log
Model file:
https://dsclang.com/download/aps_failure/aps_failure_training_imputed.pkl
Note: DSC automatically identifies whether a problem is binary or multi-class classification.
The log file contains the complete details of the classification result. Below, we simply display the result of the final ensemble model training:
Top Model based on Accuracy:
Name: Ensemble
Accuracy: 0.99375
ROC AUC: 0.8648025494426695
Precision: 0.8959537572254336
Recall: 0.7311320754716981
F1 Score: 0.8051948051948052
The inferred problem type is "binary", as only two unique label values were observed:
{'neg': 0, 'pos': 1}
Confusion Matrix:
[[11770 18]
[ 57 155]]
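The reported metrics follow directly from this confusion matrix (rows are true classes, columns are predicted classes, with the neg/pos mapping above). The quick check below re-derives them:

```python
# Re-derive the reported metrics from the confusion matrix
# [[11770 18], [57 155]]: rows = true class, columns = predicted class.
tn, fp = 11770, 18   # true negatives, false positives
fn, tp = 57, 155     # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)       # reported: 0.99375
precision = tp / (tp + fp)                       # reported: 0.8960
recall = tp / (tp + fn)                          # reported: 0.7311
f1 = 2 * precision * recall / (precision + recall)  # reported: 0.8052

print(round(accuracy, 5), round(precision, 4), round(recall, 4), round(f1, 4))
```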
Fourth Task: Check Classification Model
The fourth task is to use the test file to check the accuracy of the classification model created in the third task. Let us divide this task into the following steps:
(1) Open the aps_failure_test_imputed.csv file.
(2) Predict from the ensemble classification model saved in the file "aps_failure_training_imputed.pkl", ignoring the column "class", and put the predicted class in the column "Predicted_class".
(3) Save the result to the aps_failure_test_result1.csv file.
(4) Open the aps_failure_test_result1.csv file.
(5) Compute the frequency based on the columns class and Predicted_class.
(6) Compute classification metrics based on the columns class and Predicted_class.
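The steps above can be sketched in Python with pandas and a pickled scikit-learn model. The tiny model and data here are stand-ins (DSC's PREDICT internals are not documented), but the open/predict/save/reopen/cross-tabulate flow is the same:

```python
# Sketch (assumed workflow): load a pickled model, predict while ignoring
# the target column, append a Predicted_class column, save, and reopen.
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for the model trained in the third task.
train = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 1, 0], "class": [0, 0, 1, 1]})
model = LogisticRegression().fit(train[["a", "b"]], train["class"])
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Steps 1-3: open the test data, predict ignoring "class", save the result.
test = pd.DataFrame({"a": [0, 3], "b": [1, 0], "class": [0, 1]})
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
test["Predicted_class"] = loaded.predict(test.drop(columns=["class"]))
test.to_csv("aps_failure_test_result1.csv", index=False)

# Steps 4-5: reopen the file and cross-tabulate actual vs. predicted classes.
result = pd.read_csv("aps_failure_test_result1.csv")
print(pd.crosstab(result["class"], result["Predicted_class"]))
```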
open: aps_failure_test_imputed.csv
PREDICT FROM QUICK ENSEMBLE CLASSIFICATION: aps_failure_training_imputed.pkl; class; Predicted_class
save: aps_failure_test_result1.csv
openx: aps_failure_test_result1.csv
compute frequency: class, Predicted_class
compute classification metrics: class; Predicted_class
Download Link:
https://dsclang.com/download/aps_failure/aps_test_check1_result.txt
Confusion Matrix:
[[15608 17]
[ 91 284]]
Metrics:
Accuracy: 0.99325
ROC AUC Score: 0.8781226666666667
Precision: 0.9435215946843853
Recall: 0.7573333333333333
F1 Score: 0.8402366863905324
Cost computed from the confusion matrix, applying Cost_1 = 10 to the 17 false positives (unnecessary checks) and Cost_2 = 500 to the 91 false negatives (missed faulty trucks):
Cost = 17*10 + 91*500 = 170 + 45500 = 45670
Most of this cost comes from the 91 missed faulty trucks; reducing false negatives, for example by tuning the decision threshold toward higher recall, would lower the total cost substantially.
Our offer:
We are offering the following to everyone who registers at our website www.dsclang.com:
• DSC Lite: The light version of DSC, which can handle 10,000 rows and 20 columns. This is free for a lifetime, although the row and column limits may change from time to time. It should be available by 22nd January 2023.
• eBook: This book will show how to use DSC effectively. It should be available by mid-February.
• DSC: DSC without any limitations, except the memory of your computer, available for 15 days.
• DSCI: DSC Interactive, for 15 days.
• Hundreds of case studies on supervised and unsupervised machine learning through our newsletter.
Please visit https://dsclang.com/register.php to register on our website.