Exciting Breakthrough in Data Science!

Panna Lal Patodia
6 min read · Jan 10, 2024


Ever wondered if you could build a powerful classification model with a staggering 99.3% accuracy in just 10 lines of code? 🤯 It is not only possible but a reality. We are not referring to low-code tooling but to an entirely new language.

Case Study: Dive into our latest case study, where we accomplished this seemingly unbelievable task. Witness firsthand how our groundbreaking language, DSC, has revolutionized the landscape of data science and machine learning.

We are excited to introduce DSC, a groundbreaking language designed to enhance the productivity of data scientists and ML engineers by five to six times. DSC is a novel SQL-like language that caters to data analytics, graph generation, report generation, and both supervised and unsupervised machine learning.

APS Failure at Scania Trucks

Introduction

The dataset is divided into two parts: training and test data. The training data, stored in “aps_failure_training_set.csv,” contains 60,000 rows and 171 columns; the test data, stored in “aps_failure_test_set.csv,” contains 16,000 rows and 171 columns. Except for the two columns “class” and “aa_000,” every column has missing values.

You can download both training and test data from the following link: https://www.kaggle.com/datasets/uciml/aps-failure-at-scania-trucks-data-set.

The dataset’s purpose is to create a classification model. The challenge metric is a cost of misclassification, where Cost_1 = 10 and Cost_2 = 500.

The total cost of a prediction model is the sum of “Cost_1” multiplied by the number of instances of type 1 failure and “Cost_2” multiplied by the number of instances of type 2 failure, resulting in a “Total_cost”.

Here, Cost_1 is the cost of an unnecessary check by a mechanic at a workshop, while Cost_2 is the cost of missing a faulty truck, which may cause a breakdown.

Total_cost = Cost_1*No_Instances_type_1 + Cost_2*No_Instances_type_2

First Task: Profile Generation

Let us conduct exploratory data analysis (data profiling) on both the training data and the test data. The profile of the training data should be stored in “aps_failure_training_profile.html” with the heading “APS Failure Training Data Profile”. The profile of the test data should be stored in “aps_failure_test_profile.html” with the heading “APS Failure Test Data Profile”.

open: aps_failure_training_set.csv
HTML PROFILE: APS Failure Training Data Profile; aps_failure_training_profile.html
openx: aps_failure_test_set.csv
HTML PROFILE: APS Failure Test Data Profile; aps_failure_test_profile.html
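DSC’s HTML PROFILE statement produces such a report in one line. For comparison, a minimal hand-rolled profiler in plain Python might look like the sketch below (this is our own illustration, not DSC’s implementation; the function name and demo data are ours):

```python
import pandas as pd

# Minimal stand-in for an HTML profiling report: per-column dtype,
# missing-value counts and percentages, and unique-value counts.
def html_profile(df: pd.DataFrame, title: str, out_path: str) -> pd.DataFrame:
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),
    })
    html = f"<h1>{title}</h1>\n" + summary.to_html()
    with open(out_path, "w") as f:
        f.write(html)
    return summary

# Tiny demo frame; real usage would pass
# pd.read_csv("aps_failure_training_set.csv") instead.
demo = pd.DataFrame({"class": ["neg", "pos", "neg"],
                     "aa_000": [1.0, None, 3.0]})
profile = html_profile(demo, "Demo Profile", "demo_profile.html")
print(profile.loc["aa_000", "missing"])  # 1
```

Richer reports of the kind linked below (correlations, distributions, duplicate detection) are typically generated with dedicated libraries such as ydata-profiling.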

Download Links:

https://dsclang.com/download/aps_failure/aps_failure_training_profile.html

https://dsclang.com/download/aps_failure/aps_failure_test_profile.html

The complete details are provided in the profile. Below, we present a summary of the profile:

1) There are no duplicate rows in the training data or the test data.

2) In the training data, there are 60,000 rows and 171 columns. Except for the columns “class” and “aa_000,” all columns contain missing values. The percentage of missing values ranges from 0.28% to 82.11%.

3) In the test data, there are 16,000 rows and 171 columns. As in the training data, except for the columns “class” and “aa_000,” all columns have missing values. The percentage of missing values ranges from 0.18% to 82.06%.

4) Given the presence of missing values in 169 columns in both the training and test data, our second task is to impute these missing values. Since all rows relate to different trucks, we employ “iterative imputation” to achieve high accuracy.

Second Task: Impute Missing Values

As most of the columns in both training and test data have missing values, our second task is to impute missing values. Iterative imputation is often considered better than other imputation methods in certain scenarios due to its ability to handle complex relationships between variables and capture the underlying structure of the data more effectively. In this case, all incidents are independent of each other, so iterative imputation is likely to provide better results.

Details of Methods available for Iterative Imputing in DSC:

(1) ICE — Iterative imputation method based on regularized linear regression.

(2) HyperImpute — Iterative imputer using both regression and classification methods based on XGBoost, CatBoost and neural nets.

(3) MissForest — Iterative imputation method based on Random Forests.

(4) xHyperImpute — Iterative imputer using both regression and classification methods based on linear models.

The task is to impute missing values using the xHyperImpute method, ignoring the column ‘class’, and to save the imputed training data to “aps_failure_training_imputed.csv” and the imputed test data to “aps_failure_test_imputed.csv”.

open: aps_failure_training_set.csv
ITERATIVE IMPUTER: xHyperImpute; class; aps_failure_training_imputed.csv
// Time Taken: 843.072 seconds
openx: aps_failure_test_set.csv
ITERATIVE IMPUTER: xHyperImpute; class; aps_failure_test_imputed.csv

Note: We are unable to share imputed values as the data license does not permit us to do so.
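For readers who want to try iterative imputation without DSC, scikit-learn’s IterativeImputer is close in spirit to the ICE method described above (regularized linear regression fitted column by column). This is a sketch with a tiny made-up frame, not DSC’s xHyperImpute:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Impute every feature column iteratively while leaving one column untouched,
# mirroring the "ignore column 'class'" behavior of the DSC script above.
def impute_ignoring(df: pd.DataFrame, ignore: str) -> pd.DataFrame:
    features = df.drop(columns=[ignore])
    imputer = IterativeImputer(max_iter=10, random_state=0)
    imputed = pd.DataFrame(imputer.fit_transform(features),
                           columns=features.columns, index=df.index)
    imputed[ignore] = df[ignore]
    return imputed

# Tiny demo; real usage would read the APS CSV files instead.
demo = pd.DataFrame({"class": ["neg", "pos", "neg", "pos"],
                     "a": [1.0, 2.0, np.nan, 4.0],
                     "b": [10.0, 20.0, 30.0, np.nan]})
result = impute_ignoring(demo, "class")
print(result.isna().sum().sum())  # 0
```

Note that the imputer must be fitted on the training data only and then applied to the test data if you want to avoid leaking test information into the imputation model.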

Third Task: Create Classification Model

Our third task is to create a classification model using an ensemble of various methods, where the target is ‘class’. While creating the model, we use 80% of the data for training and 20% for testing, using all columns except ‘class’. We store the results in the “aps_failure_training_imputed.log” file and the model in the “aps_failure_training_imputed.pkl” file.

open: aps_failure_training_imputed.csv
ENSEMBLE CLASSIFICATION: class; ; 0.2; aps_failure_training_imputed.log; aps_failure_training_imputed.pkl

Syntax — ENSEMBLE CLASSIFICATION: (1) Target Column Name; (2) Comma-separated list of columns to be ignored; (3) Test Size (generally between 0.2 and 0.3); (4) Log File Name; (5) Model File Name; (6) Metric (optional): the default is Accuracy.
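The same workflow — hold out 20% for testing, fit an ensemble, log the accuracy, pickle the model — can be sketched in plain scikit-learn. DSC’s actual ensemble members are not documented here, so the voting ensemble and synthetic data below are our own stand-ins:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imputed APS training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80/20 split, matching test size 0.2 in the DSC statement above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Soft-voting ensemble over two base classifiers.
model = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft")
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
with open("ensemble_model.pkl", "wb") as f:  # analogous to the .pkl file above
    pickle.dump(model, f)
print(round(acc, 3))
```

The pickled model can later be reloaded for prediction on unseen data, which is exactly what the fourth task below does with DSC.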

Download Links:

Class mapping:

https://dsclang.com/download/aps_failure/aps_failure_training_imputed_mapping.pkl

log file:

https://dsclang.com/download/aps_failure/aps_failure_training_imputed.log

Model file:

https://dsclang.com/download/aps_failure/aps_failure_training_imputed.pkl

Note: DSC automatically identifies whether it is a binary classification problem or a multi-classification problem.

The log file contains the complete details of the classification result. Below, we simply display the result of the final ensemble model training:

Top Model based on Accuracy:

Name: Ensemble

Accuracy: 0.99375

ROC AUC: 0.8648025494426695

Precision: 0.8959537572254336

Recall: 0.7311320754716981

F1 Score: 0.8051948051948052

The inferred problem type is ‘binary,’ as only two unique label values have been observed.

{‘neg’: 0, ‘pos’: 1}

Confusion Matrix:

[[11770    18]
 [   57   155]]
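As a sanity check, every metric above can be recomputed directly from this confusion matrix (rows = actual, columns = predicted, with ‘neg’ = 0 and ‘pos’ = 1):

```python
# Entries of the training confusion matrix reported above.
tn, fp, fn, tp = 11770, 18, 57, 155

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 11925 / 12000
precision = tp / (tp + fp)                   # 155 / 173
recall    = tp / (tp + fn)                   # 155 / 212
f1        = 2 * tp / (2 * tp + fp + fn)      # 310 / 385

print(round(accuracy, 5))   # 0.99375
print(round(precision, 4))  # 0.896
print(round(recall, 4))     # 0.7311
print(round(f1, 4))         # 0.8052
```

All four values match the log output, confirming how the matrix should be read.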

Classification Report

Fourth Task: Check Classification Model

The fourth task is to use the test file to check the accuracy of the classification model created in the third task. Let us divide this task into the following steps:

(1) Open aps_failure_test_imputed.csv file

(2) Predict from ensemble classification model by using the model saved in the file “aps_failure_training_imputed.pkl” by ignoring the column “class” and put the predicted class in the column “Predicted_class”.

(3) Save the result to aps_failure_test_result1.csv file.

(4) Open the aps_failure_test_result1.csv file.

(5) Compute frequency based on the columns class and Predicted_class.

(6) Compute classification metrics based on the columns class and Predicted_class.

open: aps_failure_test_imputed.csv
PREDICT FROM QUICK ENSEMBLE CLASSIFICATION: aps_failure_training_imputed.pkl; class; Predicted_class
save: aps_failure_test_result1.csv
openx: aps_failure_test_result1.csv
compute frequency: class, Predicted_class
compute classification metrics: class; Predicted_class
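The steps above can be sketched in plain scikit-learn as follows (file and column names mirror the DSC script, but the small model and synthetic data here are our own stand-ins):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imputed data and the pickled model.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
with open("model.pkl", "wb") as f:
    pickle.dump(RandomForestClassifier(random_state=1).fit(X_train, y_train), f)

# Step (2): load the saved model and predict the class for each row.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
predicted = model.predict(X_test)  # analogous to the "Predicted_class" column

# Steps (5)-(6): cross-tabulate actual vs. predicted, then report metrics.
cm = confusion_matrix(y_test, predicted)
print(cm)
print(classification_report(y_test, predicted))
```

Evaluating on a held-out test file like this is the honest check: the model never saw these rows during training, so the metrics estimate real-world performance.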

Download Link:

https://dsclang.com/download/aps_failure/aps_test_check1_result.txt

Confusion Matrix:

[[15608    17]
 [   91   284]]

Metrics:

Accuracy: 0.99325

ROC AUC Score: 0.8781226666666667

Precision: 0.9435215946843853

Recall: 0.7573333333333333

F1 Score: 0.8402366863905324

Classification Report

Cost computed from the confusion matrix: with rows as actual and columns as predicted (consistent with the precision and recall figures above), the model makes 17 type 1 failures (false positives, Cost_1 = 10 each) and 91 type 2 failures (false negatives, Cost_2 = 500 each):

Total_cost = 17*10 + 91*500 = 170 + 45,500 = 45,670

Our offer:

We are offering the following to everyone who registers at our website www.dsclang.com:

• DSC Lite: The lite version of DSC, which can handle 10,000 rows and 20 columns. This is free for a lifetime, though the limits on the number of rows and columns may change from time to time. It should be available by 22nd January 2024.

• eBook: This book will show how to use DSC effectively. It should be available by mid-February.

• DSC: DSC without any limitations, except the memory of your computer, available for 15 days.

• DSCI: DSC Interactive for 15 days.

• Hundreds of case studies on supervised and unsupervised machine learning through our newsletter.

Please visit https://dsclang.com/register.php to register on our website.

Written by Panna Lal Patodia

CEO of Patodia Infotech Private Limited.