Building a Logistic Model by using SAS Enterprise Guide

I am using the Titanic dataset, which contains a training and a test dataset. Here, we will try to predict the classification: survived or deceased. I am using SAS Enterprise Guide to analyze this dataset.

Setting the library path and importing the dataset using proc import

/* Importing dataset using proc import */
Proc import datafile = "C:/dev/projects/sas/pracdata/train.csv"

Checking the contents of the dataset by using proc contents

Our target variable is 'Survived', which takes the values 1 and 0.
Numeric variables: Passenger ID, SibSp, Parch, Survived, Age and Fare.
Text variables: Ticket and Name.

Checking the frequency of the target variable 'Survived' by using proc freq

/* Checking the frequency of the Target Variable Survived */

We can clearly see that 342 people survived and 549 people did not survive.

Normally, it is good practice to explore the data by using visualization. I am using proc sgplot to visualize the class and Embarked variables.

Title "Analysis of embarkation locations";
Label Embarked = "Passenger Embarking Port";

Nothing unusual can be seen in the value distributions. Let's analyze the survival rate against other variables.

Proc sgplot data=prac.titanic pctlevel=group;
Vbar sex / group=Survived stat=percent missing;

Here, we see a trend that more females survived than males. People who travelled in class 3 died the most. There are many more ways to visualize the data, but I am not going into detail.

Checking the missing values by using proc means

/* Checking the missing values and statistics of the dataset */
Proc means data=ain N Nmiss mean std min P1 P5 P10 P25 P50 P75 P90 P95 P99 max;

We can see that Age has 177 missing values and that no outliers are detected.

Title "Frequency tables for categorical variables in the training set";
Tables Survived;
Tables Sex;
Tables Pclass;
Tables SibSp;
Tables Parch;
Tables Embarked;
Tables Cabin;

We have missing values in Age, Embarked and Cabin. We need to fill in all missing ages instead of dropping the rows with missing values. To choose the fill values, we can check the average age by passenger class using a box plot. In SAS, we need to sort the data by the class and age variables before making the box plot.

/* Sorting by Pclass and Age for creating the box plot */

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute Age based on Pclass. Note that a missing numeric value in SAS is written as a bare period, not as the quoted string ".".

/* Imputing mean value for the Age column */
Else if age = . and Pclass = 2 then age = 29;
Else if age = . and Pclass = 3 then age = 24;

I have dropped the Cabin variable, as I don't see it impacting our model, and filled the missing values in 'Embarked' using the median. (I selected the median because it is a categorical variable.)

Splitting the dataset into training and validation by using the 70:30 ratio

First, I need to sort the data using proc sort and then split it by using proc surveyselect.

/* Splitting the dataset into training and validation using 70:30 ratio */
Proc sort data = ain6 out = train_sorted
Proc surveyselect data = train_sorted out = train_survey outall

In order to verify the data partition, I am generating a frequency table by using proc freq. Observations where the Selected variable has the value 1 form the training part. Let us also perform quick set processing in order to keep only the columns that interest us and to name the variables properly.

We have filled all our missing values, and our dataset is ready for building a model.

I am now creating a logistic regression model by using proc logistic. Logistic regression is well suited to building a model for a binary variable. In our case, the target variable is Survived.

Class Embarked Parch Pclass Sex SibSp Survived;
Model Survived(event='1') = Age Fare Embarked Parch Pclass Sex SibSp /

A good thing in SAS is that we don't need to create dummy variables for categorical variables. Here, we are able to declare all of our categorical variables in the class statement.
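The imputation, partition and model-fitting steps from this walkthrough can be sketched end to end as follows. This is only a minimal sketch under stated assumptions, not the exact code behind the post: work.titanic is a hypothetical name for the imported training data, the dataset names titanic_imputed and valid_scored, the variable Age_imputed and the seed are illustrative, the class means are computed rather than typed in by hand, and stratifying the split by Survived is my assumption about why the data is sorted first.

```sas
/* Assumption: work.titanic is the imported training data (hypothetical name). */

/* Impute missing Age with the mean Age of the passenger's Pclass. */
proc sql;
    create table work.titanic_imputed as
    select t.*,
           coalesce(t.Age, m.mean_age) as Age_imputed
    from work.titanic as t
         left join
         (select Pclass, mean(Age) as mean_age
          from work.titanic
          group by Pclass) as m
         on t.Pclass = m.Pclass;
quit;

/* STRATA in PROC SURVEYSELECT expects input sorted by the strata variable. */
proc sort data=work.titanic_imputed out=train_sorted;
    by Survived;
run;

/* Select about 70% of each stratum. OUTALL keeps every row and adds
   Selected = 1 (training) or 0 (validation). The seed is illustrative. */
proc surveyselect data=train_sorted out=train_survey
                  method=srs samprate=0.7 seed=12345 outall;
    strata Survived;
run;

/* Verify the partition: row counts by Selected and Survived. */
proc freq data=train_survey;
    tables Selected*Survived;
run;

/* Fit on the training rows only. CLASS does the dummy coding, so no
   manual dummy variables are needed. Rows with a missing predictor
   (for example a blank Embarked) are excluded by default. */
proc logistic data=train_survey(where=(Selected=1));
    class Embarked Parch Pclass Sex SibSp / param=ref;
    model Survived(event='1') = Age_imputed Fare Embarked Parch Pclass Sex SibSp;
    /* Score the held-out 30% with the fitted model. */
    score data=train_survey(where=(Selected=0)) out=valid_scored;
run;
```

In this sketch the response variable is left out of the CLASS statement, since the event='1' option on the MODEL statement already tells proc logistic which level of Survived to model.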