R

Outliers-Part 1: Causes, Philosophy and General Rules

Contents: 1 What are Outliers? 2 Causes for Outliers 3 Types of Outliers 4 Philosophy about Finding Outliers 5 General Rules
Figure 0.1: Outliers
Four years ago (yes, back in 2016), a director of the data science department at a very famous IT company asked me about outliers. Basically, she asked two questions: What are outliers? How do you detect them? In my daily research life, I also encounter noisy data all the time.

Utilizing a Logistic Regression Model to Predict Credit Card Default

Contents: 1 Data 2 Income, Balance & Default 3 Model Selection 4 Diagnosis 5 Interesting Points 6 Model Cross-Validation 7 Parameter Selection 8 Conclusion
The logistic regression model is widely used for group classification. In education and social science, it has been used to classify students or individuals into different groups. In the finance industry, the logistic regression model is also quite useful for identifying or classifying an individual’s group status (i.e. Y) according to his/her other features (i.

Introducing Tidyverse-Part 2: %>%, the Forward Chaining

Figure 0.1: Pipe Operator
Instead of introducing the tidyr and dplyr packages, the two most essential R packages for data wrangling, I would like to insert a side topic that I think is worth mentioning for R programming efficiency as my 2nd Tidyverse blog post. To me, this important programming command has completely changed my view of programming and reshaped my programming habits since I started using it. This magic command is %>%, a.

Introducing Tidyverse-Part 1: Tidy Data

In an ideal world, a data analysis process is as simple as this: read in the data, select a suitable model to fit the data, obtain statistical estimates, and finally, interpret the analysis results. Sounds simple and straightforward, doesn’t it? But in reality, it’s often not that simple! Data is always messy, and oftentimes we need to clean our data before we can make any sense of it. Moreover, some researchers have found that more than 80% of data analysis is actually spent on data preparation or data manipulation (Dasu & Johnson, 2003), transforming the data into a usable format before you even think about analysis.

Load “rJava” package on R

Recently, some R users (myself included) have found that they are unable to load the ‘rJava’ package in R. This is a big headache for regular R users, since ‘rJava’ is a core dependency of most text-processing R packages (xlsx, tabulizer, Rwordseg, etc.). It is crucial to fix this issue and load the ‘rJava’ package in R successfully. Therefore, I’ve spent some time finding a way to make it work.

Parallel Analysis: Determining the Dimensionality of Data

Contents: 1 WHAT IS PARALLEL ANALYSIS 2 METHODOLOGY 3 PARALLEL ANALYSIS IN R
Recently, my colleague asked me to review a state assessment tech report. In the tech report, a section titled “Parallel Analysis” really caught my eye. I have done parallel analysis multiple times in the past. However, I had never thought about this topic in a systematic way. It is always good to have an opportunity to refresh my memory. Therefore, I am taking this tech report review as an opportunity to dig into parallel analysis.

Bad Habits in R programming from Nice-R-Code site

Recently, I prepared and refreshed my old programming talk materials. It was a talk I gave internally to all the summer interns in 2018. I found a very interesting and useful post on the web. This post gives a lot of good and bad programming examples. I thought it might be worthwhile to share the post content here so that people, including myself, could benefit from it in the future.