Raw data

Download raw data

Visualization before data cleaning

Note: Columns in which most values are missing were removed. Missing values in the 'state' column were filled in from common knowledge. Missing values in the 'make_the_world_better_percent' column were replaced with the average of the other rows, since only a few values are missing.
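
The three cleaning steps in the note can be sketched in plain Python. This is only an illustration, not the R code used for the record data: the sample rows, the 'name' and 'notes' columns, the 50% missing threshold, and the lookup table are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"name": "A", "state": "CA", "make_the_world_better_percent": 60.0, "notes": None},
    {"name": "B", "state": None, "make_the_world_better_percent": None, "notes": None},
    {"name": "C", "state": "NY", "make_the_world_better_percent": 50.0, "notes": "x"},
    {"name": "D", "state": "TX", "make_the_world_better_percent": 40.0, "notes": None},
]


def drop_sparse_columns(rows, max_missing=0.5):
    """Remove columns whose share of missing values exceeds max_missing."""
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing
    ]
    return [{c: r[c] for c in keep} for r in rows]


def fill_from_lookup(rows, column, key, lookup):
    """Fill missing values in `column` from a hand-written lookup keyed on `key`."""
    filled = []
    for r in rows:
        value = r[column]
        if value is None:
            value = lookup.get(r[key])  # stays None if the key is unknown
        filled.append({**r, column: value})
    return filled


def fill_with_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed values."""
    avg = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: avg if r[column] is None else r[column]} for r in rows]


cleaned = drop_sparse_columns(rows)  # 'notes' is missing in 3 of 4 rows
cleaned = fill_from_lookup(cleaned, "state", "name", {"B": "WA"})  # assumed lookup
cleaned = fill_with_mean(cleaned, "make_the_world_better_percent")
```

Mean imputation is reasonable here only because few values are missing; with many gaps it would shrink the column's variance noticeably.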

Check potential outliers

Note: In the first graph, some values for private schools fall outside the normal range. However, some universities are genuinely hard to get into, so very low acceptance rates are plausible rather than erroneous. The second graph shows no potential outliers.
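
A numeric check can complement the visual inspection. The sketch below assumes the common 1.5 × IQR rule that box plots use to mark outliers; the page does not state which rule the graphs rely on, and the acceptance rates are invented sample values.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical acceptance rates (percent) for private schools.
private_rates = [5, 40, 45, 50, 55, 60, 65, 70]
flagged = iqr_outliers(private_rates)
```

A flagged value is only a candidate for review. As the note says, a very selective school can legitimately have a low acceptance rate, so such rows are inspected rather than dropped automatically.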

Datasets after cleaning

Note: Image 4 shows a new dataset created by combining the other three datasets.

Text data after data cleaning

Visualization after data cleaning

Note: The 'state' and 'make_the_world_better_percent' columns are affected by data cleaning.

R code (record data)

Python code (text data)