
Notes on data - Week 1

Let's get started.

Anaconda

Anaconda can be thought of as an environment manager: it lets you run different Python versions in isolated environments.

conda - A command-line utility for package and environment management.

upgrade

conda upgrade conda
conda upgrade --all

add package

conda install PACKAGE_NAME
conda remove PACKAGE_NAME
conda update PACKAGE_NAME

managing environments

The actual environments used to run Python.

conda create -n env_name [python=X.X] [LIST_OF_PACKAGES]

For example, to create an environment named my_env with Python 3.7, and install NumPy and Keras in it, use the command below.

conda create -n my_env python=3.7 numpy Keras
conda create -n py3_env python=3
conda create -n py2_env python=2

entering (activating) an environment

# For conda 4.6 and later versions on Linux/macOS/Windows, use
conda activate my_env
# For conda versions prior to 4.6 on Linux/macOS, use
source activate my_env
# For conda versions prior to 4.6 on Windows, use
activate my_env

# For conda 4.6 and later versions on Linux/macOS/Windows, use
conda deactivate
# For conda versions prior to 4.6 on Linux/macOS, use
source deactivate
# For conda versions prior to 4.6 on Windows, use
deactivate

The idea is that create sets up an environment once; you can activate and enter it whenever you like afterwards. It is not one-time-use.

Sharing environments

Export the current environment (its packages and versions) to a YAML file, environment.yaml, which you can later share with other users over GitHub or other means.
This file will get created (or overwritten) in your current directory.

conda env export > environment.yaml

To create an environment from an environment file

conda env create -f environment.yaml

If you forget what your environment is called:

conda env list
# or
conda info --envs

list the packages inside an environment

# If the environment is not activated
conda list -n env_name

# If the environment is activated
conda list

# To see if a specific package, say `scipy` is installed in an environment
conda list -n env_name scipy

removing an environment

conda env remove -n env_name



Jupyter Notebooks

The notebook is a web application that allows you to combine explanatory text, math equations, code, and visualizations all in one easily sharable document.
I think it's pretty powerful!

install

conda install jupyter notebook

run

jupyter notebook

add package

conda install nb_conda

Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a utility called nbconvert for converting to HTML, Markdown, slideshows, etc.
The general syntax to convert a given mynotebook.ipynb file to another FORMAT is:

jupyter nbconvert --to FORMAT mynotebook.ipynb

The currently supported output FORMAT can be any of the following (case-insensitive):

  • HTML,
  • LaTeX,
  • PDF,
  • WebPDF,
  • Reveal.js HTML slideshow,
  • Markdown,
  • AsciiDoc,
  • reStructuredText,
  • executable script,
  • notebook.

For example, to convert a notebook to an HTML file, in your terminal use

# Install the package below, if not already
pip install nbconvert
jupyter nbconvert --to html mynotebook.ipynb

Exporting:

A commonly used format, PDF, requires installing extra packages; see the documentation.
Another common one: configure slides under View → Cell Toolbar → Slideshow, then export an HTML presentation.

jupyter nbconvert presentation.ipynb --to slides

Adding the toggle template file from the appendix lets you hide code cells. It should be placed in the same folder as the ipynb file.

jupyter nbconvert Your_Slide_Deck_Name.ipynb --to slides --post serve --template output_toggle

By the way, this can also be run from inside the ipynb file:

!jupyter nbconvert Your_Slide_Deck_Name.ipynb --to slides --post serve --template output_toggle



Reading CSVs with Jupyter Notebooks and pandas

Read

read_csv() is used to load data from csv files into a Pandas dataframe.

import pandas as pd

df = pd.read_csv('student_scores.csv')

CSV stands for comma separated values - but they can actually be separated by different characters, tabs, white space, etc.
When the CSV's format differs from the default, redefine the separator.

df = pd.read_csv('student_scores.csv', sep=':')
df.head()

Another thing you can do with read_csv is specify which line of the file is the header, which specifies the column labels.

df = pd.read_csv('student_scores.csv', header=2)
df.head()

If column labels are not included in your file, use header=None to prevent your first line of data from being misinterpreted as column labels.

df = pd.read_csv('student_scores.csv', header=None)
df.head()

Changing column names: specify your own column labels.

labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', names=labels)
df.head()

If you want to tell pandas that there was a header line that you are replacing, specify the row of that line like this.

labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', header=0, names=labels)
df.head()

Instead of using the default index (integers incrementing by 1 from 0), specify one or more of your columns to be the index of your dataframe.

df = pd.read_csv('student_scores.csv', index_col='Name')
df.head()
df = pd.read_csv('student_scores.csv', index_col=['Name', 'ID'])
df.head()


Read formats

More helpful methods for assessing and building intuition about a dataset.

import pandas as pd

df = pd.read_csv('cancer_data.csv')
df.head()
# this returns a tuple of the dimensions of the dataframe
# i.e., the number of rows and the number of columns
df.shape

You can also count rows and columns like this.

rows = len(df.axes[0])
cols = len(df.axes[1])

print("Number of Rows: " + str(rows))
print("Number of Columns: " + str(cols))

Selecting rows based on the values in a column.

# sales on march 13, 2016
df[df['week'] == '2016-03-13']

# worst week for store C
df[df['storeC'] == df['storeC'].min()]
# this returns the datatypes of the columns
df.dtypes
# although the datatype for diagnosis appears to be object, further
# investigation shows it's a string
type(df['diagnosis'][0])

Pandas actually stores pointers to strings in dataframes and series, which is why object instead of str appears as the datatype.

# this displays a concise summary of the dataframe,
# including the number of non-null values in each column
df.info()

Check-for-null function: isnull()
doc

# view missing value count for each feature in 2008
df_08.isnull().sum()
# checks if any of columns in 2008 have null values - should print False
df_08.isnull().sum().any()
# this returns the number of unique values in each column
df.nunique()

For a single column, use df.education.nunique() (where education is the column name).
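
For instance, a minimal sketch assuming the dataframe has an education column:

# number of unique values in a single column
df['education'].nunique()
# attribute access works too when the column name is a valid identifier
df.education.nunique()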



Describe

# this returns useful descriptive statistics for each column of data
df.describe()

Returns the count, mean, std, min, 25%, 50%, 75%, and max for each numeric column.



Median

# get the median amount of alcohol content
df.alcohol.median()
# this returns the first few lines in our dataframe
# by default, it returns the first five
df.head()

# although, you can specify however many rows you'd like returned
df.head(20)

# same thing applies to `.tail()` which returns the last few rows
df.tail(2)
# View the index number and label for each column
for i, v in enumerate(df.columns):
    print(i, v)


Loc and iloc

We can select data using loc and iloc, which you can read more about here.

We can use .loc, .iloc, and .ix to select rows in the DataFrame.

loc works with the labels of the index: using loc[4] means we are looking for the rows of the DataFrame whose index label is 4.
iloc works with the positions in the index: using iloc[4] means we are looking for the row at position 4.
ix is a more complex case: if the index is integer-based, ix takes a label, so ix[4] means the rows whose index label is 4; if the index is not purely integer-based, ix falls back to positions like iloc. (Note that ix is deprecated in recent pandas versions; prefer loc and iloc, as in the sketch below.)
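
A minimal sketch of label- versus position-based row selection (assuming an integer-labeled index):

# by index label: the row whose label is 4
df.loc[4]
# by position: the fifth row, whatever its label is
df.iloc[4]

# slicing differs too: loc includes the end label, iloc excludes the end position
df.loc[0:4]    # rows labeled 0 through 4 (inclusive)
df.iloc[0:4]   # rows at positions 0 through 3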

# select all the columns from 'id' to the last mean column
df_means = df.loc[:,'id':'fractal_dimension_mean']
df_means.head()

# repeat the step above using index numbers
df_means = df.iloc[:,:12]
df_means.head()

Computing the sum over a slice of rows.

# total sales for the last month
df.iloc[196:, 1:].sum()

# total sales during most recent 3 month period
last_three_months = df[df['week'] >= '2017-12-01']
last_three_months.iloc[:, 1:].sum() # exclude sum of week column

One more note about sum: the approach above would also sum up categorical data, which is awkward.
Instead, we can select the rows that satisfy a condition and count them to get the total we want.

print(len(admits[admits['gender'] == 'female']))

This is the number of rows in the admits CSV whose gender column is female.

len(admits[(admits['gender']=='female') & (admits['admitted'])])

For two conditions, join them with & (and).



Clean data

Common problems:

Missing data (null)

We can spot missing values by comparing the counts in df.info().
Consider why the data is missing and whether it is missing at random; one common fix is to fill with the column mean.

mean = df['name_of_column'].mean()
df['name_of_column'] = df['name_of_column'].fillna(mean)  # assign back to the column
# or fill every numeric column with its own mean
df.fillna(df.mean(), inplace=True)

Check data

This can also be used for visualization.

# check value counts for the 2008 cyl column
df_08['cyl'].value_counts()

Note: value_counts lists every distinct value in a column along with how many times it appears.
It returns a Series of the unique values and their counts.

Duplicate

Check for duplicates.

df.duplicated()
# or we can count it
sum(df.duplicated())

#drop duplicate
df.drop_duplicates(inplace=True)

When rows may share the same id but differ in their other columns, check for duplicates on the id column.
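
A minimal sketch, assuming the dataframe has an id column:

# count rows whose id already appeared earlier in the dataframe
sum(df.duplicated(subset=['id']))

# keep only the first occurrence of each id
df.drop_duplicates(subset=['id'], inplace=True)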

Drop extraneous columns

Other uses of drop: doc

# drop columns from 2008 dataset
df_08.drop(['Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'], axis=1, inplace=True)

doc: dropping missing values

# Drop the rows where at least one element is missing.
df.dropna(inplace=True)
# Drop the rows where all elements are missing.
df.dropna(how='all')

Incorrect data type

When the timestamp format is wrong (stored as a string):

df['timestamp'] = pd.to_datetime(df['timestamp'])

Converting ints from strings.

# Extract the int from strings in the 2008 cyl column
df_08['cyl'] = df_08['cyl'].str.extract(r'(\d+)').astype(int)

# convert 2018 cyl column to int
# converting to int drops the decimal point (4.0 -> 4)
df_18['cyl'] = df_18['cyl'].astype(int)
# to convert to float instead, use astype(float)

Renaming columns

This example also involves combining (merging) data.

# The most basic form: a one-to-one rename
df_clean = df_clean.rename(columns={'oldName1': 'newName1',
                                    'oldName2': 'newName2'})

# Build new labels with append to rename columns that share a pattern:
# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if '_mean' in col:
        new_labels.append(col[:-5])  # exclude the last 5 characters ('_mean')
    else:
        new_labels.append(col)

# new labels for our columns
new_labels

# replace spaces with underscores and lowercase labels for 2008 dataset
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

# another rename example: append "2008" to each column name
# (the slice keeps at most the first 10 characters of the original name)
df_08.rename(columns=lambda x: x[:10] + "2008", inplace=True)

# confirm column labels for 2008 and 2018 datasets are identical
# (when merging two dataframes, rename first, then confirm the column names match)
df_08.columns == df_18.columns

# make sure they're all identical like this
(df_08.columns == df_18.columns).all()
# assign new labels to columns in dataframe
df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()
# save this for later
df.to_csv('cancer_data_edited.csv', index=False)

Splitting one column into two columns

In this example, the values are formatted either as 'num_1/num_2' or as 'num_1'.

# Get all the hybrids in 2008
hb_08 = df_08[df_08['fuel'].str.contains('/')]
# Hybrids in 2018
hb_18 = df_18[df_18['fuel'].str.contains('/')]

df1 = hb_08.copy() # data on first fuel type of each hybrid vehicle
df2 = hb_08.copy() # data on second fuel type of each hybrid vehicle

Then use pandas' apply function.

str.contains

# columns to split by "/"
split_columns = ['fuel', 'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg', 'greenhouse_gas_score']
# apply the split function to each column of each dataframe copy
for c in split_columns:
    df1[c] = df1[c].apply(lambda x: x.split("/")[0])
    df2[c] = df2[c].apply(lambda x: x.split("/")[1])
# combine dataframes to add to the original dataframe
new_rows = df1.append(df2)
# drop the original hybrid rows
df_08.drop(hb_08.index, inplace=True)
# add in our newly separated rows
df_08 = df_08.append(new_rows, ignore_index=True)
# check that all the original hybrid rows with "/"s are gone
df_08[df_08['fuel'].str.contains('/')]

Visualizing data

documentation
using matplotlib.

import matplotlib.pyplot as plt
%matplotlib inline
df.hist(figsize=(8,8));
df['name_of_column'].hist();
df['name_of_column'].plot(kind='hist');

Counting each distinct value; bar and pie charts

df['name_of_column'].value_counts()

df['name_of_column'].value_counts().plot(kind='bar') # create bar chart

df['name_of_column'].value_counts().plot(kind='pie', figsize=(8,8))
pd.plotting.scatter_matrix(df, figsize=(15,15));

df.plot(x='x-axis', y='y-axis', kind='scatter');

df['name_of_column'].plot(kind='box');

Appending and numpy

Combining two CSVs into one for analysis.

NumPy: since NumPy is implemented in C, it is faster than pure Python.

import numpy as np

We use NumPy to work with multi-dimensional arrays of values.

Aside:

# generate an array of 100 million random floats between zero and one
a = np.random.random(int(1e8))
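
A rough illustration of why this matters (a sketch reusing the array a and the np import from above; exact timings vary by machine):

import time

# pure-Python sum loops over the 100 million elements in the interpreter
start = time.time()
sum(a)
print('built-in sum:', time.time() - start, 'seconds')

# np.sum is a single vectorized call into compiled C code and is far faster
start = time.time()
np.sum(a)
print('np.sum:', time.time() - start, 'seconds')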

Create columns

using numpy repeat

color_red = np.repeat('red', red_df.shape[0])

Here, shape[0] gives the number of rows, so we create one 'red' label per row.

Then we add the array to the original dataframe as a new column.

red_df['color'] = color_red

Use pandas' append to combine the two dataframes.

wine_df = red_df.append(white_df)

Save the combined dataframe.

wine_df.to_csv('winequality_edited.csv', index=False)

Troubleshooting appending: if the column names don't match, the combined data will be misaligned.

We rename the columns so they match:

df = df.rename(columns={'two': 'new_name'})

Group by

Use pandas' groupby to split the data by one column and then examine the other columns within each group.
intro
doc

# mean of each column for each quality level
wine_df.groupby('quality').mean()

# group by two columns: first quality, then color (looking only at pH)
wine_df.groupby(['quality', 'color'], as_index=False)['pH'].mean()

# use groupby to create a new dataframe with selected columns
model_mpg = df.groupby('model').mean()[['cmb_mpg_2008', 'cmb_mpg']]


Cut

From pandas: we use one column's values as the criterion for cutting the data into bins.
doc

In this example, we would like to cut pH into four acidity levels:

High: Lowest 25% of pH values
Moderately High: 25% - 50% of pH values
Medium: 50% - 75% of pH values
Low: 75% - max pH value

We get these cut points from df.describe().pH and manually plug them into the cut function:

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64

# Bin edges that will be used to "cut" the data into groups
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]

# Labels for the four acidity level groups
bin_names = ['high', 'mod_high', 'medium', 'low'] # Name each acidity level category

# Creates acidity_levels column
df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names)

# Checks for successful creation of this column
df.head()



Query

from pandas
doc
query is used to filter the rows that satisfy a condition.

# selecting malignant records in cancer data
df_m = df[df['diagnosis'] == 'M']
df_m = df.query('diagnosis == "M"')

# selecting records of people making over $50K
df_a = df[df['income'] == ' >50K']
df_a = df.query('income == " >50K"')

# the examples above filter on columns containing strings

# selecting records in cancer data with radius greater than the median
df_h = df[df['radius'] > 13.375]
df_h = df.query('radius > 13.375')
# select samples with alcohol content less than the median
low_alcohol = df.query('alcohol < 10.3')

# select samples with alcohol content greater than or equal to the median
high_alcohol = df.query('alcohol >= 10.3')

# ensure these queries included each sample exactly once
num_samples = df.shape[0]

num_samples == low_alcohol['quality'].count() + high_alcohol['quality'].count() # should be True

Combining query with count also gives the effect of a sum (row count).



Merge

pandas
Four different types of joins:

  • Inner Join - Use intersection of keys from both frames.
  • Outer Join - Use union of keys from both frames.
  • Left Join - Use keys from left frame only.
  • Right Join - Use keys from right frame only.
# merge df_08 and df_18 on the model columns (inner join)
df_combined = df_08.merge(df_18, left_on='model_2008', right_on='model', how='inner')
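
The example above uses an inner join; the other join types only change the how argument (a sketch reusing the same df_08 and df_18 frames):

# keep every 2008 model, matched to 2018 data where possible (left join)
df_left = df_08.merge(df_18, left_on='model_2008', right_on='model', how='left')

# keep the union of models from both years (outer join)
df_outer = df_08.merge(df_18, left_on='model_2008', right_on='model', how='outer')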