Well then, let's get started.
Anaconda
Think of it as a sandboxed-environment manager, useful for running different Python versions.
conda - A command-line utility for package and environment management.
upgrade
conda upgrade conda
add package
conda install PACKAGE_NAME
managing environments
The environments that actually run your Python code.
conda create -n env_name [python=X.X] [LIST_OF_PACKAGES]
For example, to create an environment named my_env with Python 3.7, and install NumPy and Keras in it, use the command below.
conda create -n my_env python=3.7 numpy keras
conda create -n py3_env python=3
entering (activating) an environment
# For conda 4.6 and later versions on Linux/macOS/Windows, use
conda activate env_name
Roughly: once you have set an environment up with create, you can enter it whenever you like; it is not one-time-use.
sharing an environment
Save all the information about the current environment to a YAML file, environment.yaml, and later share this file with other users via GitHub or other means.
This file will get created (or overwritten) in your current directory.
conda env export > environment.yaml
To create an environment from an environment file
conda env create -f environment.yaml
If you forget what your environments are called:
conda env list
list the packages inside an environment
# If the environment is not activated
conda list -n env_name
removing an environment
conda env remove -n env_name
Jupyter Notebooks
The notebook is a web application that allows you to combine explanatory text, math equations, code, and visualizations all in one easily sharable document.
I think it's quite powerful!
install
conda install jupyter notebook
run
jupyter notebook
add package
conda install nb_conda
Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a utility called nbconvert for converting to HTML, Markdown, slideshows, etc.
The general syntax to convert a given mynotebook.ipynb file to another FORMAT is:
jupyter nbconvert --to FORMAT mynotebook.ipynb
The currently supported output FORMAT values are the following (case-insensitive):
- HTML
- LaTeX
- PDF
- WebPDF
- Reveal.js HTML slideshow
- Markdown
- Ascii
- reStructuredText
- executable script
- notebook
For example, to convert a notebook to an HTML file, in your terminal use
# Install nbconvert first, if it is not already installed
jupyter nbconvert --to html mynotebook.ipynb
Exporting:
PDF, one of the common targets, requires installing extra dependencies; see the documentation.
Another common one: set slide types under View → Cell Toolbar → Slideshow, then export a web-based presentation.
jupyter nbconvert presentation.ipynb --to slides
With the toggle template file from the appendix you can hide the code cells; it should sit in the same folder as the .ipynb file.
jupyter nbconvert Your_Slide_Deck_Name.ipynb --to slides --post serve --template output_toggle
Incidentally, you can also run this from inside an .ipynb file:
!jupyter nbconvert Your_Slide_Deck_Name.ipynb --to slides --post serve --template output_toggle
Reading CSVs with Jupyter Notebooks and pandas
Read
read_csv() is used to load data from csv files into a Pandas dataframe.
import pandas as pd
df = pd.read_csv('student_scores.csv')
CSV stands for comma-separated values, but the values can actually be separated by other characters: tabs, whitespace, etc.
When the CSV differs from the default format, you have to specify the separator.
df = pd.read_csv('student_scores.csv', sep=':')
Another thing you can do with read_csv is specify which line of the file is the header, which specifies the column labels.
df = pd.read_csv('student_scores.csv', header=2)
If column labels are not included in your file, use header=None to prevent the first line of data from being misinterpreted as column labels.
df = pd.read_csv('student_scores.csv', header=None)
Renaming columns: specify your own column labels.
labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', names=labels)
If you want to tell pandas that there was a header line that you are replacing, specify the row of that line like this.
labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', header=0, names=labels)
Instead of using the default index (integers incrementing by 1 from 0), specify one or more of your columns to be the index of your dataframe.
df = pd.read_csv('student_scores.csv', index_col='Name')
df = pd.read_csv('student_scores.csv', index_col=['Name', 'ID'])
Assessing data
More helpful methods for assessing and building intuition about a dataset.
import pandas as pd
# this returns a tuple of the dimensions of the dataframe
df.shape
You can also count rows and columns this way:
rows = len(df.axes[0])
cols = len(df.axes[1])
Fetching the data for a particular column:
# sales on march 13, 2016
# this returns the datatypes of the columns
df.dtypes
# although the datatype for diagnosis appears to be object, further investigation shows it's a string
type(df['diagnosis'][0])
Pandas actually stores pointers to strings in dataframes and series, which is why object instead of str appears as the datatype.
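A quick, self-contained way to verify this claim (the diagnosis values here are made up):
import pandas as pd

df = pd.DataFrame({'diagnosis': ['M', 'B', 'B']})
print(df['diagnosis'].dtype)          # object
print(type(df['diagnosis'].iloc[0]))  # <class 'str'> - the elements are Python strings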
# this displays a concise summary of the dataframe, including the number of non-null values in each column
df.info()
Checking for nulls: isnull()
doc
# view missing value count for each feature in 2008
df_08.isnull().sum()
# this returns the number of unique values in each column
df.nunique()
For a single column, use df.education.nunique(), where education is the column name.
Describe
# this returns useful descriptive statistics for each column of data
df.describe()
It returns the count, mean, std, min, 25%, 50%, 75%, and max of each column.
Median
# get the median amount of alcohol content
median = df['alcohol'].median()
# this returns the first few lines in our dataframe
df.head()
# View the index number and label for each column
for i, v in enumerate(df.columns):
    print(i, v)
Loc and iloc
We can select data using loc and iloc, which you can read more about here.
We can use .loc, .iloc, and .ix to select rows from a DataFrame.
.loc works on the labels of the index: df.loc[4] looks for the rows of the DataFrame whose index label is 4.
.iloc works on the positions in the index: df.iloc[4] returns the row sitting at position 4, whatever its label.
.ix is the complex case: if the index is integer-based, ix[4] is treated as a label lookup, like loc; if the index is not purely integer-based, ix falls back to positions, like iloc. (Note that ix is deprecated and has been removed in modern pandas.)
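A minimal sketch of the loc/iloc difference, on a made-up dataframe with non-sequential index labels:
import pandas as pd

df = pd.DataFrame({'score': [90, 85, 70]}, index=[4, 2, 7])
print(df.loc[4])   # label lookup: the row whose index label is 4 (score 90)
print(df.iloc[1])  # position lookup: the second row, label 2 (score 85)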
# select all the columns from 'id' to the last mean column ('last_mean_col' is a placeholder label)
df_means = df.loc[:, 'id':'last_mean_col']
Summing over a few rows:
# total sales for the last month
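A hedged sketch of one way such a sum might look, assuming weekly sales rows so the last month is the final four rows:
import pandas as pd

sales = pd.DataFrame({'east': [10, 12, 9, 11, 13],
                      'west': [7, 8, 6, 9, 10]})
# sum each column over the last four rows
last_month_total = sales.iloc[-4:].sum()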
One more thing about sums.
The approach above also sums categorical data, which is a bit awkward.
Instead, we can filter for the rows that meet our condition and count them to get the sum we want.
print(len(admits[admits['gender'] == 'female']))
In the admits CSV, this is the total number of rows where gender is female.
len(admits[(admits['gender'] == 'female') & (admits['admitted'])])
For two conditions, join them with & (and); a runnable sketch follows.
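A self-contained version of the two counts above, with made-up admits data:
import pandas as pd

admits = pd.DataFrame({'gender': ['female', 'male', 'female'],
                       'admitted': [True, False, True]})
# one condition: number of female applicants
print(len(admits[admits['gender'] == 'female']))
# two conditions joined with &: admitted female applicants
print(len(admits[(admits['gender'] == 'female') & (admits['admitted'])]))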
Clean data
Common problems:
Missing data (nulls)
You can spot them by inspecting df.info().
Think about why the data is missing and whether it is missing at random; one fix is to fill with the mean.
mean = df['name_of_column'].mean()
df['name_of_column'] = df['name_of_column'].fillna(mean)
Check data
This can also be used for visualization.
# check value counts for the 2008 cyl column
df_08['cyl'].value_counts()
Note: value_counts() lists every distinct value in the column together with how many times it occurs.
Duplicate
Checking for duplicates:
df.duplicated()
When the same id can appear with different data in the other columns, check for duplicates on the id column, as sketched below.
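duplicated and drop_duplicates take a subset parameter for exactly this; a sketch with made-up rows:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2], 'score': [10, 20, 25]})
print(df.duplicated(subset=['id']))     # True only for the second row with id 2
df = df.drop_duplicates(subset=['id'])  # keeps the first row for each id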
Drop extraneous columns
Other uses of drop: see the doc.
# drop columns from 2008 dataset ('col_a' and 'col_b' are placeholder names)
df_08.drop(['col_a', 'col_b'], axis=1, inplace=True)
Doc: dropping missing values.
# Drop the rows where at least one element is missing.
df.dropna(inplace=True)
Incorrect data type
Timestamps stored in the wrong type:
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Extract int from strings in the 2008 cyl column
df_08['cyl'] = df_08['cyl'].str.extract(r'(\d+)', expand=False).astype(int)
Renaming columns
This example also involves merging data; see the doc.
# The most basic case is a one-to-one rename
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# assign new labels to columns in dataframe (new_labels is a list with one entry per column)
df.columns = new_labels
# save this for later ('renamed_data.csv' is a placeholder filename)
df.to_csv('renamed_data.csv', index=False)
Splitting one column into two
In this example, the values are formatted as 'num_1/num_2' or 'num_1'.
# Get all the hybrids in 2008
hb_08 = df_08[df_08['fuel'].str.contains('/')]
Then use the apply function from pandas:
# columns to split by "/"
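# (the rest of this block is reconstructed as a sketch; the column
#  list below is a placeholder, adjust it to your data)
split_columns = ['fuel', 'city_mpg']
df1 = hb_08.copy()  # will keep the value before the "/"
df2 = hb_08.copy()  # will keep the value after the "/"
for c in split_columns:
    df1[c] = df1[c].apply(lambda x: x.split('/')[0])
    df2[c] = df2[c].apply(lambda x: x.split('/')[1])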
Visualizing data
documentation
using matplotlib.
import matplotlib.pyplot as plt
%matplotlib inline
df.hist(figsize=(8,8));
Count each distinct value, or plot the counts as a bar chart:
df['name_of_column'].value_counts()
df['name_of_column'].value_counts().plot(kind='bar');
pd.plotting.scatter_matrix(df, figsize=(15,15));
Appending and numpy
Combining two CSVs into one for analysis.
NumPy: because NumPy is implemented in C, it is faster than pure Python.
import numpy as np
We use NumPy to work with multi-dimensional arrays of values.
Aside:
# generate an array of 100 million random floats between zero and one
a = np.random.random(int(1e8))
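To see the speed gap the aside is pointing at, compare NumPy's vectorized sum against Python's built-in sum on that array (a rough sketch; exact timings vary by machine):
import time
import numpy as np

a = np.random.random(int(1e8))

start = time.time()
np.sum(a)                      # vectorized loop in C
print('numpy:', time.time() - start)

start = time.time()
sum(a)                         # Python-level loop over 100 million floats
print('python:', time.time() - start)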
Create columns
Using numpy repeat:
color_red = np.repeat('red', red_df.shape[0])
Here, shape[0] gives the number of rows.
Then we add the array to the original dataframe as a new column.
red_df['color'] = color_red
Use pandas to combine the two dataframes. (Note: DataFrame.append was removed in pandas 2.0; wine_df = pd.concat([red_df, white_df]) is the modern equivalent.)
wine_df = red_df.append(white_df)
Save the combined dataframe.
wine_df.to_csv('winequality_edited.csv', index=False)
Troubleshooting appends:
When the column names don't match, the data gets misaligned.
Rename the columns first:
df = df.rename(columns={'two': 'new_name'})
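A self-contained sketch of the failure and the fix (column names made up; pd.concat stands in for the removed append):
import pandas as pd

red_df = pd.DataFrame({'one': [1], 'two': [2]})
white_df = pd.DataFrame({'one': [3], 'new_name': [4]})

# mismatched names misalign the data: 'two' and 'new_name' become
# separate columns, padded with NaN
print(pd.concat([red_df, white_df]))

# renaming first keeps the columns aligned
red_df = red_df.rename(columns={'two': 'new_name'})
print(pd.concat([red_df, white_df]))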
Group by
Use pandas groupby to split the data on one column and examine the other columns per group.
intro
doc
# mean of each column within each quality group
df.groupby('quality').mean()
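A tiny runnable version of the pattern, with made-up wine rows:
import pandas as pd

df = pd.DataFrame({'quality': [5, 5, 6],
                   'pH': [3.1, 3.3, 3.2],
                   'alcohol': [9.5, 10.0, 11.0]})
print(df.groupby('quality').mean())  # one row per quality value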
Cut
From pandas: we bin one column's values and use the bins as categories.
doc
In this example, we would like to cut pH into four acidity levels:
High: lowest 25% of pH values
Moderately High: 25% - 50% of pH values
Medium: 50% - 75% of pH values
Low: 75% - max pH value
We read these cut points off df.describe().pH and plug them into the cut function manually:
count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64
# Bin edges that will be used to "cut" the data into groups
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]  # min, 25%, 50%, 75%, max from describe()
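To finish the example, label the four bins and apply pd.cut; the label strings and the new column name are assumptions, following the levels listed above.
# labels for the four bins, from lowest pH (most acidic) to highest (assumed names)
bin_names = ['high', 'mod_high', 'medium', 'low']
# store the binned values in a new (assumed) acidity_levels column
df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names)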
Query
from pandas
doc
query filters the rows that satisfy a condition.
# selecting malignant records in cancer data
df_m = df.query('diagnosis == "M"')
# select samples with alcohol content less than the median
df_low = df.query('alcohol < @median')  # @median refers to the Python variable computed earlier
Combining query with a count also gives the sum-style tallies shown earlier.
Merge
pandas
Four different kinds of joins:
- Inner Join - Use intersection of keys from both frames.
- Outer Join - Use union of keys from both frames.
- Left Join - Use keys from left frame only.
- Right Join - Use keys from right frame only.
# merge df_08 and df_18
df_combined = df_08.merge(df_18, left_on='model_2008', right_on='model', how='inner')
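A self-contained sketch of how the four how= values differ, on toy frames:
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'r': [3, 4]})

print(left.merge(right, on='key', how='inner'))  # only 'b' (intersection of keys)
print(left.merge(right, on='key', how='outer'))  # 'a', 'b', 'c' (union of keys)
print(left.merge(right, on='key', how='left'))   # 'a', 'b' (keys from left frame)
print(left.merge(right, on='key', how='right'))  # 'b', 'c' (keys from right frame)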