
Notes on data - Week 2

Two weeks of notes lumped into one. Classic me.

The DA process

question -> wrangle -> explore -> draw conclusion -> communicate

wrangle:

  • gather data
  • assess data to identify problems
  • clean data (modify, replace, or remove data to ensure it is high quality and well-structured)

explore:

  • finding patterns
  • visualizing relationships
  • building intuition
  • remove outliers
  • create new and more descriptive features

draw conclusion:

  • and make prediction

communicate:

  • also the most important part (…)
  • data visualization is very valuable

more CSV with Jupyter Notebooks

matplotlib

On plotting with matplotlib.
example

import matplotlib.pyplot as plt
%matplotlib inline

# two required arguments in pyplot's bar function: the x-coordinates of the bars, and the heights of the bars.
plt.bar([1, 2, 3], [224, 620, 425]);

# specify the x tick labels using pyplot's xticks function, or by specifying another parameter in the bar function. The two cells below accomplish the same thing.
# plot bars
plt.bar([1, 2, 3], [224, 620, 425])

# specify x coordinates of tick labels and their labels
plt.xticks([1, 2, 3], ['a', 'b', 'c']);

# plot bars with x tick labels
plt.bar([1, 2, 3], [224, 620, 425], tick_label=['a', 'b', 'c']);

# Set the title and label axes like this.
plt.bar([1, 2, 3], [224, 620, 425], tick_label=['a', 'b', 'c'])
plt.title('Some Title')
plt.xlabel('Some X Label')
plt.ylabel('Some Y Label');

another example

# Use query to select each group and get its mean quality
median = df['alcohol'].median()
low = df.query('alcohol < {}'.format(median))
high = df.query('alcohol >= {}'.format(median))

mean_quality_low = low['quality'].mean()
mean_quality_high = high['quality'].mean()

# Create a bar chart with proper labels
locations = [1, 2]
heights = [mean_quality_low, mean_quality_high]
labels = ['Low', 'High']
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Quality Ratings by Alcohol Content')
plt.xlabel('Alcohol Content')
plt.ylabel('Average Quality Rating');


# Use groupby to get the mean quality for each acidity level
acidity_level_quality_means = df.groupby('acidity_levels').quality.mean()

# Create a bar chart with proper labels
locations = [4, 1, 2, 3] # reorder values above to go from low to high
heights = acidity_level_quality_means

# labels = ['Low', 'Medium', 'Moderately High', 'High']
labels = acidity_level_quality_means.index.str.replace('_', ' ').str.title() # alternative to commented out line above

plt.bar(locations, heights, tick_label=labels)
plt.title('Average Quality Ratings by Acidity Level')
plt.xlabel('Acidity Level')
plt.ylabel('Average Quality Rating');

seaborn

Plotting with seaborn.
example

import seaborn as sns

examples

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

wine_df = pd.read_csv('winequality_edited.csv')

# get counts for each rating and color
color_counts = wine_df.groupby(['color', 'quality']).count()['pH']
color_counts

# get total counts for each color
color_totals = wine_df.groupby('color').count()['pH']
color_totals

# get proportions by dividing red rating counts by total # of red samples
red_proportions = color_counts['red'] / color_totals['red']
red_proportions

# we're missing a red wine value for the 9 rating. Even though this value is 0, we need it for our plot. Run the last two cells after running the cell below.
red_proportions['9'] = 0
red_proportions


# get proportions by dividing white rating counts by total # of white samples
white_proportions = color_counts['white'] / color_totals['white']
white_proportions

ind = np.arange(len(red_proportions)) # the x locations for the groups
width = 0.35 # the width of the bars

# plot bars
red_bars = plt.bar(ind, red_proportions, width, color='r', alpha=.7, label='Red Wine')
white_bars = plt.bar(ind + width, white_proportions, width, color='w', alpha=.7, label='White Wine')

# title and labels
plt.ylabel('Proportion')
plt.xlabel('Quality')
plt.title('Proportion by Wine Color and Quality')
locations = ind + width / 2 # xtick locations
labels = ['3', '4', '5', '6', '7', '8', '9'] # xtick labels
plt.xticks(locations, labels)

# legend
plt.legend()

SciPy

Math-related computation with SciPy (the snippets below actually use numpy.random).

random

doc

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# outcome of one coin flip
np.random.randint(2)

# outcomes of ten thousand coin flips
np.random.randint(2, size=10000)

# mean outcome of ten thousand coin flips
np.random.randint(2, size=10000).mean()

# outcome of one coin flip
np.random.choice([0, 1])

# outcome of ten thousand coin flips
np.random.choice([0, 1], size=10000)

# mean outcome of ten thousand coin flips
np.random.choice([0, 1], size=10000).mean()

# outcomes of ten thousand biased coin flips
np.random.choice([0, 1], size=10000, p=[0.8, 0.2])

# mean outcome of ten thousand biased coin flips
np.random.choice([0, 1], size=10000, p=[0.8, 0.2]).mean()

If we want to run several flips per test at once, use an array. For example, flip three coins per test and repeat the test many times.

# simulate 1 million tests of three fair coin flips
tests = np.random.randint(2, size=(int(1e6), 3))

# sums of all tests
test_sums = tests.sum(axis=1)

# proportion of tests that produced exactly one head (0 is treated as heads here, so a sum of 2 means one head and two tails)
(test_sums == 2).mean()

# simulate 1 million tests of three biased coin flips
# hint: use np.random.choice()
tests = np.random.choice([0, 1], size=(int(1e6), 3), p=[0.6, 0.4])

# sums of all tests
test_sums = tests.sum(axis=1)

# proportion of tests that produced exactly one head (0 is treated as heads here, so a sum of 2 means one head and two tails)
(test_sums == 2).mean()

# Or: rolling dice

# simulate the first million die rolls (faces 0-5 from np.arange(6); fine here since only equality of the two rolls matters)
first = np.random.choice(np.arange(6), size=int(1e6))

# simulate the second million die rolls
second = np.random.choice(np.arange(6), size=int(1e6))

# proportion of tests where the 1st and 2nd die rolled the same number
(first == second).mean()

binomial

import numpy as np

# number of heads from 10 fair coin flips: flip a coin with P(heads) = 0.5 ten times and return the number of heads
np.random.binomial(10, 0.5)

# results from 20 tests of 10 coin flips each: returns an array where each element is the number of heads out of 10
np.random.binomial(10, 0.5, 20)

# mean number of heads from the 20 tests
np.random.binomial(10, 0.5, 20).mean()

import matplotlib.pyplot as plt
%matplotlib inline

# plotting a histogram of many such tests shows the binomial distribution
plt.hist(np.random.binomial(10, 0.5, 1000000));

Note: if our data is in an array, we can use arr.mean() to compute the mean; similarly, arr.std() and arr.var() give the standard deviation and variance.
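
A quick sketch of those array statistics (the values are arbitrary):

import numpy as np

arr = np.array([2, 4, 4, 4, 5, 5, 7, 9])

arr.mean()  # 5.0
arr.std()   # 2.0 (population standard deviation, ddof=0 by default)
arr.var()   # 4.0 (population variance)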




Practice notes

Pandas

A column in a df has boolean True/False values, but for further calculations, we need 1/0 representation. How would you transform it?

Use astype(int) to convert.

df["somecolumn"] = df["somecolumn"].astype(int)



Describe how you will get the names of columns of a DataFrame in Pandas?

list(data.columns)



How are iloc() and loc() different?

I know this one! loc selects by label (key), while iloc selects by integer position.
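
A minimal sketch of the difference, using a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'score': [88, 92, 79]}, index=['amy', 'bob', 'cat'])

df.loc['bob', 'score']   # label-based -> 92
df.iloc[1, 0]            # position-based -> 92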




How can you sort the DataFrame?

Use DataFrame.sort_values().

#Sort by col1
df.sort_values(by=['col1'])

# Sort by multiple columns
df.sort_values(by=['col1', 'col2'])

# Sort descending
df.sort_values(by='col1', ascending=False)



How can you find the row for which the value of a specific column is max or min?

df['A'].idxmax()
df['A'].idxmin()

This returns the row index (label) of the max/min value, not the value itself.
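
To get the whole row rather than just its label, pass the result to loc (a small sketch, assuming a df with a column 'A' as above):

# row with the maximum value of column 'A'
df.loc[df['A'].idxmax()]

# row with the minimum value of column 'A'
df.loc[df['A'].idxmin()]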




How to split a string column in a DataFrame into two columns?

Use str.split.

df[['code','location']] = df['row'].str.split(n=1, expand=True)
# split the 'row' column into 'code' and 'location'



How to check whether a Pandas DataFrame is empty?

if df.empty:
    print('DataFrame is empty!')



How would you iterate over rows in a DataFrame in Pandas?

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

for index, row in df.iterrows():
    print(row['c1'], row['c2'])



Compare the Pandas methods: map(), applymap(), apply()

The map() method is an elementwise method for Pandas Series only; it maps the values of a Series according to an input correspondence.
It accepts a dict, a Series, or a callable. Values that are not found in the dict are converted to NaN.

# Example:

import numpy as np
import pandas as pd

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
print(s)
0 cat
1 dog
2 NaN
3 rabbit

s.map({'cat': 'kitten', 'dog': 'puppy'})

0 kitten
1 puppy
2 NaN
3 NaN

The applymap() method is an elementwise method for DataFrames only; it applies a function that accepts and returns a scalar to every element of a DataFrame.

It accepts callables only, i.e. a Python function.

# Example:

df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
print(df)
0 1
0 1.000 2.120
1 3.356 4.567

# Square each element

df.applymap(lambda x: x**2)
0 1
0 1.000000 4.494400
1 11.262736 20.857489

The apply() method applies a function along an axis of a DataFrame (row-wise or column-wise), rather than strictly elementwise. It is suited to more complex operations and aggregation.

It also accepts callables only.


df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

print(df)
A B
0 4 9
1 4 9
2 4 9

df.apply(np.sqrt)

A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0



Describe how you can combine (merge) data on Common Columns or Indices?

Use the .merge() method, which merges DataFrame or named Series objects with a database-style join. There are inner, left, right, and outer merge operations.

By default, the Pandas merge operation performs an “inner” merge. An inner merge keeps only the rows whose key values appear in both the left and right dataframes.

A left merge keeps every row in the left dataframe. Where the “on” value has no match in the right dataframe, the result is filled with NaN values.

A right merge keeps every row in the right dataframe. Where the “on” value has no match in the left dataframe, the result is filled with NaN values.

A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.
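
A small sketch of the four join types, using made-up frames keyed on 'id':

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'y': ['B', 'C', 'D']})

pd.merge(left, right, on='id')                # inner: ids 2, 3
pd.merge(left, right, on='id', how='left')    # left: ids 1, 2, 3 (y is NaN for id 1)
pd.merge(left, right, on='id', how='right')   # right: ids 2, 3, 4 (x is NaN for id 4)
pd.merge(left, right, on='id', how='outer')   # outer: ids 1, 2, 3, 4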




How can I achieve the equivalents of SQL’s IN and NOT IN in Pandas?

Use pd.Series.isin.

For IN use: something.isin(somewhere)

df[df['A'].isin([3, 6])]

For NOT IN: ~something.isin(somewhere)

df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]  # requires numpy imported as np
# df[-df["A"].isin([3, 6])] also appears in older examples, but unary minus on a boolean Series errors in recent pandas; prefer ~



How do you split a DataFrame according to a boolean criterion?

We can create a boolean mask to select one part of the dataframe and then use the complement operator (~) to take the rest.

import numpy as np
import pandas as pd

# Create example data
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14

# Mask with boolean criterion
m = df['A'] != 'foo'
# Split dataframe
a, b = df[m], df[~m]
# Results
a
A B C D
1 bar one 1 2
3 bar three 3 6
5 bar two 5 10
b
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14



How would you convert continuous values into discrete values in Pandas?

Depending on the problem, continuous values can be discretized using the cut() or qcut() function:

cut() bins the data based on values. We use it when we need to segment data values into evenly spaced bins; cut chooses bins that are evenly spaced according to the values themselves, not the frequency of those values. For example, cut could convert ages into groups of age ranges.

qcut() bins the data based on sample quantiles. We use it when we want the same number of records in each bin, or simply to study the data by quantiles. For example, if a dataset has 30 records and we want quintiles, qcut() will divide the data so that there are 6 records in each bin.
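
A minimal sketch of both (the ages are made up):

import pandas as pd

ages = pd.Series([3, 17, 25, 34, 48, 61, 77, 90])

# cut: 4 bins of equal width; the number of records per bin can differ
pd.cut(ages, 4)

# qcut: 4 bins with (roughly) the same number of records; bin widths can differ
pd.qcut(ages, 4)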




How would you create Test (20%) and Train (80%) Datasets with Pandas?

scikit-learn’s train_test_split is a good option; it will split numpy arrays as well as dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)



Name some type conversion methods in Pandas:

  • to_numeric()
    provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type.
  • astype()
    convert (almost) any type to (almost) any other type. Also allows you to convert to categorical types (very useful).
  • infer_objects()
    a utility method to convert object columns holding Python objects to a pandas type if possible. It does this by inferring better dtypes for object columns.
  • convert_dtypes()
    convert DataFrame columns to the “best possible” dtype that supports pd.NA (pandas’ object to indicate a missing value).



Pivot Table Challenge

To produce the required table we use pivot_table with index='Col X', columns='Col Y', and len as the aggfunc:

pd.pivot_table(df, index=['Col X'], columns=['Col Y'], aggfunc=len, fill_value=0)

Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0

df.groupby(['Col X', 'Col Y']).size().unstack('Col Y', fill_value=0)  # groupby alternative

Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0



What is the difference(s) between merge() and concat() in Pandas?

  • .concat() simply stacks multiple DataFrames together, either vertically or stitched horizontally after aligning on the index.
  • .merge() first aligns the two DataFrames on the selected common column(s) or index, and then picks up the remaining columns from the aligned rows of each DataFrame.
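
A quick sketch of the contrast, with made-up frames:

import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [3, 4], 'x': [30, 40]})
c = pd.DataFrame({'id': [1, 2], 'y': [0.1, 0.2]})

pd.concat([a, b], ignore_index=True)  # stacks rows: 4 rows with columns id, x
pd.merge(a, c, on='id')               # aligns on 'id': 2 rows with columns id, x, y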



What’s the difference between at and iat in Pandas?

I hadn't seen either of these functions before…!

at and iat are functions meant to access a scalar, that is, a single element in the dataframe.

With .at:

  • Selection is label based, but it only selects a single ‘cell’ in your DataFrame.
  • We can assign values to new indices and columns.
  • To use .at, pass it both a row and a column label, separated by a comma.

With .iat:

  • Selection with .iat is position based (integer positions, like iloc), but it only selects a single scalar value.
  • We can’t assign values to new indices and columns.
  • To use .iat you must pass it an integer for both the row and column locations.

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  columns=['A', 'B', 'C'],
                  index=['first', 'second', 'third'])

df
A B C
first 0 2 3
second 0 4 1
third 10 20 30

df.at['second', 'B']
4
df.iat[1, 1]
4



What’s the difference between interpolate() and fillna() in Pandas?

  • fillna() fills NaN values with a given value you want to substitute. It gives you the option to fill by the row index of a pd.DataFrame or by column name, in the form of a Python dict.

  • interpolate() gives you the flexibility to fill missing values with many kinds of interpolation between the existing values, like linear, time, etc. (which fillna does not provide).
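
A minimal sketch of the difference on an arbitrary series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

s.fillna(0)      # 1.0, 0.0, 0.0, 4.0 -- substitute a constant
s.interpolate()  # 1.0, 2.0, 3.0, 4.0 -- linear interpolation between neighbors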




What’s the difference between pivot_table() and groupby()?

Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.

When using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum) a table is created where:

  • a is on the row axis,
  • b is on the column axis,
    and the values are the sum of c.

For example,

df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
c
b 1 2
a
1 0.528470 0.484766
2 0.187277 0.144326
3 0.866832 0.650100

Using groupby:

  • The dimensions given are placed into columns.
  • The rows are created for each combination of those dimensions.
  • To obtain the “equivalent” output as before, we can create a Series of the sum of values c, grouped by all unique combinations of a and b.
df.groupby(['a','b'])['c'].sum()
a b
1 1 0.951714
2 0.270743
2 1 0.661338
2 0.343530
3 1 0.210315
2 0.220240
Name: c, dtype: float64



How would you deal with large CSV files in Pandas?

Pandas is an in-memory tool library, so the data needs to fit in memory before pandas can work with it. If you come across a large CSV file that you want to process, there are a few options.

If you can process portions of it at a time, you can read the file in chunks with the chunksize parameter and process each chunk (see the sketch after the list below).
Alternatively, there are a few hints to help pare down the memory footprint.

  • Use the nrows parameter of read_csv to limit how much data we load to a small sample.
  • After reading in our limited sample, we explore the data and inspect the data types of each column with the dtypes attribute, then look at the memory usage of each column with the memory_usage() method. With this information, we get a hint of which columns need to be transformed. For example, if we have a dummy column we can convert it to an 8-bit integer with dummy_column.astype(np.int8).
  • To save even more memory, we will want to consider changing object data types to category if they have a reasonably low cardinality.
  • If there are columns that we know we can ignore, we can use the usecols parameter of the read_csv() function to specify the columns we want to load.
  • If we assess that with these transformations the memory usage decreases, we can then repeat this process with the rest of the data as chunks.
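
A sketch of chunked processing combined with usecols and dtype; the file name and columns here are hypothetical:

import pandas as pd

cols = ['id', 'quality', 'color']                                  # hypothetical columns
dtypes = {'id': 'int32', 'quality': 'int8', 'color': 'category'}   # smaller dtypes

totals = None
for chunk in pd.read_csv('big_file.csv', usecols=cols, dtype=dtypes, chunksize=100000):
    # aggregate each chunk, then combine the partial results
    counts = chunk.groupby('color')['quality'].count()
    totals = counts if totals is None else totals.add(counts, fill_value=0)

totals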



What does the stack() and unstack() functions do in a DataFrame?

The concept of stacking comes in handy when we have data with multi-indices.

The stack() function rotates or pivots the innermost column index into the innermost row index.
The unstack() function does exactly the inverse operation, it will convert the innermost row index back into the innermost column index.

Therefore,

Stacking means rearranging the data vertically (or stacking it on top of each other, hence the name stacking), making the shape of the dataframe a taller and narrower stack (with fewer columns and more rows).
And unstacking will do just the opposite, spreading out the data and reshaping it into a shorter but wider dataframe (with fewer rows but more columns).
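
A tiny sketch with a made-up frame:

import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

stacked = df.stack()   # Series with a MultiIndex: (a, one)=1, (a, two)=3, (b, one)=2, (b, two)=4
stacked.unstack()      # back to the original 2x2 DataFrame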