import pandas as pd
Pandas is a high-level data manipulation tool (think of it as an advanced NumPy) that makes it easy to work with tabular data.
Two tutorials:
The DataFrame is a key data structure in Pandas (think of it as an advanced two-dimensional NumPy ndarray with labeled rows and columns).
# use a name other than `dict` so the built-in type is not shadowed
data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98]}
brics = pd.DataFrame(data)
brics
# data index, row index
brics.index
brics.index = ["BR", "RU", "IN", "CH", "SA"] # length must match
brics
# column index
brics.columns
brics.describe()  # summary statistics for the numeric columns
# column index, by name
brics[['area', 'capital']]
# row index
brics[1:4]
# select by label: the first dimension is rows and the second is columns
brics.loc[['RU','IN'], ['area', 'capital']]
# select by integer position
brics.iloc[:2, :2]
# boolean indexing
brics[brics.area > 4]
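One detail worth checking by hand is how the slice endpoints behave: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. The sketch below uses a small hypothetical frame `df` (not the `brics` table) and also shows combining two boolean conditions with `&`.

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration.
df = pd.DataFrame(
    {"area": [8.516, 17.10, 3.286], "population": [200.4, 143.5, 1252.0]},
    index=["BR", "RU", "IN"],
)

by_label = df.loc["BR":"RU"]   # label slice: the end label "RU" is included
by_position = df.iloc[0:1]     # position slice: the end position 1 is excluded

# Each condition needs its own parentheses because & binds tighter than >.
mask = (df["area"] > 4) & (df["population"] > 150)
filtered = df[mask]

print(list(by_label.index))
print(list(by_position.index))
print(list(filtered.index))
```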
for t in brics:  # iterating a DataFrame yields its column names, like a dict
    print(t)  # key: the column name
    print(type(brics[t]))  # looks like a one-column DataFrame, but is actually a Series
    print(brics[t])  # value: the column data
# sort by index, by row or column
# by row
brics.sort_index(axis=0, ascending=False)
# by column
brics.sort_index(axis=1, ascending=False)
# by value: like sorting by row index, but ordered by the values in the given columns
brics.sort_values('population')
brics.sort_values(['area', 'population'])  # lexicographic order across multiple columns
area = brics['area']
print(type(area))
area  # feels like a one-column DataFrame, but is actually a Series
# a one-column DataFrame is like a matrix of shape (n, 1), a Series like a vector of shape (n,)
brics[['area']]
# series to dataframe
brics_area = pd.DataFrame(area)
brics_area
# build a Series from an array: area.values
area2 = pd.Series(area.values)
area2
# 'for loop' test
for v in area:  # iterates over the values, like an array
    print(v)
for i in area.index:  # the index labels are iterated separately
    print(i)
# index a Series by label
area[area.index[0]]
# by integer position (plain area[0] on a string-labeled index is deprecated)
area.iloc[0]
# sorting works the same as for a DataFrame
area.sort_index()
area.sort_values(ascending=False)
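# just assign values, like NumPy
brics[:1] = 0  # the scalar is broadcast across the first row
brics
# name the columns explicitly so the values do not depend on column order
brics.loc["BR", ["area", "capital", "country", "population"]] = [8.516, "Brasilia", "Brazil", 200.40]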
brics
np_brics = brics.values # no row indices and column names
np_brics
brics.to_numpy()
np_area = area.values # no row indices
np_area
area.to_numpy()
# for this mixed-dtype frame, .values returns a copy, so the array and the DataFrame do not share memory
np_brics[0,0] = 0
np_brics  # changed to 0
brics  # keeps the original value
The operations above are also available in NumPy, so we do not actually need Pandas for them. Here are some features specific to Pandas.
# sql style join
pd.merge(left=brics, right=brics, on='area')
# row-wise concatenation, NumPy style (DataFrame.append was removed in pandas 2.0)
pd.concat([brics, brics])
# column append
brics_new = brics.copy()
brics_new.columns = ['a', 'b', 'c', 'd']
brics.join(brics_new)
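Merging a table with itself hides what the join type does, so here is a minimal sketch with two different tables. The frames `left` and `right` are hypothetical and only share some of their `country` keys; `how="inner"` keeps the keys found in both, `how="left"` keeps every row of the left table.

```python
import pandas as pd

# Hypothetical tables that overlap on only two countries.
left = pd.DataFrame({"country": ["Brazil", "Russia", "India"],
                     "area": [8.516, 17.10, 3.286]})
right = pd.DataFrame({"country": ["Russia", "India", "China"],
                      "capital": ["Moscow", "New Delhi", "Beijing"]})

# Inner join: only keys present in both tables survive.
inner = pd.merge(left, right, on="country", how="inner")

# Left join: every row of `left` is kept; missing matches become NaN.
left_join = pd.merge(left, right, on="country", how="left")

print(inner)
print(left_join)
```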
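By “group by” we are referring to a process involving one or more of the following steps: (1) splitting the data into groups based on some criteria, (2) applying a function to each group independently, and (3) combining the results into a data structure.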
brics2 = pd.concat([brics, brics])  # DataFrame.append was removed in pandas 2.0
# step 1
area_group = brics2.groupby('area')
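# steps 2 and 3: each call applies an aggregation per group and combines the results
area_group.count()
area_group.sum(numeric_only=True)  # restrict the sum to numeric columns
area_group.mean(numeric_only=True)  # the mean is only defined for numeric columns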
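The split, apply, and combine steps can also be done in one expression with `agg`. This is a small sketch on a hypothetical frame with a `group` key and a `value` column (names chosen for illustration only), so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: two groups, two rows each.
df = pd.DataFrame({"group": ["a", "b", "a", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# Split by "group", apply three aggregations to "value", combine into one table.
summary = df.groupby("group")["value"].agg(["count", "sum", "mean"])

print(summary)
```

Each row of `summary` is one group and each column is one aggregation, which is often easier to read than calling `count()`, `sum()`, and `mean()` separately.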