[Pandas] 01. DataFrame, Series

Updated: November 24, 2021

import pandas as pd
import numpy as np

1. Series

Series는 one-dimensional labeled array이다.

특징

array: 대부분의 data type을 지원한다. (integer,string, floating point number, Python object)
labeled: array안에 담긴 data를 label을 이용해 접근할 수 있다. 이 때 label은 index라고 지칭한다.

사용법

s = pd.Series(data,index=index)

data는 series로 담을 data를 넣어준다.

index는 1차원 list형태로 label을 넣어준다. index를 따로 넣지 않는 경우에는 0 ~ len(data)-1 로 index가 설정이 된다.

1.1. ndarray를 이용해 만들기

s = pd.Series(np.random.randn(5),index=list('abcde'))

print(s)

print(s.index)

a   -1.295967
b    0.780990
c    0.026224
d    0.973797
e   -2.681411
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s = pd.Series(np.random.randint(1,6,(5,)))

print(s)
print(s.index)

  5
  1
  5
  5
  2
dtype: int64
RangeIndex(start=0, stop=5, step=1)

1.2. dictionary를 이용해 만들기

dictionary를 이용해 만들면 key가 index가 되고 value가 data가 된다.

d = {'a':1 ,"b":2, "c":3}

s = pd.Series(d)

print(s)
print(s.index)

a    1
b    2
c    3
dtype: int64
Index(['a', 'b', 'c'], dtype='object')

1.3. scalar value를 이용해 만들기

scalar value를 이용해 만들면 scalar value가 len(index)만큼 복사되어 만들어진다.

s = pd.Series(3,index=list('abcde'))

print(s)

a    3
b    3
c    3
d    3
e    3
dtype: int64

1.4. name attribute

Series는 name을 지정해줄 수 있다.

s = pd.Series(np.random.randint(1,6,(5,)), name="randint array")

print(s)
print(f"s.name: {s.name}")

  2
  4
  4
  4
  2
Name: randint array, dtype: int64
s.name: randint array

2. Series에서 data 가져오기

2.1. ndarray처럼 이용하기

ndarray와 같이 index slicing을 사용할 수 있으며 elementary operation 또한 지원한다.

s = pd.Series(np.random.randint(-5,6,(5,)),index=list('abcde'))
print(f"s:\n{s}\n")
print(f"s[:3]: \n{s[:3]}\n")

print(f"s[s>0]: \n{s[s>0]}\n")
print(f"np.exp(s): \n{np.exp(s)}")

s:
a    2
b    2
c    0
d   -5
e    0
dtype: int64

s[:3]:
a    2
b    2
c    0
dtype: int64

s[s>0]:
a    2
b    2
dtype: int64

np.exp(s):
a    7.389056
b    7.389056
c    1.000000
d    0.006738
e    1.000000
dtype: float64

s = pd.Series(np.random.randint(-5,6,(5,)),index=list('abcde'))
print("Pandas Array")
print(s.array)
print("\nNumPy")
print(s.to_numpy())

Pandas Array
<PandasArray>
[0, 2, 1, -3, -3]
Length: 5, dtype: int64

NumPy
[ 0  2  1 -3 -3]

2.2. dictionary 처럼 이용하기

index label을 key값처럼 사용해 data에 접근할 수 있다.

s = pd.Series(np.random.randint(-5,6,(5,)),index=list('abcde'))

print(s['a'])
print(s['e'])
print(s.get('b'))

-1
-2
-3

3. DataFrame

DataFrame은 2-dimensional labeled data structure이다.

특징

2-dimensional : 2차원 구조를 지니기 때문에 행과 열을 가진 table처럼 사용할 수 있다.
labeled : Series와 각 행과 열, data를 label로 접근할 수 있다. row label은 index, column label은 columns로 지칭한다.
data structure: list, dictionary, Series, numpy ndarray, another DataFrame 모두 지원한다.

기본적인 사용법

df = pd.DataFrame(data,index=index, columns=columns)

3.1. dictionary를 이용해 만들기

다양한 data type을 dictionary에 넣고 dictionary를 이용해 DataFrame을 만들 수 있다.

dictionary를 이용해 DataFrame을 만드는 경우 key가 column label이 된다.

d = {'a':[1,2,3],"b":[4,5,6],"c":[7,8,9]}

df1 = pd.DataFrame(d)

df2 = pd.DataFrame(d,index=['d','e','f'])

print(df1)
print()
print(df2)

d = {'col A':pd.Series(np.random.randint(1,5,(3,)),index=list("abc")),
     "col B":pd.Series(np.random.randint(1,5,(3,)),index=list("abc")),
     "col C":pd.Series(np.random.randint(1,5,(3,)),index=list("abc"))}

df = pd.DataFrame(d)

df

	col A	col B	col C
a	3	1	2
b	2	3	3
c	2	2	1

3.2. dictionary list를 이용해 만들기

원소로 들어간 dictionary가 하나의 행을 이룬다.

이 때, 각 dictionary의 key가 column label이 된다.

listOfDict = [{"a": 1, "b": 2}, {"a": 5, "b": 10}]

df1 = pd.DataFrame(listOfDict)

df2 = pd.DataFrame(listOfDict,index=['row 1', 'row 2'])

print(df1)
print()
print(df2)

   a   b
0  1   2
1  5  10

       a   b
row 1  1   2
row 2  5  10

3.3 DataFrame.from_dict()

dictionary를 이용해 DataFrame을 빠르게 만들 수 있는 메소드이다.

dictionary뿐만 아니라 OrderedDict, dict()으로 형변환된 data 등을 지원한다.

key가 column label로 사용이 된다.

df = pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))

df

	A	B
0	1	4
1	2	5
2	3	6

key를 row label로 사용하고 싶은 경우 orient='index' parameter를 추가하면 된다.

이 때 각 dictionary는 행으로 들어간다.

df = pd.DataFrame.from_dict(
    dict([("a", [1, 2, 3]), ("b", [4, 5, 6])]),
    orient="index",
    columns=["col 1", "col 2", "col 3"],
)

df

	col 1	col 2	col 3
a	1	2	3
b	4	5	6

3.4. DataFrame.from_records() 메소드 사용하기

ndarray 또는 tuple로 이루어진 list를 가져와서 DataFrame을 만든다.

l = [(1,2,3),(4,5,6)]

df = pd.DataFrame.from_records(l,index=['a','b'], columns=['A','B','C'])

df

	A	B	C
a	1	2	3
b	4	5	6

nkw