Processing Text Data
This section covers how to process text data with pandas.
As sample data, we load the module index from the official Python documentation into a DataFrame.
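import pandas as pd

# Read the first table on the module index page, drop the unused first column,
# name the remaining columns, and discard rows with missing values.
df = (
    pd.read_html("https://docs.python.org/3/py-modindex.html")[0]
    .drop(0, axis=1)
    .rename({1: "module", 2: "description"}, axis=1)
    .dropna()
)
df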
 | module | description |
---|---|---|
2 | __future__ | Future statement definitions |
3 | __main__ | The environment where top-level code is run. C... |
4 | _thread | Low-level threading API. |
7 | abc | Abstract base classes according to :pep:`3119`. |
8 | aifc | Deprecated: Read and write audio files in AIF... |
... | ... | ... |
387 | zipapp | Manage executable Python zip archives |
388 | zipfile | Read and write ZIP-format archive files. |
389 | zipimport | Support for importing Python modules from ZIP ... |
390 | zlib | Low-level interface to compression and decompr... |
391 | zoneinfo | IANA time zone support |
331 rows × 2 columns
The .str accessor
A Series provides the .str accessor, which operates on the string held in each element. Through it you can call methods that mirror those of Python's built-in str type.
df.loc[:, "description"].str
<pandas.core.strings.accessor.StringMethods at 0x7fe307ee3350>
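To make the element-wise behavior concrete, here is a minimal sketch on a small hand-written Series. The sample values are hypothetical and are not taken from the module index above.

```python
import pandas as pd

# Hypothetical sample data, not read from the module index.
s = pd.Series(["zipfile", "zlib", "zoneinfo"])

# Each .str method is applied element by element and returns a new Series.
upper = s.str.upper()
lengths = s.str.len()

print(upper.tolist())    # ['ZIPFILE', 'ZLIB', 'ZONEINFO']
print(lengths.tolist())  # [7, 4, 8]
```

The original Series is left unchanged; each call produces a new object.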
Consider extracting the entries in the description column that start with Deprecated:. The .str accessor offers a method equivalent to the startswith method of Python's str type.
df.loc[:, "description"].str.startswith("Deprecated:")
2 False
3 False
4 False
7 False
8 True
...
387 False
388 False
389 False
390 False
391 False
Name: description, Length: 331, dtype: bool
Because .str.startswith returns boolean values, passing its result to the .loc indexer extracts the rows where the value is True.
df.loc[df.loc[:, "description"].str.startswith("Deprecated:"), :]
 | module | description |
---|---|---|
8 | aifc | Deprecated: Read and write audio files in AIF... |
12 | asynchat | Deprecated: Support for asynchronous command/... |
14 | asyncore | Deprecated: A base class for developing async... |
16 | audioop | Deprecated: Manipulate raw audio data. |
28 | cgi | Deprecated: Helpers for running Python script... |
29 | cgitb | Deprecated: Configurable traceback handler fo... |
30 | chunk | Deprecated: Module to read IFF chunks. |
48 | crypt (Unix) | Deprecated: The crypt() function used to chec... |
171 | imghdr | Deprecated: Determine the type of image conta... |
172 | imp | Deprecated: Access the implementation of the ... |
203 | mailcap | Deprecated: Mailcap file handling. |
209 | msilib (Windows) | Deprecated: Creation of Microsoft Installer f... |
221 | nis (Unix) | Deprecated: Interface to Sun's NIS (Yellow Pa... |
222 | nntplib | Deprecated: NNTP protocol client (requires so... |
227 | optparse | Deprecated: Command-line option parsing library. |
230 | ossaudiodev (Linux, FreeBSD) | Deprecated: Access to OSS-compatible audio de... |
237 | pipes (Unix) | Deprecated: A Python interface to Unix shell ... |
275 | smtpd | Deprecated: A SMTP server implementation in P... |
277 | sndhdr | Deprecated: Determine type of a sound file. |
280 | spwd (Unix) | Deprecated: The shadow password database (get... |
289 | sunau | Deprecated: Provide an interface to the Sun A... |
298 | telnetlib | Deprecated: Telnet client class. |
347 | uu | Deprecated: Encode and decode files in uuenco... |
369 | xdrlib | Deprecated: Encoders and decoders for the Ext... |
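The filter above relies on the description column having no missing values, which the earlier dropna() call ensures. As a hedged sketch of what one might do when missing values remain, the following uses a small hypothetical DataFrame named toy and the na parameter of .str.startswith so that the mask stays purely boolean.

```python
import pandas as pd

# Hypothetical data; the last row has a missing description.
toy = pd.DataFrame(
    {
        "module": ["aifc", "abc", "mystery"],
        "description": ["Deprecated: Read audio files.", "Abstract base classes.", None],
    }
)

# na=False treats a missing description as "does not match".
mask = toy.loc[:, "description"].str.startswith("Deprecated:", na=False)
deprecated = toy.loc[mask, :]

print(mask.tolist())                  # [True, False, False]
print(deprecated["module"].tolist())  # ['aifc']
```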
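Next, consider removing the string Deprecated: from the description column. The .str.replace method takes the string to be replaced as its first argument and the replacement string as its second. Here we pass regex=False to turn off regular-expression pattern matching, so the first argument is treated as a literal string.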
df.loc[:, "description"].str.replace("Deprecated: ", "", regex=False)
2 Future statement definitions
3 The environment where top-level code is run. C...
4 Low-level threading API.
7 Abstract base classes according to :pep:`3119`.
8 Read and write audio files in AIFF or AIFC fo...
...
387 Manage executable Python zip archives
388 Read and write ZIP-format archive files.
389 Support for importing Python modules from ZIP ...
390 Low-level interface to compression and decompr...
391 IANA time zone support
Name: description, Length: 331, dtype: object
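The regex argument matters whenever the search string contains regular-expression metacharacters such as parentheses. The sketch below uses a hypothetical two-element Series to compare both settings; the pattern " (Unix)" is chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample values modeled on the module column.
modules = pd.Series(["crypt (Unix)", "zlib"])

# Literal matching: the text " (Unix)" is removed as written.
literal = modules.str.replace(" (Unix)", "", regex=False)

# Regex matching: the parentheses form a group, so the pattern looks for
# " Unix" (without parentheses), which does not occur in the data.
as_regex = modules.str.replace(" (Unix)", "", regex=True)

print(literal.tolist())   # ['crypt', 'zlib']
print(as_regex.tolist())  # ['crypt (Unix)', 'zlib']
```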
The .str.split method splits each string into a list. When no argument is given, it splits on whitespace.
df.loc[:, "description"].str.split()
2 [Future, statement, definitions]
3 [The, environment, where, top-level, code, is,...
4 [Low-level, threading, API.]
7 [Abstract, base, classes, according, to, :pep:...
8 [Deprecated:, Read, and, write, audio, files, ...
...
387 [Manage, executable, Python, zip, archives]
388 [Read, and, write, ZIP-format, archive, files.]
389 [Support, for, importing, Python, modules, fro...
390 [Low-level, interface, to, compression, and, d...
391 [IANA, time, zone, support]
Name: description, Length: 331, dtype: object
Passing expand=True spreads the split strings across columns. Rows with fewer pieces are padded with None.
df.loc[:, "description"].str.split(expand=True)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Future | statement | definitions | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
3 | The | environment | where | top-level | code | is | run. | Covers | command-line | interfaces, | import-time | behavior, | and | ``__name__ | == | '__main__'``. | None |
4 | Low-level | threading | API. | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
7 | Abstract | base | classes | according | to | :pep:`3119`. | None | None | None | None | None | None | None | None | None | None | None |
8 | Deprecated: | Read | and | write | audio | files | in | AIFF | or | AIFC | format. | None | None | None | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | Manage | executable | Python | zip | archives | None | None | None | None | None | None | None | None | None | None | None | None |
388 | Read | and | write | ZIP-format | archive | files. | None | None | None | None | None | None | None | None | None | None | None |
389 | Support | for | importing | Python | modules | from | ZIP | archives. | None | None | None | None | None | None | None | None | None |
390 | Low-level | interface | to | compression | and | decompression | routines | compatible | with | gzip. | None | None | None | None | None | None | None |
391 | IANA | time | zone | support | None | None | None | None | None | None | None | None | None | None | None | None | None |
331 rows × 17 columns
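With expand=True the number of columns follows the row with the most pieces, which is why the result above has 17 columns. One way to keep the width bounded is the n parameter of .str.split, which limits the number of splits. The sketch below assumes a hypothetical two-element Series, and the column names first_word and rest are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values modeled on the description column.
desc = pd.Series(["Low-level threading API.", "IANA time zone support"])

# n=1 allows at most one split, so expand=True yields at most two columns.
parts = desc.str.split(n=1, expand=True)
parts.columns = ["first_word", "rest"]  # hypothetical column names

print(parts["first_word"].tolist())  # ['Low-level', 'IANA']
print(parts["rest"].tolist())        # ['threading API.', 'time zone support']
```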
The .str accessor also accepts subscripts, so slice notation can be used. The following code takes the first three characters of each value in the module column.
df.loc[:, "module"].str[:3]
2 __f
3 __m
4 _th
7 abc
8 aif
...
387 zip
388 zip
389 zip
390 zli
391 zon
Name: module, Length: 331, dtype: object
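Single positions and negative indices work in the same way as slices. As a small sketch on hypothetical values, note that a slice simply returns a shorter string when an element is too short, whereas a single out-of-range position yields a missing value.

```python
import pandas as pd

# Hypothetical sample values of different lengths.
s = pd.Series(["abc", "zlib", "re"])

first_three = s.str[:3]  # slice: short strings are returned as they are
last_char = s.str[-1]    # negative index: last character of each element
fourth_char = s.str[3]   # out of range for "abc" and "re", so missing there

print(first_three.tolist())         # ['abc', 'zli', 're']
print(last_char.tolist())           # ['c', 'b', 'e']
print(fourth_char.isna().tolist())  # [True, False, True]
```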
Exercise 1
From the df object, extract the rows whose module column contains "(Windows)".
 | module | description |
---|---|---|
209 | msilib (Windows) | Deprecated: Creation of Microsoft Installer files, and CAB files. |
210 | msvcrt (Windows) | Miscellaneous useful routines from the MS VC++ runtime. |
358 | winreg (Windows) | Routines and objects for manipulating the Windows registry. |
359 | winsound (Windows) | Access to the sound-playing machinery for Windows. |
# Answer cell
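One possible approach, sketched here on a small hypothetical DataFrame named toy rather than on df, is the .str.contains method. It treats its pattern as a regular expression by default, so regex=False is passed to match the parentheses literally. This is an illustration of the technique, not necessarily the intended answer.

```python
import pandas as pd

# Hypothetical sample values modeled on the module column.
toy = pd.DataFrame({"module": ["msvcrt (Windows)", "zlib", "winreg (Windows)"]})

# "(" and ")" are regex metacharacters; regex=False matches the text literally.
mask = toy.loc[:, "module"].str.contains("(Windows)", regex=False)
windows_only = toy.loc[mask, :]

print(windows_only["module"].tolist())  # ['msvcrt (Windows)', 'winreg (Windows)']
```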