
Processing Text Data

This section shows how to process text data with pandas.

As sample data, we load the module index from the official Python documentation into a DataFrame.

import pandas as pd

df = (
    pd.read_html("https://docs.python.org/3/py-modindex.html")[0]
    .drop(0, axis=1)
    .rename({1: "module", 2: "description"}, axis=1)
    .dropna()
)
df
module description
2 __future__ Future statement definitions
3 __main__ The environment where top-level code is run. C...
4 _thread Low-level threading API.
7 abc Abstract base classes according to :pep:`3119`.
8 aifc Deprecated: Read and write audio files in AIF...
... ... ...
387 zipapp Manage executable Python zip archives
388 zipfile Read and write ZIP-format archive files.
389 zipimport Support for importing Python modules from ZIP ...
390 zlib Low-level interface to compression and decompr...
391 zoneinfo IANA time zone support

331 rows × 2 columns

The .str accessor

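A Series provides the .str accessor, which operates on the string held in each element. Through the .str accessor you can call methods equivalent to those of Python's built-in str type.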

df.loc[:, "description"].str
<pandas.core.strings.accessor.StringMethods at 0x7fe307ee3350>
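
To make the element-wise behavior concrete, here is a minimal sketch on a small hand-made Series. The Series and its variable names are illustrative assumptions, not part of the module index data; the methods used (.str.upper and .str.len) mirror str.upper and the built-in len.

```python
import pandas as pd

# Hypothetical toy data standing in for the "description" column.
descriptions = pd.Series(
    ["Low-level threading API.", "IANA time zone support"],
    name="description",
)

# Each .str method is applied to every element and returns a new Series.
upper = descriptions.str.upper()
lengths = descriptions.str.len()

print(upper.tolist())
print(lengths.tolist())
```

The original Series is left unchanged; each call produces a new Series with the same index.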

Consider extracting the rows whose "description" column starts with the string Deprecated:. The .str accessor offers a method equivalent to the startswith method of Python's str type.

df.loc[:, "description"].str.startswith("Deprecated:")
2      False
3      False
4      False
7      False
8       True
       ...  
387    False
388    False
389    False
390    False
391    False
Name: description, Length: 331, dtype: bool

Because the .str.startswith method returns a Series of boolean values, passing that Series to the .loc indexer extracts the rows where the value is True.

df.loc[df.loc[:, "description"].str.startswith("Deprecated:"), :]
module description
8 aifc Deprecated: Read and write audio files in AIF...
12 asynchat Deprecated: Support for asynchronous command/...
14 asyncore Deprecated: A base class for developing async...
16 audioop Deprecated: Manipulate raw audio data.
28 cgi Deprecated: Helpers for running Python script...
29 cgitb Deprecated: Configurable traceback handler fo...
30 chunk Deprecated: Module to read IFF chunks.
48 crypt (Unix) Deprecated: The crypt() function used to chec...
171 imghdr Deprecated: Determine the type of image conta...
172 imp Deprecated: Access the implementation of the ...
203 mailcap Deprecated: Mailcap file handling.
209 msilib (Windows) Deprecated: Creation of Microsoft Installer f...
221 nis (Unix) Deprecated: Interface to Sun's NIS (Yellow Pa...
222 nntplib Deprecated: NNTP protocol client (requires so...
227 optparse Deprecated: Command-line option parsing library.
230 ossaudiodev (Linux, FreeBSD) Deprecated: Access to OSS-compatible audio de...
237 pipes (Unix) Deprecated: A Python interface to Unix shell ...
275 smtpd Deprecated: A SMTP server implementation in P...
277 sndhdr Deprecated: Determine type of a sound file.
280 spwd (Unix) Deprecated: The shadow password database (get...
289 sunau Deprecated: Provide an interface to the Sun A...
298 telnetlib Deprecated: Telnet client class.
347 uu Deprecated: Encode and decode files in uuenco...
369 xdrlib Deprecated: Encoders and decoders for the Ext...
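
The same filtering pattern can be sketched on a tiny hand-made DataFrame. The data below is an illustrative assumption. It also includes a missing description to show the na parameter of .str.startswith, which is not needed in the example above because dropna() was already applied there.

```python
import pandas as pd

# Hypothetical toy data; the last description is deliberately missing.
toy = pd.DataFrame(
    {
        "module": ["aifc", "abc", "zlib"],
        "description": [
            "Deprecated: Read and write audio files",
            "Abstract base classes",
            None,
        ],
    }
)

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = toy.loc[:, "description"].str.startswith("Deprecated:", na=False)

deprecated = toy.loc[mask, :]   # rows where the mask is True
current = toy.loc[~mask, :]     # ~ inverts the mask to get the remaining rows

print(deprecated)
print(current)
```

Keeping the mask in a variable makes it easy to reuse, for example to select the complementary rows with ~mask.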

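Next, consider removing the string Deprecated: from the "description" column. Pass the string to be replaced as the first argument of the .str.replace method and the replacement string as the second argument. Here regex=False is passed to turn off regular-expression pattern matching.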

df.loc[:, "description"].str.replace("Deprecated: ", "", regex=False)
2                           Future statement definitions
3      The environment where top-level code is run. C...
4                               Low-level threading API.
7        Abstract base classes according to :pep:`3119`.
8       Read and write audio files in AIFF or AIFC fo...
                             ...                        
387                Manage executable Python zip archives
388             Read and write ZIP-format archive files.
389    Support for importing Python modules from ZIP ...
390    Low-level interface to compression and decompr...
391                               IANA time zone support
Name: description, Length: 331, dtype: object
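
The effect of the regex argument is easiest to see with a string that contains regular-expression metacharacters. The following is a minimal sketch on hand-made values modeled on the "module" column; the pattern used in the regex=True call is an illustrative assumption, not something taken from the text above.

```python
import pandas as pd

# Hypothetical toy data modeled on the "module" column.
modules = pd.Series(["crypt (Unix)", "winreg (Windows)", "zlib"])

# regex=False: the parentheses are matched as ordinary characters.
literal = modules.str.replace(" (Unix)", "", regex=False)

# regex=True: the pattern removes any trailing parenthesized note.
by_pattern = modules.str.replace(r"\s*\(.*\)$", "", regex=True)

print(literal.tolist())
print(by_pattern.tolist())
```

With regex=False only the exact text " (Unix)" is removed, while the regular-expression version strips both platform notes.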

The .str.split method splits each string into a list. When no argument is given, the string is split on whitespace.

df.loc[:, "description"].str.split()
2                       [Future, statement, definitions]
3      [The, environment, where, top-level, code, is,...
4                           [Low-level, threading, API.]
7      [Abstract, base, classes, according, to, :pep:...
8      [Deprecated:, Read, and, write, audio, files, ...
                             ...                        
387          [Manage, executable, Python, zip, archives]
388      [Read, and, write, ZIP-format, archive, files.]
389    [Support, for, importing, Python, modules, fro...
390    [Low-level, interface, to, compression, and, d...
391                          [IANA, time, zone, support]
Name: description, Length: 331, dtype: object
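
Besides the default whitespace split, .str.split accepts a separator (pat) and a maximum number of splits (n). The sketch below uses hand-made data as an illustrative assumption and splits only at the first ": ", then picks the last piece of each list by indexing the .str accessor.

```python
import pandas as pd

# Hypothetical toy data modeled on the "description" column.
descriptions = pd.Series(
    ["Deprecated: Telnet client class.", "IANA time zone support"]
)

# Split on ": " at most once; strings without the separator stay in one piece.
parts = descriptions.str.split(pat=": ", n=1)

# Indexing the .str accessor also works on the resulting lists.
last_piece = parts.str[-1]

print(parts.tolist())
print(last_piece.tolist())
```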

Passing expand=True expands the split strings into separate columns.

df.loc[:, "description"].str.split(expand=True)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2 Future statement definitions None None None None None None None None None None None None None None
3 The environment where top-level code is run. Covers command-line interfaces, import-time behavior, and ``__name__ == '__main__'``. None
4 Low-level threading API. None None None None None None None None None None None None None None
7 Abstract base classes according to :pep:`3119`. None None None None None None None None None None None
8 Deprecated: Read and write audio files in AIFF or AIFC format. None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
387 Manage executable Python zip archives None None None None None None None None None None None None
388 Read and write ZIP-format archive files. None None None None None None None None None None None
389 Support for importing Python modules from ZIP archives. None None None None None None None None None
390 Low-level interface to compression and decompression routines compatible with gzip. None None None None None None None
391 IANA time zone support None None None None None None None None None None None None None

331 rows × 17 columns
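
When only the first few pieces are of interest, combining expand=True with the n argument keeps the number of columns small. The following is a minimal sketch on hand-made data; the column names "first_word" and "rest" are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the last value has no whitespace to split on.
descriptions = pd.Series(
    ["Low-level threading API.", "IANA time zone support", "zlib"]
)

# n=1 splits at most once, so expand=True yields exactly two columns.
expanded = descriptions.str.split(n=1, expand=True)
expanded.columns = ["first_word", "rest"]  # hypothetical column names

print(expanded)
```

Rows with fewer pieces than columns are padded with missing values, just as in the 17-column result above.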

Indexing the .str accessor lets you use slice notation. The following code takes the first three characters of each value in the "module" column.

df.loc[:, "module"].str[:3]
2      __f
3      __m
4      _th
7      abc
8      aif
      ... 
387    zip
388    zip
389    zip
390    zli
391    zon
Name: module, Length: 331, dtype: object
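
A few more indexing variants, sketched on hand-made module names (the data is an illustrative assumption). A single integer index returns one character per element, negative indexes count from the end, and .str.slice is the method form of slice notation.

```python
import pandas as pd

# Hypothetical toy data modeled on the "module" column.
modules = pd.Series(["zipfile", "zlib", "abc"])

first_three = modules.str[:3]            # slice notation
last_char = modules.str[-1]              # single negative index
same_slice = modules.str.slice(0, 3)     # method equivalent of [:3]

print(first_three.tolist())
print(last_char.tolist())
print(same_slice.tolist())
```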

Exercise 1

From the "module" column of the df object, extract the rows that contain "(Windows)".

module description
209 msilib (Windows) Deprecated: Creation of Microsoft Installer files, and CAB files.
210 msvcrt (Windows) Miscellaneous useful routines from the MS VC++ runtime.
358 winreg (Windows) Routines and objects for manipulating the Windows registry.
359 winsound (Windows) Access to the sound-playing machinery for Windows.
# Answer cell