Topic Extraction from Google Play Reviews: Visualizing Topics with LDA and BERTopic [Python]

Introduction

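This article introduces two methods, LDA (Latent Dirichlet Allocation) and BERTopic, for automatically extracting which topics are being discussed in the text of Google Play reviews.
LDA is a classical method based on word co-occurrence statistics. BERTopic is based on contextual sentence embeddings (BERT-style); with a multilingual embedding model it also works on Japanese reviews.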

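Prerequisites: for fetching reviews, see the "data acquisition" article; for visualization and text cleanup, see the "visualization" and "text analysis" articles.
This article reads a local CSV (for example, reviews_genshin_paged.csv).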

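0. Overview of the methods (which one to use?)

LDA: Assumes BoW (word counts); lightweight and fast. Its main hyperparameter, the number of topics (n_components in scikit-learn, set via n_topics in this article), is intuitive. It also runs on a large number of short texts.
→ A good first step when you want a quick, cheap read on the topics.
BERTopic: Groups semantically similar texts using sentence embeddings plus clustering. Robust to paraphrases and synonyms.
→ When you want clusters that reflect contextual similarity. A GPU helps; it runs on CPU too but takes longer.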

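1. Setup

# Required for LDA (scikit-learn)
pip install pandas janome scikit-learn plotly

# Required for BERTopic (heavier: UMAP / HDBSCAN / embeddings)
# The extra is quoted so that shells such as zsh do not expand the brackets.
pip install "bertopic[visualization]" umap-learn hdbscan sentence-transformers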

import re, unicodedata
from collections import Counter
import pandas as pd
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import plotly.express as px
import numpy as np

2. Japanese preprocessing (spelling variants + STOPWORDS)

We use the same normalization and stop words as in the previous article. Adjust them as needed.

# Load the CSV
df = pd.read_csv("reviews_genshin_paged.csv")
df = df.dropna(subset=["content","score"]).copy()
df["score"] = df["score"].astype(int)

# Morphological analysis and normalization
t = Tokenizer(wakati=False)
NORMALIZE_MAP = {"出来る":"できる","出来":"できる","ゲー":"ゲーム","スマ":"スマホ",
                 "おもろい":"面白い","言う":"いう","カク":"カクつく"}
def normalize_surface(s:str)->str:
    s = unicodedata.normalize("NFKC", s)
    s = NORMALIZE_MAP.get(s, s)
    s = re.sub(r"ー{2,}", "ー", s)
    return s

def tokenize_ja(text:str)->list[str]:
    # Strip URLs, digits (half- and full-width), then any remaining symbols
    text = re.sub(r"http\S+|www\.\S+"," ", str(text))
    text = re.sub(r"[0-9０-９]+"," ", text)
    text = re.sub(r"[^\wぁ-んァ-ン一-龥ー]"," ", text)
    toks=[]
    for token in t.tokenize(text):
        # Prefer the dictionary (base) form; fall back to the surface form
        base = token.base_form if token.base_form!="*" else token.surface
        base = normalize_surface(base)
        pos = token.part_of_speech.split(",")[0]
        # Keep nouns, adjectives, and verbs longer than one character
        if pos in {"名詞","形容詞","動詞"} and base and len(base)>1:
            toks.append(base)
    return toks

# STOPWORDS (generic + domain + learned from the corpus)
base_stop = {"する","なる","いる","ある","できる","てる","やる","れる","出来る","いう","思う","多い","すぎる"}
domain_stop_generic = {"ゲーム","ゲー","スマ","スマホ","面白い","楽しい","最高","いい","良い","悪い","最悪","普通","神"}
domain_keepwords = {"ガチャ","課金","キャラ","ストーリー","バグ","音楽","UI","重い","容量"}

df["tokens_raw"] = df["content"].map(tokenize_ja)

def learn_corpus_stopwords(tokens_series, df_threshold=0.3, keepwords=None):
    keepwords = keepwords or set(); N=len(tokens_series)
    df_counts=Counter()
    for toks in tokens_series: df_counts.update(set(toks))
    return {w for w,c in df_counts.items() if c/N>=df_threshold and w not in keepwords}

auto_stop = learn_corpus_stopwords(df["tokens_raw"], 0.3, domain_keepwords)
STOPWORDS = base_stop | domain_stop_generic | auto_stop

def filter_tokens(tokens): return [w for w in tokens if w not in STOPWORDS]
df["tokens"] = df["tokens_raw"].map(filter_tokens)

# Space-separated document form
df["doc"] = df["tokens"].map(lambda xs: " ".join(xs))
docs = df["doc"].tolist()

# Note: the right STOPWORDS differ from title to title. It is hard to get them perfect on the first pass, so adjust them while looking at the results.
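
To make the effect of df_threshold more concrete, here is a minimal sketch on a toy corpus. The token lists, the keepword set, and the 0.5 threshold are made up for illustration; the logic follows the same document-frequency idea as learn_corpus_stopwords above, but it is not the article's tuned configuration.

```python
from collections import Counter

def document_frequency(token_lists):
    """Return {token: share of documents that contain it}."""
    n_docs = len(token_lists)
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # count each token once per document
    return {tok: c / n_docs for tok, c in counts.items()}

# Toy, already-tokenized reviews (hypothetical data, not from the CSV)
toy_tokens = [
    ["ゲーム", "ガチャ", "渋い"],
    ["ゲーム", "容量", "重い"],
    ["ゲーム", "ガチャ", "課金"],
    ["ゲーム", "ストーリー", "好き"],
]
keepwords = {"ガチャ"}   # words to protect even if they are frequent
threshold = 0.5          # illustrative value only

df_share = document_frequency(toy_tokens)
auto_stop_demo = {
    tok for tok, share in df_share.items()
    if share >= threshold and tok not in keepwords
}
print(df_share["ゲーム"], df_share["ガチャ"])
print(auto_stop_demo)
```

Here ゲーム appears in every document and is flagged, while ガチャ reaches the threshold but survives because it is listed as a keepword. Printing the highest document-frequency tokens of the real corpus in the same way is a quick check before committing to a threshold.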

3. LDA (Latent Dirichlet Allocation)

We vectorize the corpus as BoW and decompose it probabilistically into n_topics=10 "topics".

3-1. Code (K=10 + donut chart)

# CountVectorizer (fed the pre-tokenized Japanese documents)
def tokenizer_for_vectorizer(text):  # text is an already preprocessed doc; STOPWORDS are filtered again as a safeguard
    return [w for w in text.split() if w and w not in STOPWORDS]

vectorizer = CountVectorizer(
    tokenizer=tokenizer_for_vectorizer, token_pattern=None,
    min_df=5, max_df=0.9, ngram_range=(1,1)   # 1-grams recommended for LDA
)
X = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())

# ---- LDA: K=10 ----
n_topics = 10
lda = LatentDirichletAllocation(
    n_components=n_topics, learning_method="batch",
    random_state=42, max_iter=20, evaluate_every=5
)
W = lda.fit_transform(X)   # document x topic probabilities
H = lda.components_        # topic x word weights

# Print the top words of each topic
topn = 10
topic_words = []
for k in range(n_topics):
    idx = H[k].argsort()[::-1][:topn]
    words = vocab[idx]
    topic_words.append((k, ", ".join(words)))
topic_df = pd.DataFrame(topic_words, columns=["topic_id","top_words"])
topic_df.to_csv("lda_topics_k10.csv", index=False, encoding="utf-8-sig")
print(topic_df)

# Dominant topic of each document
doc_topic = pd.DataFrame({"topic_id": W.argmax(axis=1), "prob": W.max(axis=1)})
df_lda = pd.concat([df[["content","score"]].reset_index(drop=True), doc_topic], axis=1)
df_lda.to_csv("lda_document_topics.csv", index=False, encoding="utf-8-sig")

# ---- Topic share (donut) ----
ratio = doc_topic["topic_id"].value_counts(normalize=True).sort_index().rename("ratio").reset_index()
ratio.columns = ["topic_id","ratio"]

# Sort descending and build the desired order list
ratio_sorted = ratio.sort_values("ratio", ascending=False).copy()
ratio_sorted["topic_id"] = ratio_sorted["topic_id"].astype(str)
order = ratio_sorted["topic_id"].tolist()

fig = px.pie(
    ratio_sorted,
    names="topic_id",
    values="ratio",
    title="LDA (K=10): topic share (descending)",
    hole=0.6,
    category_orders={"topic_id": order},  # pin the order
)

# Turn off automatic re-sorting
fig.update_traces(sort=False, textposition="inside", texttemplate="%{percent:.1%}")

# Keep the legend in DataFrame order as well
fig.update_layout(template="plotly_white", height=700, legend_traceorder="normal")

fig.show()
fig.write_html("lda_k10_topic_ratio_desc.html", include_plotlyjs="cdn")

LDA (K=10): topic share (descending). The donut segments are ordered from the largest share to the smallest.

3-2. Top words per topic (sample) and interpretation

topic_id                                          top_words
0         0  ストーリー, キャラ, グラフィック, 綺麗, キャラクター, すごい, 好き, 世界, こ...
1         1      キャラ, 課金, ガチャ, こと, 強い, ない, 育成, イベント, ストーリー, 武器
2         2        ワールド, オープン, せる, 運営, こと, これ, ない, 楽しむ, 期待, 要素
3         3  容量, 重い, ダウンロード, 時間, gb, アプリ, インストール, 大きい, データ,...
4         4   プレイ, pc, ps, カクつく, android, 重い, 操作, スペック, 対応, 端末
5         5         操作, 戦闘, 移動, 攻撃, ない, ボタン, にくい, よう, 難しい, 慣れる
6         6     ランク, 世界, レベル, 冒険, 上がる, 可愛い, 上げる, 素材, game, the
7         7    ログイン, 画面, ムービー, よう, バグ, アカウント, 欲しい, しまう, 最初, 機能
8         8        ガチャ, 課金, 高い, 出る, クオリティ, 無料, キャラ, ない, 渋い, 天井
9         9         ゼルダ, 樹脂, ゴミ, 評価, 回復, それ, 情報, これ, レビュー, みたい

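Topic 0 (overall impressions / world): general praise for the story, characters, and graphics.
Topic 1 (progression / spending): discussion of character building, weapons, events, and monetization balance.
Topic 2 (open-world experience / expectations): open-world immersion, plus expectations and impressions about the operator and game features.
Topic 3 (storage / download time): large install size, download and patch time, network and device load.
Topic 4 (controls × platform): controls and performance by environment, such as PC, PS, and Android.
Topic 5 (difficult controls / combat UI): feedback on awkward controls, combat, and button layout.
Topic 6 (progression / materials): adventure rank, material gathering, 可愛い (cute), and other progression and collection topics.
Topic 7 (login / bugs): login, initial screen, cutscenes, bugs, and account problems.
Topic 8 (gacha / pricing): stingy gacha rates, the pity ceiling (天井), price perception, and free-to-play versus quality.
Topic 9 (comparisons with other games / jargon): comparisons with other titles such as ゼルダ (Zelda), meta discussion, and mixed slang.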

Note: in practice, try several topic counts and adopt the value whose top words make sense and whose granularity is appropriate.

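4. BERTopic (BERT embeddings + clustering)

Texts are converted into multilingual embeddings and then clustered into topics. Its strength is robustness to paraphrases and synonymous expressions.

4-1. Code (error-avoiding version)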

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

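# Multilingual model (supports Japanese, lightweight)
embed_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# docs is a list of preprocessed, space-separated strings (e.g. docs = df["doc"].tolist())
# Note: BERTopic applies this vectorizer to per-topic concatenated documents,
# so min_df / max_df here count topics rather than individual reviews.
cv = CountVectorizer(
    tokenizer=str.split,
    token_pattern=None,
    min_df=5, max_df=0.9,
    ngram_range=(1,2),
    stop_words=None           # STOPWORDS are assumed to be removed in preprocessing
)

topic_model = BERTopic(
    embedding_model=embed_model,
    language="multilingual",
    vectorizer_model=cv,
    calculate_probabilities=True,
    verbose=True,
    seed_topic_list=None,
    nr_topics=None            # no reduction (the clustering decides the count). To merge topics, use "auto" or an integer
)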

topics, probs = topic_model.fit_transform(docs)

# Topic overview table
info = topic_model.get_topic_info()
info.to_csv("bertopic_topics.csv", index=False, encoding="utf-8-sig")
print(info.head())

# Top words of each topic (first 10 rows of the table)
for tid in info["Topic"].head(10):
    if tid == -1:  # outlier
        continue
    print(tid, topic_model.get_topic(tid)[:10])

# Per-document assignments
docinfo = topic_model.get_document_info(docs)
docinfo = pd.concat([df[["content","score"]].reset_index(drop=True), docinfo], axis=1)
docinfo.to_csv("bertopic_document_topics.csv", index=False, encoding="utf-8-sig")

# Visualization (Plotly): HTML output
fig_ts = topic_model.visualize_topics()
fig_ts.write_html("bertopic_visualize_topics.html", include_plotlyjs="cdn")

fig_hier = topic_model.visualize_hierarchy()
fig_hier.write_html("bertopic_hierarchy.html", include_plotlyjs="cdn")

fig_bar = topic_model.visualize_barchart(top_n_topics=12)
fig_bar.write_html("bertopic_barchart.html", include_plotlyjs="cdn")

4-2. Sample output and interpretation

   Topic  Count                                          Name
0     -1   6954                        -1_ストーリー_キャラ_グラフィック_容量
1      0   2373                            0_オリジナル_バーバラ_of_it
2      1    792                              1_キャラ_イベント_強い_育成
3      2    333  2_android_android コントローラー_コントローラー_コントローラー 対応
4      3    184                        3_pc_pc プレイ_パソコン_重い pc

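Topic -1 (outlier cluster): a catch-all "other" bucket mixing all kinds of words. When it is large, re-fitting on the -1 documents alone can sometimes surface meaningful topics.
Topic 0 (proper nouns / events): proper-noun-heavy terms such as オリジナル (original) and バーバラ (Barbara). Talk about specific characters and events.
Topic 1 (progression / events / spending): character strengthening, events, weapons, pity ceiling. Fairly consistent with LDA Topics 1 and 8.
Topic 2 (Android × controller): device dependence and input-device support. Sharing whether controllers work and the support status.
Topic 3 (PC controls and performance): PC play, heaviness, purchase flow, and other PC-version experiences.
Topic 5 (download time / initial download): complaints about the length of the first download, data volume, and time cost.
Topic 6 (multi-platform): linking and sharing across PS/Switch/PC, and comparison of how the controls feel.
Topic 7 (RPG / open-world experience): quality as an RPG and the open-world experience.
Topic 8 (nationality / security concerns): political and security topics such as 中国 (China), スパイ (spy), and 個人情報 (personal information).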

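Note: because BERTopic clusters by closeness of meaning vectors, in this run it separated device, input-hardware, proper-noun, and political themes more cleanly than LDA did.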
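
The idea of giving the -1 bucket a second pass can be sketched as below. The helper name split_outliers and the toy values are hypothetical; in the article, the documents and topic ids would come from fit_transform(docs). The second BERTopic fit is left as a comment because it needs the full model setup.

```python
from collections import Counter

def split_outliers(docs, topics, outlier_id=-1):
    """Split documents into (assigned, outliers) by topic id."""
    assigned = [d for d, t in zip(docs, topics) if t != outlier_id]
    outliers = [d for d, t in zip(docs, topics) if t == outlier_id]
    return assigned, outliers

# Hypothetical example values standing in for `docs` and `topics`
toy_docs = ["ガチャ 渋い", "容量 重い", "ストーリー 好き", "pc 重い", "キャラ 育成"]
toy_topics = [1, -1, -1, 3, 1]

assigned, outliers = split_outliers(toy_docs, toy_topics)
outlier_share = len(outliers) / len(toy_docs)
print(Counter(toy_topics).most_common())
print(f"outlier share: {outlier_share:.0%}")

# Second pass (not run here): fit a fresh model on the outliers only, e.g.
# second_model = BERTopic(embedding_model=embed_model, vectorizer_model=cv)
# topics2, _ = second_model.fit_transform(outliers)
```

Checking the outlier share first helps decide whether a second pass is worth the time. Recent BERTopic versions also offer a reduce_outliers method that reassigns -1 documents to existing topics; check the documentation for the installed version before relying on it.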

5. Time series × topics (when did each topic grow?)

We aggregate the extracted topics by month to see which topics increased and when.

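# Review date -> month
# Rows are not dropped here, so df stays aligned row-for-row with df_lda and docinfo.
# Unparseable dates become NaN months, which groupby excludes by default.
df["at"] = pd.to_datetime(df["at"], errors="coerce")
df["month"] = df["at"].dt.strftime("%Y-%m")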

# Dominant LDA topic x month
df_lda["month"] = df["month"].values
lda_month = df_lda.groupby(["month","topic_id"]).size().reset_index(name="count")
lda_month["ratio"] = lda_month.groupby("month")["count"].transform(lambda x: x/x.sum())

fig = px.area(lda_month, x="month", y="ratio", color="topic_id",
              title="LDA: monthly trend of topic share", groupnorm="fraction")
fig.update_layout(template="plotly_white")
fig.write_html("lda_topics_by_month.html", include_plotlyjs="cdn")

# BERTopic ("Topic" column of docinfo) x month, outliers excluded
docinfo["month"] = df["month"].values
bt_month = docinfo[docinfo["Topic"]!=-1].groupby(["month","Topic"]).size().reset_index(name="count")
bt_month["ratio"] = bt_month.groupby("month")["count"].transform(lambda x: x/x.sum())

fig2 = px.area(bt_month, x="month", y="ratio", color="Topic",
               title="BERTopic: monthly trend of topic share", groupnorm="fraction")
fig2.update_layout(template="plotly_white")
fig2.write_html("bertopic_topics_by_month.html", include_plotlyjs="cdn")
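
One practical detail with stacked area charts: if a topic has no reviews in some month, that (month, topic) row is simply missing from the groupby result, and the stacked area can look distorted around the gap. A minimal sketch of filling such gaps with explicit zeros follows; the toy counts are hypothetical, and the column names mirror lda_month above.

```python
import pandas as pd

# Toy counts with a gap: topic 1 has no reviews in 2024-02 (hypothetical data)
toy = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-03", "2024-03"],
    "topic_id": [0, 1, 0, 0, 1],
    "count": [3, 1, 4, 2, 2],
})

# Wide table (month x topic) with explicit zeros for missing pairs
wide = toy.pivot_table(index="month", columns="topic_id", values="count",
                       aggfunc="sum", fill_value=0)

# Per-month shares, then back to long form for plotting
ratio_wide = wide.div(wide.sum(axis=1), axis=0)
filled = (ratio_wide.reset_index()
          .melt(id_vars="month", var_name="topic_id", value_name="ratio")
          .sort_values(["month", "topic_id"]))
filled["topic_id"] = filled["topic_id"].astype(str)  # treat topics as categories
print(filled)
```

The resulting long table has one row per month and topic, so it can be passed to px.area in place of the raw groupby output.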

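Notes: hyperparameters and operation

Tuning STOPWORDS: it is hard to get them perfect on the first pass, so update them step by step while looking at the results.
Randomness: both LDA and BERTopic give different results depending on initialization. Fix random_state for LDA; BERTopic has no single seed argument, so pass a UMAP instance with random_state set via umap_model.
Number of topics: for LDA, try several values between 5 and 12 and adopt the granularity that is easiest to interpret. For BERTopic, leave nr_topics unset or use it to merge topics ("auto" or an integer).
Interpretation and naming: in practice, a human names each topic based on the extracted top words.
Speed: validate on a sample first, then adjust thresholds (min_df) and ngram_range, then run on the full data.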

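Summary

LDA: lightweight and fast. A BoW-based bird's-eye view of the main topic structure of the reviews. Here we used K=10 and a donut chart to show the shares at a glance.
BERTopic: groups similar texts by context and extracts concrete clusters such as devices, proper nouns, and political themes.
Combining either method with monthly aggregation makes it easier to pinpoint topics that grew in a specific period.
Tune STOPWORDS and spelling normalization step by step, and grow accuracy to fit each title.
Unsupervised learning has no single correct answer. The goal is to read the top words and representative documents, interpret and name the topics, and turn them into improvement actions and report design.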