arXiv	https://arxiv.org/abs/2411.15611
Paper license	http://creativecommons.org/licenses/by/4.0/

Knowledge Transfer Across Modalities with Natural Language Supervision

Carlo Alberto Barbano
University of Turin

Luca Molinaro
University of Turin

Emanuele Aiello
Politecnico di Torino

Marco Grangetto
University of Turin

Abstract

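We present a way to learn novel concepts using only their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that a pre-trained visual encoder has already learned enough low-level features (e.g. shape, appearance, color) to describe previously unknown high-level concepts. Given a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can introduce novel concepts into multimodal models very efficiently, requiring only a single description of the target concept. Our approach is compatible both with models that have separate textual and visual encoders (e.g. CLIP) and with models that share parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. By leveraging Knowledge Transfer, we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.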

1 Introduction

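Would a blind person who gains sight be able to recognize objects previously known only through touch? This is the philosophical riddle that William Molyneux posed to John Locke in 1688 [37], and it has remained relevant in vision neuroscience for decades. Recent studies show that, although this does not happen immediately after sight restoration, human subjects rapidly develop the cross-modal mapping within a few days [14]. Recent work on multimodal neural networks has focused on this cross-modal interaction [50]; in this paper we aim to answer a slightly revised version of Molyneux's riddle, in which the model already has some visual knowledge of the world. We hypothesize that prior knowledge of low-level visual features is sufficient to produce a plausible visual representation of an unknown concept when a descriptive text is provided. This prior knowledge can be obtained through multimodal pre-training, for example with image-text alignment as done in CLIP and similar works [45, 11, 61, 26].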

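We call Knowledge Transfer the process of learning novel visual concepts by leveraging natural language supervision. An illustrative example of the goal of Knowledge Transfer is shown in Fig. 1, where a CLIP-based zero-shot classifier is confronted with an unknown concept. We propose two possible ways of achieving Knowledge Transfer, explicitly or implicitly. In Explicit Knowledge Transfer, starting from the textual description of a novel concept, we synthesize matching images via model inversion [24]; these can later be used to finetune the model with a visual-text matching loss. Implicit Knowledge Transfer, on the other hand, could rely on multimodal neurons [50] to finetune the model with masked language modeling, using only textual captions. This, however, requires parameters to be shared between the visual and text encoders. In this paper we focus on Explicit Knowledge Transfer, which places looser requirements on the model architecture. Our findings are the following: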

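1. Novel concepts can be successfully introduced into pre-trained visual models using only textual descriptions.
2. Knowledge Transfer can also improve visual accuracy on existing concepts.
3. Knowledge Transfer improves zero-shot downstream tasks such as classification, segmentation, and image-text retrieval, and shows potential for out-of-domain generalization.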

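Figure 1: (a) CLIP (B), top-3 zero-shot predictions: ("triumphal arch", "stone wall", "steel arch bridge"). (b) CLIP (B) with Knowledge Transfer, top-3 zero-shot predictions: ("moongate", "triumphal arch", "stone wall").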

2 Related Works

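Research on multimodal representation learning aims to bridge the gap between different modalities (e.g. visual and linguistic representations), so that models can process them jointly. Vision-language models (VLMs) such as CLIP [45], CoCa [61], Flamingo [3], and ImageBind [11] align visual and linguistic features in a shared embedding space, enabling zero-shot and few-shot learning on a variety of visual tasks. Our understanding of how VLMs process cross-modal information internally has advanced through the study of multimodal neurons [12, 50, 43]. Schwettmann et al. [50] identified specific neurons that integrate the visual and linguistic modalities, improving model interpretability. By highlighting the existence of multimodal neurons, this insight guides the intuition behind our method.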

Cross-Modal Knowledge Transfer

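Cross-modal knowledge distillation [13, 53, 18] is a strategy for transferring knowledge across modalities to enrich representations. Methods such as VidLanKD [53] and C2KD [18] adopt techniques that bridge modalities to improve generalization in zero-shot and few-shot scenarios. These approaches typically require large amounts of multimodal data and complex training procedures. In contrast, our method uses textual descriptions to introduce novel visual concepts with minimal data, focusing on efficient knowledge integration without the need for large multimodal datasets. Alternative approaches [7, 65, 20] generate synthetic data to train discriminative models. For example, [65] leverages Stable Diffusion [47] to create diverse training samples and address data scarcity. While effective, this approach depends on the quality and diversity of the generated data. Our approach differs in that it integrates new concepts into an existing model without relying on the external knowledge of a text-to-image generative model or on computationally expensive data generation pipelines. Text-only training methods have also been proposed to improve visual understanding. For example, CapDec [41] exploits the alignment between the visual and text encoders of CLIP to improve captioning using only textual caption data. Our approach, by contrast, aims to achieve such alignment for unknown concepts by leveraging free-form textual descriptions.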

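More recently, few-shot cross-modal learning has been explored [35], showing that concept learning can be enhanced by integrating cues from multiple modalities, which mirrors human learning. That approach leverages few-shot examples of paired multimodal data to enhance unimodal downstream tasks. In this paper, unlike that work, we use unimodal textual data to introduce novel visual knowledge.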

3 Method

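In this section we present our proposed method for Knowledge Transfer. We focus on Explicit Knowledge Transfer, based on inversion, because it is applicable to a wider range of architectures. (Footnote 1: For completeness, a brief description of implicit transfer is provided in the supplementary material.) Fig. 2 gives a graphical overview of the method.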

3.1 Explicit Knowledge Transfer

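Let $f_{T}:\mathbb{R}^{L}\rightarrow\mathbb{R}^{n}$ be the text encoder (where $L$ is the sequence length) and $f_{V}:\mathbb{R}^{w\times h}\rightarrow\mathbb{R}^{n}$ the visual encoder (where $w$ and $h$ are the image dimensions). Our goal is to introduce a novel concept into $f_{V}$ through $f_{T}$, using only the text modality. Let $X_{T}$ be a set of unpaired captions about the novel concept we want to learn, and $X_{V}^{*}$ the set of ideal ground-truth images corresponding to that concept. What we want to achieve is: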

\begin{split}\underbrace{sim(f_{V}(x_{v}^{*}),f_{T}(x_{t}))}_{s_{t}}-\underbrace{sim(f_{V}(x_{v}^{*}),f_{T}(x_{k}))}_{s_{k}}>0\\ \forall x_{v}^{*}\in X_{V}^{*},\,x_{t}\in X_{T},\,x_{k}\in X_{K}\end{split}

(1)

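where $X_{K}$ is the set of all captions about other concepts ($X_{K}\,\cap\,X_{T}=\varnothing$). The condition in Eq. 1 means that every ideal visual sample should be mapped closer to its true corresponding captions than to any other caption. In practice, if $X^{*}_{V}$ were available, Eq. 1 could be satisfied by optimizing an approximation of it [4]: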

\min_{f_{V}}-\frac{1}{|X^{*}_{V}|}\sum_{x^{*}_{v}}\frac{1}{|X_{T}|}\sum_{x_{t}}\log\frac{\exp(s_{t})}{\exp(s_{t})+\sum_{x_{k}}\exp(s_{k})}

(2)

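This corresponds to the InfoNCE loss [45, 6, 25]. In our setting, however, $X^{*}_{V}$ is not available, so we estimate it (e.g. with model inversion [24]) and use the estimate to jointly train $f_{V}$ and $f_{T}$ with a contrastive approach based on InfoNCE [45, 11].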
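To make Eq. 2 concrete, the following is a minimal NumPy sketch of the loss for one batch of image embeddings, target-concept caption embeddings, and other-concept caption embeddings. It assumes that sim is cosine similarity; the `temperature` argument and the function names are illustrative additions (Eq. 2 has no temperature, and the default of 1.0 reproduces it as written). This is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(img_emb, pos_txt_emb, neg_txt_emb, temperature=1.0):
    """Eq. 2: mean over images x_v and target captions x_t of
    -log( exp(s_t) / (exp(s_t) + sum_k exp(s_k)) ).

    img_emb:     (Nv, n) embeddings f_V(x_v)
    pos_txt_emb: (Nt, n) embeddings f_T(x_t) of target-concept captions
    neg_txt_emb: (Nk, n) embeddings f_T(x_k) of other-concept captions
    temperature: hypothetical scaling, not part of Eq. 2
    """
    v = l2_normalize(np.asarray(img_emb, dtype=float))
    t = l2_normalize(np.asarray(pos_txt_emb, dtype=float))
    k = l2_normalize(np.asarray(neg_txt_emb, dtype=float))
    s_t = v @ t.T / temperature                       # (Nv, Nt)
    s_k = v @ k.T / temperature                       # (Nv, Nk)
    neg_sum = np.exp(s_k).sum(axis=1, keepdims=True)  # (Nv, 1)
    loss = -np.log(np.exp(s_t) / (np.exp(s_t) + neg_sum))
    return float(loss.mean())

# Toy check: an image aligned with the target caption has a lower loss
# than one aligned with a caption of another concept.
aligned = info_nce([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
misaligned = info_nce([[0.0, 1.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(aligned < misaligned)
```

Averaging over the full (image, target caption) grid matches the two normalized sums in Eq. 2.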

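As a practical example, a caption $x_{t}$ for the concept of moongate could be "A perfectly circular archway built from uniformly cut stones or bricks, set into a larger wall [...]", as shown in Fig. 3. As stated in the introduction, the method works when the visual encoder already has prior knowledge about the low-level visual attributes contained in the caption, for example when it has been pre-trained jointly with the text encoder. An interesting point to analyze is the amount of prior knowledge required for successful Knowledge Transfer; we leave this question for future work.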

3.1.1 Estimating $X^{*}_{V}$ by inversion

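The most straightforward way to estimate $X^{*}_{V}$ is to invert the visual encoder $f_{V}$ starting from the text embeddings of $X_{T}$, obtaining an approximation $\hat{X}^{*}_{V}\approx X^{*}_{V}$. To do so, we start from random noise and solve the following optimization problem: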

\begin{split}\hat{X}^{*}_{V}=f_{V}^{-1}(f_{T}(X_{T}))\\ \approx\max_{\hat{X}^{*}_{V}}sim(A(f_{V}(\hat{X}^{*}_{V})),f_{T}(X_{T}))+\alpha R(\hat{X}^{*}_{V})\end{split}

(3)

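where, as in [24], $f^{-1}$ is the inversion operator, $A$ is a random augmentation applied at each step (e.g. random affine), and $R$ is a regularization term based on Total Variation (TV) [40], weighted by $\alpha$. Augmentation and regularization help produce more natural-looking images.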
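As an illustration of the objective in Eq. 3, the sketch below evaluates it for a single candidate image. Several choices here are assumptions rather than details from the paper: sim is taken to be cosine similarity, $R$ is taken to be the negative anisotropic total variation (so that adding $\alpha R$ rewards smooth images), the augmentation is applied to the image before encoding, and `encode` and `augment` are hypothetical callables standing in for $f_{V}$ and $A$. The optimization loop itself (gradient ascent from random noise) is omitted.

```python
import numpy as np

def total_variation(img):
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    img = np.asarray(img, dtype=float)
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbours
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbours
    return float(dh + dw)

def cosine_sim(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def inversion_objective(img, text_emb, encode, augment, alpha=0.005):
    """Value to maximise for one candidate image.

    encode, augment: hypothetical stand-ins for f_V and A.
    R is assumed to be -TV(img), so the regulariser penalises noisy images.
    """
    emb = encode(augment(img))
    return cosine_sim(emb, text_emb) + alpha * (-total_variation(img))

# Toy usage with a random linear "encoder" and an identity augmentation.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
encode = lambda x: W @ np.asarray(x, dtype=float).ravel()
augment = lambda x: x
target = rng.normal(size=16)       # plays the role of f_T(x_t)
noise = rng.normal(size=(8, 8))    # inversion starts from random noise
print(inversion_objective(noise, target, encode, augment))
```

With $\alpha=0.005$ the smoothness term acts as a mild tie-breaker between images that match the text embedding equally well.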

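One question that may arise is why we do not use a generative model (e.g. DALL-E [46]) to synthesize training images from the textual descriptions. There are two reasons: i.) we cannot control the training dataset of such generative models, so we cannot know whether the concepts we target are already included in it; ii.) using an external generative model to augment the training set [48, 63, 20, 65, 7] is outside the scope of our research question, which is to transfer the knowledge of a single model from the text modality to the visual modality.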

3.1.2 Finetuning on the new concepts

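Once the images $\hat{X}^{*}_{V}$ have been synthesized through model inversion, they can be used to train $f_{V}$ and $f_{T}$ with an image-text alignment objective such as InfoNCE (Eq. 2). To successfully match the visual features to the desired concept, we augment $X_{T}$ by prepending the corresponding concept name to each caption. For the example above, the finetuning caption becomes "A moongate is a perfectly circular archway built from uniformly cut stones [...]". This step is needed to finally map the low-level visual features to the high-level concept itself.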
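The caption augmentation step can be sketched as a small string helper. The template "A {concept} is ..." is inferred from the single moongate example in the text and is an assumption; the function name is hypothetical.

```python
def prepend_concept(concept, captions):
    """Prefix each caption with the concept name, following the moongate example.

    The "A <concept> is <caption>" template is an assumption based on that example.
    """
    out = []
    for caption in captions:
        caption = caption.strip()
        # Lower-case the first letter so the caption reads as a continuation.
        body = caption[0].lower() + caption[1:] if caption else caption
        out.append(f"A {concept} is {body}")
    return out

print(prepend_concept("moongate", ["A perfectly circular archway built from uniformly cut stones."]))
```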

4 Experiments

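Through an extensive evaluation on different datasets and domains, we aim to rigorously assess the potential of Knowledge Transfer. This section is divided into two parts: the first focuses on learning completely novel, previously unknown concepts, and the second on improving the performance of zero-shot downstream tasks.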

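(a) Moongate. Caption: A perfectly circular archway built from uniformly cut stones or bricks, set into a larger wall. It forms a smooth circle, framing the view of a garden or landscape beyond and creating a picturesque entrance.

(b) Tonometer. Caption: A slender, pen-like probe attached to a small base equipped with precise dials and gauges. This tool is often part of a larger medical apparatus, featuring a metallic finish and a sleek, professional appearance.

Figure 3: Examples of inverted images (top) and real images (bottom) of rare concepts that CLIP struggles to classify correctly.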

4.1 Datasets

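We use a variety of datasets across different domains and downstream tasks. We provide the complete list here, grouped by task. Note that we do not use any training data from these datasets; they are used for testing only. All improvements come from textual descriptions.
Natural image classification
1.) RareConcepts is a collection of images of rare concepts gathered from the web. We release this dataset as part of this work. In our experiments we focus on three concepts (moongate, gyroscope, tonometer) that are relatively unknown to different large multimodal architectures. We collected 10 images per concept.
2.) ImageNet-1k [8] is a large-scale benchmark for visual recognition, containing 1000 classes and 3.2 million natural images.
Medical image classification
3.) CheXpert-5x200c [17] is a dataset of chest X-ray images derived from the large CheXpert dataset [19], with 200 exams for each of the classes atelectasis, cardiomegaly, consolidation, edema, and pleural effusion.
4.) JSRT [52] is a dataset of 154 conventional chest X-ray images containing lung nodules of different types (malignant and benign).
Medical image segmentation
5.) UnitoChest [5] is a collection of 306,440 chest CT slices with nodule segmentation masks. We consider the slices in which a nodule is present, for a total of 4179 images.
6.) UDIAT [59] is a dataset of breast masses in ultrasound images, containing 110 benign and 54 malignant cases.
7.) SIIM Pneumothorax [62] is a chest X-ray dataset for pneumothorax segmentation, released as a challenge in 2019. We consider a total of 500 images.
8.) BraTS23 Glioma [1] is a brain MRI dataset of adult patients with brain glioma. We consider all slices in which a tumor is present, for a total of 14,746 images.
Image-text retrieval and image captioning
9.) Flickr30k [60] is a dataset of 31,783 images collected from Flickr, each associated with five captions provided by human annotators. In our experiments we use the Karpathy test split [23], which contains 1000 images and 5000 captions.
10.) MSCOCO [34] is a large-scale dataset containing more than 330,000 images with textual captions. We use the Karpathy test split [23], which contains 5000 images.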

4.2 Setup

Captioning

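To generate descriptive captions of the novel concepts, we adopt an LLM-based approach. Specifically, for natural images we use Llama-3 Instruct (8B parameters) [2] with the following prompt: "Generate a concise description of the ImageNet class <class name>, without using the word itself. The description must include visual cues that help recognize the subject, with low-level and precise details. Do not include anything other than the description in your answer.", inserting the appropriate class name for each novel concept. Note that we use an LLM for convenience (e.g. to caption all 1000 ImageNet classes); it is not a requirement. For medical data, we in fact use hand-crafted captions based on Radiopaedia [54], combined with elements from ChatGPT-4 [42]. All captions are provided in the supplementary material.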
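Filling in the captioning prompt for each class can be sketched as below. The prompt wording is an English rendering of the prompt quoted in the text and may not match the authors' exact string; `build_prompt` is a hypothetical helper, and the call to the LLM itself is left out.

```python
# English rendering of the captioning prompt; the exact original wording is an assumption.
PROMPT_TEMPLATE = (
    "Generate a concise description of the ImageNet class {class_name}, "
    "without using the word itself. The description must include visual cues "
    "that help recognize the subject, with low-level and precise details. "
    "Do not include anything other than the description in your answer."
)

def build_prompt(class_name):
    """Insert the class name of the novel concept into the prompt template."""
    return PROMPT_TEMPLATE.format(class_name=class_name)

for concept in ["moongate", "tonometer", "gyroscope"]:
    print(build_prompt(concept))
```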
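Inversion. We run inversion for 5000 steps with a cosine learning-rate annealing schedule. For the regularization term we use the default value $\alpha=0.005$ [24]. The augmentations consist of random affine transformations (rotation between -30 and +30 degrees, 10% translation, and scaling between 70% and 100% of the image size), applied with probability 0.5. Examples of inverted images are shown in Fig. 3. For each concept, we generate 10 inverted samples.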
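The inversion schedule and augmentation ranges described above can be sketched with the standard library alone. The base learning rate, the minimum learning rate, and the function names are hypothetical (the text does not state the inversion learning rate); the step count, rotation, translation, scale, and probability follow the values given in the text. Applying the sampled parameters to an image is left out.

```python
import math
import random

def cosine_lr(step, total_steps=5000, base_lr=0.1, min_lr=0.0):
    """Cosine annealing from base_lr to min_lr; base_lr and min_lr are hypothetical."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos

def sample_affine(rng, p=0.5):
    """Sample random-affine parameters with probability p, else the identity."""
    if rng.random() >= p:
        return {"angle": 0.0, "translate": (0.0, 0.0), "scale": 1.0}
    return {
        "angle": rng.uniform(-30.0, 30.0),                            # degrees
        "translate": (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),  # fraction of image size
        "scale": rng.uniform(0.7, 1.0),                               # fraction of image size
    }

rng = random.Random(0)
for step in (0, 2500, 5000):
    print(step, cosine_lr(step), sample_affine(rng))
```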
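Finetuning. Finetuning is performed with the InfoNCE loss (Eq. 2) to align the inverted images with the textual descriptions. We finetune only the visual encoder and keep the text encoder frozen, because we want to align the features extracted by the visual encoder to those extracted by the text encoder. In most experiments we run a quick finetuning of only one epoch, with a small learning rate between $10^{-6}$ and $10^{-4}$. For CLIP-based models we generally adopt a weight decay of 0.2, as in [45]. More details are given in the description of each experiment.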

Model	Concept	Metric	Baseline	LR 1e-5	LR 2e-5	LR 3e-5	LR 4e-5	LR 5e-5
CLIP ViT-B/32 [45]	Moongate	Target Acc.	0%	10%	60%	90%	100%	100%
		ImageNet 0-shot	58.10%	57.78%	56.43%	53.95%	50.37%	42.30%
	Tonometer	Target Acc.	50%	80%	80%	100%	100%	100%
		ImageNet 0-shot	58.10%	57.52%	55.62%	51.98%	42.80%	23.73%
	Gyroscope	Target Acc.	90%	100%	100%	100%	100%	100%
		ImageNet 0-shot	58.10%	57.86%	56.84%	53.96%	48.28%	34.48%
CLIP ViT-L/14 [45]	Moongate	Target Acc.	78.95%	78.95%	100%	100%	100%	100%
		ImageNet 0-shot	70.79%	70.74%	70.51%	69.96%	68.57%	62.35%
	Tonometer	Target Acc.	31.58%	52.63%	78.95%	100%	100%	100%
		ImageNet 0-shot	70.79%	70.74%	70.61%	70.08%	69.06%	66.92%
	Gyroscope	Target Acc.	90%	90%	100%	100%	100%	100%
		ImageNet 0-shot	70.79%	70.65%	70.42%	69.84%	69.39%	68.35%
ViLT [26]	Moongate	Target Acc.	0%	0%	0%	0%	0%	0%
		ImageNet* 0-shot	23.74%	23.90%	24.02%	24.16%	24.18%	24.16%
	Tonometer	Target Acc.	10%	30%	30%	30%	40%	40%
		ImageNet* 0-shot	23.74%	23.88%	24.02%	24.04%	24.22%	23.94%
	Gyroscope	Target Acc.	50%	60%	50%	50%	40%	30%
		ImageNet* 0-shot	23.74%	23.80%	23.88%	23.72%	23.38%	23.12%

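Table 1: Knowledge Transfer on novel and rare concepts (CLIP and ViLT). * For ViLT we use ImageNet-100 [22], because of the computational cost of evaluating all possible image-caption pairs for zero-shot classification.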

4.3 Learning novel concepts

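In the first part of our experiments we focus on learning novel concepts unknown to the model. As a first demonstration of Knowledge Transfer we use the RareConcepts dataset, which consists of three rare classes: moongate, tonometer, and gyroscope. These classes were selected by manually probing CLIP-based models with various potentially rare concepts from the web. We apply Knowledge Transfer to two variants of CLIP, base and large (based on ViT-B/32 and ViT-L/14), and to ViLT [26], a shared-parameter architecture based on ViT-B/32. For CLIP, we use the official pre-trained models released by OpenAI (footnote 2: https://github.com/openai/CLIP), and the inversion and finetuning setup is the one described in Section 4.2. For ViLT, we use a slightly different approach to accommodate the architectural differences. To perform inversion, we start from an input pair $<x_{t},\hat{x}^{*}_{v}\sim N(0;1)>$ composed of the text caption and random noise. We then optimize $\hat{x}^{*}_{v}$ by maximizing the image-text matching (ITM) score computed by the ITM head of ViLT [26]. The rest of the setup is the same as for CLIP. More details are provided in the supplementary material.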

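The results are shown in Table 1. We evaluate zero-shot classification accuracy before and after applying Knowledge Transfer. Some of the captions used for inversion are shown in Fig. 3; all captions are provided in the supplementary material. We first note that the pre-trained models, reported as baseline, differ in which concepts are unknown to them: CLIP ViT-B/32 struggles with moongate (0% accuracy) and tonometer, CLIP ViT-L/14 struggles with tonometer, and ViLT shows the lowest accuracy overall, struggling most with moongate and tonometer. We report the results of Knowledge Transfer at different finetuning learning rates. Overall, the models successfully learn the novel concepts, as shown by the improved zero-shot classification accuracy on each concept. Furthermore, to assess whether finetuning leads to catastrophic forgetting of prior knowledge [28], we also report the ImageNet accuracy of each finetuned model. With an appropriate choice of learning rate, we improve target accuracy while maintaining comparable results on ImageNet. Notably, some concepts reach 100% while ImageNet results remain comparable (e.g. with CLIP base and large). The only case in which target accuracy does not improve is ViLT on moongate, which stays at 0%. For ViLT, however, we observe a slight improvement on ImageNet-100 for all concepts, suggesting that Knowledge Transfer may lead to improved representations of existing concepts.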
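The zero-shot evaluation used for the target and ImageNet accuracies can be sketched as follows for CLIP-style models: each image is assigned to the class whose text embedding is most similar. This assumes cosine similarity and precomputed embeddings; the function names are illustrative, and the class-prompt construction is not shown.

```python
import numpy as np

def zero_shot_predict(img_emb, class_txt_emb):
    """Assign each image to the class with the highest cosine similarity.

    img_emb:       (N, n) image embeddings
    class_txt_emb: (C, n) one text embedding per class prompt
    """
    v = np.asarray(img_emb, dtype=float)
    t = np.asarray(class_txt_emb, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return (v @ t.T).argmax(axis=1)

def accuracy(pred, labels):
    """Fraction of predictions equal to the ground-truth labels."""
    return float((np.asarray(pred) == np.asarray(labels)).mean())

# Toy usage: three classes with orthogonal text embeddings.
classes = np.eye(3)
images = np.array([[2.0, 0.1, 0.0], [0.0, 0.5, 3.0]])
pred = zero_shot_predict(images, classes)
print(pred, accuracy(pred, [0, 1]))
```

Running the same function before and after finetuning, on the target concept images and on a held-out benchmark, gives the two rows reported per concept in Table 1.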

4.3.1 Ablation study

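We conduct an ablation study on our finetuning strategy. In our experiments we freeze the text encoder during finetuning and train only the visual encoder. Here we evaluate finetuning under different configurations. The results are shown in Fig. 4. When finetuning both encoders, we observe a rapid collapse of both target accuracy and ImageNet accuracy for all concepts. A similar trend appears when finetuning only the text encoder while keeping the visual encoder frozen. This is expected, however: our assumption is that the knowledge contained in the text encoder is already sufficient to represent the target concept, and we only want to align the visual features to it. Moreover, changing the weights of the text encoder can break the correspondence between the captions and the inverted images, leading to degenerate cases. Additional ablation studies are provided in the supplementary material.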

4.3.2 Experiments with MedCLIP

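Next, we apply Knowledge Transfer to medical images. Medical imaging is a natural fit for Knowledge Transfer, because existing medical knowledge in textual form (e.g. from medical textbooks and encyclopedias) can be leveraged to accurately describe the concept and visual appearance of different pathologies on images such as chest X-rays (CXR), computed tomography (CT) scans, magnetic resonance imaging (MRI), and ultrasound. Our experiments are based on the MedCLIP architecture [56], a CLIP-based model that adopts BioClinicalBERT (footnote 3: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) as the text encoder backbone and a Swin Transformer [36] as the visual encoder. MedCLIP is pre-trained on the large-scale MIMIC-CXR [21] and CheXpert [19] datasets, which contain CXR images and radiology reports. The concepts included in these datasets are atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, pleural effusion, pneumonia, pneumothorax, and support devices. In our experiments, we aim to introduce the concepts of benign and malignant nodules in CXR into MedCLIP. We follow the same experimental protocol as for CLIP, and measure the performance achieved by Knowledge Transfer on the external dataset JSRT [52].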

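Zero-shot classification results on JSRT are shown in Table 2. The captions used for inversion are provided in the supplementary material. As before, to identify cases of catastrophic forgetting we measure zero-shot accuracy on prior knowledge using CheXpert-5x200c. We improve the detection accuracy for malignant nodules from the 83.93% baseline to 92.86% (learning-rate multiplier $\times$ 2), while maintaining comparable results on the source dataset CheXpert-5x200c. For benign nodules we struggle to improve accuracy, probably because the features of benign nodules on CXR images are harder to discern than those of malignant ones. However, as noted in the previous experiments, accuracy on the source dataset increases slightly, suggesting an improvement in the model's representations.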

			Learning Rate (multiplier)
Concept		Baseline	$\times$ 1	$\times$ 2	$\times$ 3	$\times$ 4	$\times$ 5
Benign Nodule	Target Acc. (base lr 1e-5)	54.55%	54.55%	54.55%	54.55%	54.55%	54.55%
	CheXpert-5x200c 0-shot	62.10%	61.80%	62.30%	62.10%	62%	62.20%
Lung Cancer	Target Acc. (base lr 1e-4)	83.93%	87.50%	92.86%	94.64%	92.86%	92.86%
	CheXpert-5x200c 0-shot	62.10%	62.20%	61.50%	53.70%	48.20%	44.50%

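Table 2: Knowledge Transfer to MedCLIP on the JSRT dataset. The model successfully learns the novel concept of malignant nodules (lung cancer) on CXR images. Benign nodules, on the other hand, are harder to distinguish visually from other CXR findings.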

Model	Setting	Atelectasis	Cardiomegaly	Consolidation	Edema	Pleural Effusion	Top-1
MedCLIP (ViT)	Reference	49%	69.50%	32.50%	75.50%	84%	62.10%
CLIP ViT-B/32	Baseline	0%	2.5%	0%	0%	94.50%	19.40%
	Transfer	0%	21.5%	0%	0%	85%	21.30%
CLIP ViT-L/14	Baseline	59.50%	16.50%	0%	0%	35.50%	22.40%
	Transfer	4%	32.5%	0%	0%	92.5%	25.90%

Table 3: Learning novel concepts in a different domain (from natural images to medical images) shows potential. Tested on CheXpert-5x200c.

4.3.3 Out of domain Knowledge Transfer

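Finally, we evaluate the potential of Knowledge Transfer to introduce novel concepts outside the training domain. Specifically, we aim to introduce medical concepts into a model trained on natural images. To this end, we finetune CLIP models on all five CheXpert classes (atelectasis, cardiomegaly, consolidation, edema, pleural effusion).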

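The results are shown in Table 3. We report the performance of MedCLIP as a reference for a model trained on CheXpert. Looking at top-1 accuracy, both versions of CLIP improve, with a larger gain for the large variant, from 22.40% to 25.90%. However, the per-class breakdown reveals that i.) classes with a starting accuracy of 0% do not improve, and ii.) performance degrades on some classes (namely pleural effusion for ViT-B/32 and atelectasis for ViT-L/14). This may be due to the domain gap between the model's prior knowledge (natural images) and the features specific to the medical domain. Nevertheless, with this limitation in mind, Knowledge Transfer shows potential for zero-shot out-of-domain generalization.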

4.4 Improving zero-shot downstream tasks

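In the second part of our experiments we focus on improving the performance of zero-shot downstream tasks, on both novel and known concepts. Specifically, we target segmentation, image-text retrieval, and captioning.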

	Lung Nodules^†			Lung Pneumothorax^†			Breast Ultrasound			Brain MRI
Model	DSC	NSD	IoU	DSC	NSD	IoU	DSC	NSD	IoU	DSC	NSD	IoU
MedCLIP-SAMv2	14.83%	17.30%	8.64%	6.30%	7.61%	3.75%	56.25%	59.44%	47.81%	17.20%	20.97%	12.05%
Transf. (1e-5)	13.95%	17.45%	8.75%	6.28%	7.59%	3.77%	58.23%	61.56%	49.52%	15.90%	19.36%	11.10%
Transf. (2e-5)	14.10%	17.65%	8.83%	6.41%	7.76%	3.83%	54.36%	57.30%	46.30%	18.13%	22.26%	12.62%
Transf. (1e-4)	14.35%	18.03%	9.04%	6.02%	7.29%	3.59%	-	-	-	-	-	-

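Table 4: Improvements in zero-shot segmentation. ^† denotes novel concepts not included in the original MedCLIP-SAMv2 training data [29]. The prompts used for segmentation are: P1 A medical chest CT scan showing round spots of varying size within the lungs, suggesting benign or malignant nodules; P2 A medical chest X-ray showing an abnormal collection of air in the pleural space, suggesting pneumothorax; P3 A medical breast mammogram showing an irregularly shaped, spiculated mass, suggesting a malignant breast tumor; P4 A brain MRI showing a bright or dark mass with irregular margins, suggesting a brain tumor or glioma.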

	Flickr30k (1K)
	Text Retrieval			Image Retrieval
Model	R@1	R@5	R@10	R@1	R@5	R@10
ViLBERT [38]	-	-	-	31.9%	61.1%	72.8%
Unicoder-VL [31]	64.3%	85.8%	92.3%	48.4%	76.0%	85.2%
ImageBERT [44]	70.7%	90.2%	94.0%	54.3%	79.6%	87.5%
ViLT-B/32 (original) [26]	73.2%	93.6%	96.5%	55.0%	82.5%	89.8%
ViLT-B/32 (huggingface)	73.8%	93.5%	96.5%	57.3%	83.9%	90.4%
ViLT-B/32 (transf. 9e-7)	74.6%	93.8%	96.4%	57.8%	84.0%	90.5%
ViLT-B/32 (transf. 2e-6)	74.6%	93.7%	96.5%	57.8%	84.0%	90.5%

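Table 5: Text and image retrieval on Flickr30k. Recall scores are reported at top-1, top-5, and top-10. Our results are based on the huggingface ViLT. The original results and other comparisons are taken from [26].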

	MSCOCO (5K)
Model	BLEU@4	METEOR	CIDEr	SPICE
CLIP-ViL [51]	40.2	29.7	134.2	23.8
BLIP [32]	40.4	-	136.7	-
VinVL [64]	41.0	31.1	140.9	25.4
SimVLM [57]	40.6	33.7	143.3	25.4
LEMON [16]	41.5	30.8	139.1	24.1
CoCa [61] (proprietary)	40.9	33.9	143.6	24.7
CoCa	6.9	12.8	31.1	9.1
CoCa (transf. 8e-5)	17.9	19.4	60.8	13.7
CoCa FT	34.9	29.7	123.1	23.5
CoCa FT (transf. 5e-6)	35.2	29.8	124.0	23.3

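Table 6: Image captioning on MSCOCO. CoCa refers to the baseline model pre-trained on LAION-2B [49], and CoCa FT to the model finetuned for captioning on MSCOCO. The best results and the improvements obtained with Knowledge Transfer are highlighted in bold.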

4.4.1 Segmentation

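For segmentation, we adopt the zero-shot method MedCLIP-SAMv2 [29, 30]. MedCLIP-SAMv2 works by computing activation maps from a pre-trained CLIP model and using them as queries for the Segment Anything Model (SAM) [27]. The activation maps are computed with Multi-Modal Information Bottleneck Attribution (M2IB) [55] from the target image and a query prompt. In this work we aim to use Knowledge Transfer to improve the quality of the activation maps for different concepts, which should in turn improve the accuracy of the final segmentation. We target four segmentation tasks: lung nodule segmentation in CT images (UnitoChest), pneumothorax segmentation in CXR images (SIIM Pneumothorax), breast nodule segmentation in ultrasound images (UDIAT), and glioma segmentation in MRI (BraTS23).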

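The overall results across all segmentation tasks are shown in Table 4. The captions used for inversion are provided in the supplementary material. To compute the M2IB activation maps with the finetuned model, we use descriptive prompts such as those proposed in [30]. The prompts are listed in Table 4 as P1 to P4, one per task. We also report the reference results of MedCLIP-SAMv2 for each task. Compared with the original MedCLIP-SAMv2 setting, lung nodules and pneumothorax are completely novel concepts. The brain glioma class also differs slightly from the original brain tumor task, as explained in the supplementary material. To evaluate segmentation quality we use three metrics: the Dice-Sørensen coefficient (DSC), the normalized surface distance (NSD), and the intersection over union (IoU). We report results for different values of the finetuning learning rate. Across all tasks we observe improvements in segmentation metrics, most notably on breast ultrasound (NSD from 59.44% to 61.56%) and brain MRI (NSD from 20.97% to 22.26%). For lung nodules and pneumothorax the improvements are less pronounced, probably because the novelty of these tasks makes improvement harder in the MedCLIP-SAM setting.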
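For reference, two of the three segmentation metrics can be computed on binary masks as in the sketch below. This is a generic NumPy formulation, not the evaluation code used for Table 4; NSD is omitted because it requires surface extraction and a distance tolerance that the text does not specify. The `eps` term is an illustrative guard against empty masks.

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice-Sorensen coefficient: 2|P ∩ T| / (|P| + |T|)."""
    p = np.asarray(pred).astype(bool)
    t = np.asarray(target).astype(bool)
    inter = np.logical_and(p, t).sum()
    return float(2.0 * inter / (p.sum() + t.sum() + eps))

def iou(pred, target, eps=1e-8):
    """Intersection over union: |P ∩ T| / |P ∪ T|."""
    p = np.asarray(pred).astype(bool)
    t = np.asarray(target).astype(bool)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    return float(inter / (union + eps))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice(pred, target), iou(pred, target))
```

For a single pair of masks the two metrics are related by DSC = 2 IoU / (1 + IoU), so they rank individual predictions identically; averages over a dataset can still differ.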

4.4.2 Text and image retrieval

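We run text and image retrieval experiments on the Flickr30k dataset. For these experiments we use the huggingface version (footnote 4: https://huggingface.co/dandelin/vilt-b32-mlm-itm) of ViLT [26]. To finetune ViLT with Knowledge Transfer, we use captions of common concepts that may help improve the model's general knowledge. To this end, we use the 80 object categories of MSCOCO as target concepts and generate their captions with ChatGPT-4, following the procedure introduced in Section 4.2. All captions are provided in the supplementary material. We generate 10 inverted images per caption, for a total of 800 inverted images. Finetuning is performed as in Section 4.3, by maximizing the ITM score for positive pairs and minimizing it for negative pairs.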

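Table 5 shows the zero-shot text and image retrieval results before and after Knowledge Transfer. For comparison, we also report the original ViLT results from [26], along with other relevant baselines. Results are given as recall (denoted R) computed at different levels (top-1, top-5, and top-10). Knowledge Transfer improves the results on both image and text retrieval for almost all metrics, with text R@10 essentially unchanged. In particular, on text retrieval R@1 rises from 73.8% to 74.6%, a gain of 0.8 percentage points. Additional settings are available in the supplementary material.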
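Recall at K, as reported in Table 5, can be sketched as follows: a query counts as a hit when at least one of its ground-truth matches appears among the K highest-scoring candidates. The similarity matrix is assumed to be precomputed (for ViLT it would hold ITM scores); the function name and argument layout are illustrative.

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Fraction of queries with at least one correct candidate in the top k.

    sim: (Q, N) similarity of each query to each candidate
    gt:  list of Q sets with the indices of the correct candidates
    """
    sim = np.asarray(sim, dtype=float)
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & set(g)) > 0 for row, g in zip(topk, gt)]
    return float(np.mean(hits))

# Toy usage: two queries, three candidates.
sim = [[0.9, 0.1, 0.3],
       [0.2, 0.3, 0.8]]
gt = [{0}, {1}]
print(recall_at_k(sim, gt, 1), recall_at_k(sim, gt, 2))
```

In text retrieval each image query has five correct captions in Flickr30k, which is why `gt` holds sets rather than single indices.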

4.4.3 Captioning

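We run captioning experiments on the MSCOCO dataset. For this task we adopt the CoCa architecture [61], a state-of-the-art captioner. Specifically, since the original model is proprietary, we use the open-source version released by LAION (footnote 5: https://github.com/mlfoundations/open_clip). CoCa is built by adding an autoregressive text decoder to a CLIP model, so during finetuning we jointly apply the InfoNCE loss and the captioning loss [61]. The captioning loss aims to predict the next token $y_{t}$ given the previous tokens $y_{<t}$ and the image $x$. As captions, we use a simple set of templates such as "A photo of X", with several variations; they are listed in the supplementary material.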
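The captioning loss is a standard autoregressive negative log-likelihood. The sketch below computes it from decoder logits that are assumed to be already conditioned on the image and the previous tokens; the decoder itself, and the weighting between this term and the InfoNCE term during joint finetuning, are not specified in the text and are left out.

```python
import numpy as np

def caption_nll(logits, tokens):
    """Mean negative log-likelihood of the caption tokens.

    logits: (T, V) row t scores the vocabulary for token y_t, assumed to be
            produced by a decoder conditioned on the image x and on y_<t
    tokens: (T,) ground-truth token indices
    """
    logits = np.asarray(logits, dtype=float)
    tokens = np.asarray(tokens)
    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(tokens)), tokens].mean())

# Toy usage: uniform logits over a vocabulary of 4 give a loss of log(4).
print(caption_nll(np.zeros((3, 4)), [0, 1, 2]))
```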

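The results are shown in Table 6. We report different evaluation metrics (BLEU, METEOR, CIDEr, SPICE), computed with the standard pycocoevalcap package (footnote 6: https://github.com/salaniz/pycocoevalcap). We experiment with two variants of CoCa: one pre-trained on LAION-2B [49], and one further finetuned for captioning on MSCOCO. For comparison, we also report the reference results of the proprietary CoCa [61], along with other methods. With Knowledge Transfer we improve almost all metrics for CoCa FT, reaching 35.2 BLEU@4. The most notable result is obtained with the pre-trained-only CoCa, where all metrics improve substantially and some more than double (e.g. BLEU@4 from 6.9 to 17.9). We stress again that this model was not originally trained for captioning on MSCOCO, and that the improvement comes from Knowledge Transfer alone, without using any real images. Because the performance of the open CoCa does not match the proprietary results reported in the original paper, we cannot reach state-of-the-art results, but the improvement brought by Knowledge Transfer is noteworthy.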

5 Conclusions and Future Works

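We present Knowledge Transfer, a method for learning novel visual concepts using only their textual description. Through an extensive evaluation, we show that Knowledge Transfer can successfully introduce novel concepts into pre-trained models without hurting performance on existing tasks. We also show that Knowledge Transfer improves results on downstream zero-shot tasks such as segmentation, text-image retrieval, and captioning, and shows potential for out-of-domain generalization, for example on medical images. The proposed method is based on model inversion, which synthesizes ideal images for the target concept; these images are then used to finetune the model in an image-text matching fashion, as in CLIP [45]. Our method leverages the prior knowledge of pre-trained models, with the aim of aligning known low-level visual features to novel high-level concepts. Although this paper focuses on Explicit Knowledge Transfer, we hypothesize that Implicit Knowledge Transfer can also be achieved by exploiting multimodal neurons. Future work will focus on this topic.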

References

Adewole et al. [2023] Maruf Adewole, Jeffrey D Rudie, Anu Gbdamosi, Oluyemisi Toyobo, Confidence Raymond, Dong Zhang, Olubukola Omidiji, Rachel Akinola, Mohammad Abba Suwaid, Adaobi Emegoakor, et al. The brain tumor segmentation (brats) challenge 2023: glioma segmentation in sub-saharan africa patient population (brats-africa). ArXiv, 2023.
AI@Meta [2024] AI@Meta. Llama 3 model card. 2024.
Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
Barbano et al. [2023] Carlo Alberto Barbano, Benoit Dufumier, Enzo Tartaglione, Marco Grangetto, and Pietro Gori. Unbiased supervised contrastive learning. In The Eleventh International Conference on Learning Representations, 2023.
Chaudhry et al. [2022] Hafiza Ayesha Hoor Chaudhry, Riccardo Renzulli, Daniele Perlo, Francesca Santinelli, Stefano Tibaldi, Carmen Cristiano, Marco Grosso, Giorgio Limerutti, Attilio Fiandrotti, Marco Grangetto, et al. Unitochest: A lung image dataset for segmentation of cancerous nodules on ct scans. In International Conference on Image Analysis and Processing, pages 185–196. Springer, 2022.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
Chen et al. [2023] Yunhao Chen, Zihui Yan, and Yunjie Zhu. A unified framework for generative data augmentation: A comprehensive survey. arXiv preprint arXiv:2310.00277, 2023.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
Gatys et al. [2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
Girdhar et al. [2023] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
Goh et al. [2021] Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 6(3):e30, 2021.
Gupta et al. [2016] Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2827–2836, 2016.
Held et al. [2011] Richard Held, Yuri Ostrovsky, Beatrice de Gelder, Tapan Gandhi, Suma Ganesh, Umang Mathur, and Pawan Sinha. The newly sighted fail to match seen with felt. Nature neuroscience, 14(5):551–553, 2011.
Hu et al. [2021a] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021a.
Hu et al. [2021b] Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pretraining for image captioning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17959–17968, 2021b.
Huang et al. [2021] Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021.
Huo et al. [2024] Fushuo Huo, Wenchao Xu, Jingcai Guo, Haozhao Wang, and Song Guo. C2kd: Bridging the modality gap for cross-modal knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16006–16015, 2024.
Irvin et al. [2019] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, pages 590–597, 2019.
Jahanian et al. [2022] Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola. Generative models as a data source for multiview representation learning. In International Conference on Learning Representations, 2022.
Johnson et al. [2019] Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019.
[22] Kaggle. ImageNet100. https://www.kaggle.com/datasets/ambityga/imagenet100. [Accessed Nov. 2024].
Karpathy and Fei-Fei [2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
Kazemi et al. [2024] Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, and Tom Goldstein. What do we learn from inverting clip models? arXiv preprint arXiv:2403.02580, 2024.
Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
Kim et al. [2021] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International conference on machine learning, pages 5583–5594. PMLR, 2021.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Koleilat et al. [2024a] Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-sam: Bridging text and image towards universal medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 643–653. Springer, 2024a.
Koleilat et al. [2024b] Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation. arXiv preprint arXiv:2409.19483, 2024b.
Li et al. [2020] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):11336–11344, 2020.
Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
Li et al. [2021] Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang. Unsupervised vision-and-language pre-training without parallel images and captions. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5339–5350, Online, 2021. Association for Computational Linguistics.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Lin et al. [2023] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19325–19337, 2023.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
Locke [1948] John Locke. An essay concerning human understanding, 1690. 1948.
Lu et al. [2019] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
Mo and Morgado [2024] Shentong Mo and Pedro Morgado. Unveiling the power of audio-visual early fusion transformers with dense interactions through masked modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27186–27196, 2024.
Mordvintsev et al. [2015] A. Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks. 2015.
Nukrai et al. [2022] David Nukrai, Ron Mokady, and Amir Globerson. Text-only training for image captioning using noise-injected clip. arXiv preprint arXiv:2211.00575. The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
OpenAI [2024] OpenAI. Chatgpt, 2024. Nov 2024 Version.
Pan et al. [2023] Haowen Pan, Yixin Cao, Xiaozhi Wang, and Xun Yang. Finding and editing multi-modal neurons in pre-trained transformer. arXiv preprint arXiv:2311.07470, 2023.
Qi et al. [2020] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data, 2020.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Sandfort et al. [2019] Veit Sandfort, Ke Yan, Perry J Pickhardt, and Ronald M Summers. Data augmentation using generative adversarial networks (cyclegan) to improve generalizability in ct segmentation tasks. Scientific reports, 9(1):16884, 2019.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
Schwettmann et al. [2023] Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, and Antonio Torralba. Multimodal neurons in pretrained text-only transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2854–2859, 2023.
Shen et al. [2022] Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can CLIP benefit vision-and-language tasks? In International Conference on Learning Representations, 2022.
Shiraishi et al. [2000] Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. American journal of roentgenology, 174(1):71–74, 2000.
Tang et al. [2021] Zineng Tang, Jaemin Cho, Hao Tan, and Mohit Bansal. Vidlankd: Improving language understanding via video-distilled knowledge transfer. Advances in Neural Information Processing Systems, 34:24468–24481, 2021.
[54] Radiopaedia Team. Radiopaedia. https://radiopaedia.org/. [Accessed Nov. 2024].
Wang et al. [2023] Ying Wang, Tim G. J. Rudner, and Andrew Gordon Wilson. Visual explanations of image-text representations via multi-modal information bottleneck attribution. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Wang et al. [2022a] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3876–3887, Abu Dhabi, United Arab Emirates, 2022a. Association for Computational Linguistics.
Wang et al. [2022b] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2022b.
Xu et al. [2020] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Ji-Liang Tang, and Anil K Jain. Adversarial attacks and defenses in images, graphs and text: A review. International journal of automation and computing, 17:151–178, 2020.
Yap et al. [2017] Moi Hoon Yap, Gerard Pons, Joan Marti, Sergi Ganau, Melcior Sentis, Reyer Zwiggelaar, Adrian K Davison, and Robert Marti. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE journal of biomedical and health informatics, 22(4):1218–1226, 2017.
Young et al. [2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022.
Zawacki et al. [2019] Anna Zawacki, Carol Wu, Shih George, Julia Elliott, Mikhail Fomitchev, Hussain Mohannad, Paras Lakhani, Phil Culliton, and Shunxing Bao. Siim-acr pneumothorax segmentation challenge. Kaggle, 2019.
Zhang et al. [2020] Huijuan Zhang, Zongrun Huang, and Zhongwei Lv. Medical image synthetic data augmentation using gan. In Proceedings of the 4th International Conference on Computer Science and Application Engineering, pages 1–6, 2020.
Zhang et al. [2021] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579–5588, 2021.
Zhou et al. [2023] Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on thin air: Improve image classification with generated data. arXiv preprint arXiv:2305.15316, 2023.

6 Impact and Limitations

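We believe that Knowledge Transfer could become an impactful technique for introducing novel concepts into pre-trained models. Overall, Knowledge Transfer is very cheap in terms of computational requirements, because it works by fine-tuning on just a handful of synthetic samples. It is therefore very fast and does not require large amounts of memory. In this sense, it may be considered comparable to parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) [15], which minimize the amount of memory required for fine-tuning. Compared to PEFT, however, Knowledge Transfer requires no real data beyond a single textual description for each novel concept. The main limitation of Explicit Knowledge Transfer lies in the inversion step, which is the most time-consuming part compared to fine-tuning. If this step could be avoided, near real-time learning of novel concepts with minimal computational requirements might become possible. This could enable the development of rapidly improving intelligent agents for many real-world applications. We hypothesize that this is possible with Implicit Knowledge Transfer, for example by using masked language modeling (MLM) as a proxy for Knowledge Transfer. However, we do not focus on this topic in this work, as our preliminary experiments (shown in Section 8.9) did not achieve satisfactory results compared to Explicit Knowledge Transfer.

Another limitation lies in the limited comparison with state-of-the-art approaches. However, to the best of our knowledge, no other work shares the same goal as ours.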

7 Knowledge Transfer

7.1 Possible improvements of Explicit Transfer

7.1.1 Relaxation of Eq. 3

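Computing $\hat{X}^{*}_{V}$ as in Eq. 3 can produce images that differ substantially from the training distribution of natural images, as shown in Fig. 3. Instead of inverting the entire visual encoder $f_{V}$, one can therefore invert only a subset of layers $\Psi_{V}\subset f_{V}$, starting from the top of the model: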

\hat{Z}^{*}_{V}=\Psi_{V}^{-1}(f_{T}(X_{T}))\approx\max_{\hat{Z}^{*}_{V}}sim(\Psi_{V}(\hat{Z}^{*}_{V}),f_{T}(X_{T}))+R(\hat{Z}^{*}_{V})

(4)

where $R$ is a regularization term, similar to style transfer [10], that encourages $\hat{Z}^{*}_{V}$ to resemble the intermediate representations of natural images.
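As a purely illustrative sketch of the relaxed objective in Eq. 4, the toy example below optimizes an intermediate representation so that a stand-in for the upper layers maps it close to a text embedding. Everything concrete here is an assumption made for illustration: $\Psi_{V}$ is replaced by a single linear map, $R$ by a quadratic penalty that pulls the representation toward a hypothetical mean of natural intermediate features, and gradients are estimated by finite differences rather than backpropagation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def psi(z, weights):
    """Toy stand-in for the upper layers Psi_V: one linear map (assumption)."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def objective(z, weights, text_emb, natural_mean, reg_weight):
    """sim(Psi_V(z), f_T(X_T)) + R(z), with a hypothetical quadratic R."""
    similarity = cosine(psi(z, weights), text_emb)
    regularizer = -reg_weight * sum((a - m) ** 2 for a, m in zip(z, natural_mean))
    return similarity + regularizer


def invert_upper_layers(weights, text_emb, natural_mean,
                        reg_weight=0.01, steps=200, lr=0.5, eps=1e-5, seed=0):
    """Finite-difference gradient ascent on the relaxed objective."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in natural_mean]
    start = best = objective(z, weights, text_emb, natural_mean, reg_weight)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            shifted = z[:i] + [z[i] + eps] + z[i + 1:]
            value = objective(shifted, weights, text_emb, natural_mean, reg_weight)
            grad.append((value - best) / eps)
        candidate = [a + lr * g for a, g in zip(z, grad)]
        value = objective(candidate, weights, text_emb, natural_mean, reg_weight)
        if value > best:
            z, best = candidate, value  # accept only improving steps
        else:
            lr *= 0.5  # otherwise shrink the step size
    return z, start, best


# Hypothetical toy values: a 3-d intermediate representation and a 2-d embedding.
WEIGHTS = [[1.0, 0.5, 0.0], [0.0, 1.0, -0.5]]
TEXT_EMB = [0.6, 0.8]
NATURAL_MEAN = [0.0, 0.0, 0.0]
z_hat, start_value, final_value = invert_upper_layers(WEIGHTS, TEXT_EMB, NATURAL_MEAN)
```

In an actual model, $\Psi_{V}$ would be the upper blocks of the visual encoder and the gradient would come from automatic differentiation; the sketch only shows how the similarity and regularization terms combine.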

7.2 Implicit Knowledge Transfer

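Although this work focuses on Explicit Knowledge Transfer, for completeness we briefly introduce the idea behind Implicit Knowledge Transfer. Multimodal neurons have been shown to exist in multimodal models [12, 50]. These neurons activate strongly for the same concept in either modality and can thus capture cross-modal representations. We hypothesize that, in shared-parameter architectures (e.g., early-fusion transformers [39, 26]), these neurons can be exploited for Knowledge Transfer, for example by performing simple masked language modeling on the description of a novel concept, effectively removing the need for model inversion. This requires an early-fusion architecture that can process a single modality independently. To the best of our knowledge, however, few large-scale pre-trained models currently meet these requirements, so we leave a detailed exploration of this direction to future work. The literature offers hints that training on different modalities independently can be effective, for example in the pre-training of U-VisualBERT [33]. More closely related to our work, the authors of [57] report cross-modal transfer capabilities in SimVLM; however, that model is proprietary, and we cannot reproduce their claims. We therefore focus on ViLT here and report a preliminary analysis in Section 8.9.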

7.3 Open questions

Q1 Domain Gap.

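Inverted images such as those shown in Fig. 3 look very different from natural images. Yet, as the results in this paper show, fine-tuning the model on these images improves results. Is there a domain gap between inverted and real images? Or does this point to a fundamental difference in how deep models process visual information? This phenomenon may be related to adversarial attacks [58].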

Q2 Generalizability of inversion

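An interesting point of analysis that may provide insight into Q1 is the generalizability of inverted images. For example, can images inverted with a specific model (e.g., CLIP) be used to train another model from scratch? Or are they "fitted" to work only with the specific model used for the inversion?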

Q3 Catastrophic Forgetting

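To what extent can catastrophic forgetting be prevented when applying Knowledge Transfer? In this work, we show that using generally low learning rates makes it possible to balance learning novel concepts against retaining prior information. There is still room for improvement, however. For example, LoRA [15] has been shown to help avoid catastrophic forgetting during fine-tuning, and applying it during Knowledge Transfer could further improve results. In addition, Implicit Transfer (in parameter-sharing models) may avoid catastrophic forgetting better than Explicit Transfer, for example by focusing on multimodal neurons.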

8 Experiments

8.1 CLIP on rare concepts

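We train with the Adam optimizer, a batch size of 4, a weight decay of 0.2, and a learning rate between 1e-5 and 5e-5, as listed in the tables of the main text. We train on 10 inverted images per concept. The captions used for inversion are listed in Table 12.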

8.2 Details about image inversion for ViLT

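For ViLT, unlike the image inversion approach used for CLIP, we disable the random affine augmentation, because it produced noisy inverted images. In addition, we use a weight decay of 0.01, matching the value used by the ViLT authors. As explained in the main text, our inversion process for ViLT maximizes the image-text matching (ITM) score computed by ViLT's ITM head [26]. This head outputs two values: one indicating a mismatch and the other a match. To optimize it, we use a cross-entropy loss during inversion, aiming to maximize the output corresponding to a match while minimizing the output for a mismatch. The captions used for ViLT inversion are listed in Table 13.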
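The cross-entropy objective on the two ITM outputs can be sketched as follows. This is a minimal illustration rather than the actual inversion code: the function name is hypothetical, the logits are made-up numbers, and in a real run the loss would be back-propagated to the pixels of the image being inverted.

```python
import math


def itm_inversion_loss(logit_no_match, logit_match):
    """Cross-entropy with target class 'match' for a two-way ITM head.

    Equals -log softmax([logit_no_match, logit_match])[match], computed
    with the log-sum-exp trick for numerical stability.
    """
    top = max(logit_no_match, logit_match)
    log_norm = top + math.log(
        math.exp(logit_no_match - top) + math.exp(logit_match - top)
    )
    return log_norm - logit_match


# Hypothetical logits for three candidate images paired with the same caption.
loss_confident = itm_inversion_loss(-2.0, 3.0)  # head already predicts "match"
loss_uncertain = itm_inversion_loss(0.0, 0.0)   # head is undecided
loss_wrong = itm_inversion_loss(3.0, -2.0)      # head predicts "no match"
```

Minimizing this loss raises the match logit relative to the mismatch logit, which is the behavior described above.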

8.3 Ablation study

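Here we report an additional ablation study, focusing on how captions are constructed for fine-tuning. As explained in Section 3.1.2, during fine-tuning we prepend the concept name to each caption, e.g., "A moongate is […]". We show why this is necessary by comparing captions with and without the prepended name. The results are shown in Fig. 5. As can be observed, using the concept name during fine-tuning is necessary to map the visual features to their textual description.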

8.4 MedCLIP

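For MedCLIP, we use the same setup as for CLIP on rare concepts (see Section 8.1). Specifically, we use Adam with a batch size of 4 and a weight decay of 0.2, and 10 inverted images per concept. The descriptions used for inversion with MedCLIP are listed in Table 14.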

8.5 CLIP on medical images (out of domain)

For ViT-B/32, we train for 5 epochs with a learning rate of 5e-5 and a batch size of 8. For ViT-L/14, we train for 2 epochs with a learning rate of 1e-5 and a batch size of 4. The captions used for inversion are listed in Table 15.

	Lung Nodules^†			Lung Pneumothorax^†			Breast Ultrasound			Brain MRI
Model	DSC	NSD	IoU	DSC	NSD	IoU	DSC	NSD	IoU	DSC	NSD	IoU
MedCLIP-SAMv2	14.83%	17.30%	8.64%	6.30%	7.61%	3.75%	56.25%	59.44%	47.81%	17.20%	20.97%	12.05%
Transf. (1e-5)	13.95%	17.45%	8.75%	6.28%	7.59%	3.77%	58.23%	61.56%	49.52%	15.90%	19.36%	11.10%
Transf. (2e-5)	14.10%	17.65%	8.83%	6.41%	7.76%	3.83%	54.36%	57.30%	46.30%	18.13%	22.26%	12.62%
Transf. (3e-5)	14.10%	17.65%	8.85%	6.25%	7.55%	3.73%	55.70%	59.00%	47.49%	15.47%	18.85%	10.78%
Transf. (4e-5)	14.25%	17.85%	8.94%	6.24%	7.57%	3.71%	53.86%	56.82%	45.61%	15.26%	18.63%	10.62%
Transf. (5e-5)	14.20%	17.78%	8.92%	6.20%	7.51%	3.70%	54.90%	57.97%	46.09%	16.22%	19.81%	11.34%
Transf. (1e-4)	14.35%	18.03%	9.04%	6.02%	7.29%	3.59%	-	-	-	-	-	-
Transf. (2e-4)	10.74%	13.64%	6.66%	4.71%	5.54%	2.86%	-	-	-	-	-	-

Table 7: Complete results for zero-shot segmentation with MedCLIP-SAMv2.

8.6 Segmentation

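Table 7 reports the results of Knowledge Transfer on MedCLIP-SAMv2 at different learning rates. Illustrative examples of the improvements achieved with Knowledge Transfer are shown in Figs. 6 and 7. The captions used for inversion for segmentation are listed in Table 16.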
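For readers less familiar with the overlap metrics in Table 7, the sketch below computes the Dice similarity coefficient (DSC) and intersection over union (IoU) from binary masks using their standard definitions. The helper name and the toy masks are illustrative assumptions, and the evaluation code behind Table 7 may differ in details such as the handling of empty masks. NSD is omitted because it additionally depends on boundary distances and a tolerance.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated as a perfect match here (a convention,
        # assumed for this sketch).
        return 1.0, 1.0
    dsc = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dsc, iou


# Toy 6-pixel masks: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred_mask = [1, 1, 0, 0, 1, 0]
target_mask = [1, 0, 0, 0, 1, 1]
dsc, iou = dice_and_iou(pred_mask, target_mask)
```

The two metrics are monotonically related (DSC = 2·IoU / (1 + IoU)), so they tend to rank the learning rates within a column similarly.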

Differences in downstream tasks

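As noted in the main text, lung nodule and pneumothorax segmentation are novel tasks on which MedCLIP-SAMv2 was not pre-trained. For brain tumors, we use the BraTS 2023 glioma dataset, which contains brain gliomas of adult patients. Compared with the original brain tumor performance reported in [30], we notice a large gap. However, the image preprocessing differs substantially: the BraTS 2023 data is more heavily preprocessed than that of [30] (e.g., skull stripping). At the time of writing, we could not compare MedCLIP-SAMv2 on the original data because the details of the data split were unknown.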

			Flickr30k (1K)
			Text Retrieval			Image Retrieval
Model	LR	Batch Size	R@1	R@5	R@10	R@1	R@5	R@10
ViLT-B/32 (huggingface)	-	-	73.8%	93.5%	96.5%	57.3%	83.9%	90.4%
ViLT-B/32	8e-7	32	74.5%	93.8%	96.4%	57.7%	84.0%	90.4%
ViLT-B/32	9e-7	32	74.6%	93.8%	96.4%	57.8%	84.0%	90.5%
ViLT-B/32	1e-6	16	74.4%	93.8%	96.5%	57.7%	84.1%	90.5%
ViLT-B/32	2e-6	128	74.6%	93.7%	96.5%	57.8%	84.0%	90.5%
ViLT-B/32	3e-6	256	74.5%	93.9%	96.5%	57.7%	83.9%	90.5%
ViLT-B/32	4e-6	32	73.8%	93.6%	96.5%	57.4%	84.0%	90.5%
ViLT-B/32	5e-6	256	74.5%	93.9%	96.5%	57.6%	84.0%	90.5%
ViLT-B/32	8e-6	32	73.2%	93.7%	96.1%	57.4%	83.7%	90.4%
ViLT-B/32	1e-5	128	74.4%	93.8%	96.8%	56.8%	83.7%	90.6%
ViLT-B/32	2e-5	32	71.8%	93.2%	96.4%	56.7%	83.6%	90.4%
ViLT-B/32	3e-5	32	70.8%	92.1%	95.7%	56.0%	82.9%	90.2%

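Table 8: Complete results for text and image retrieval on Flickr30k with ViLT. The first section reports the baseline results; the second section reports, for each tested learning rate, the results with its best batch size (chosen among 16, 32, 64, 128, and 256). Recall at top-1, top-5, and top-10 is reported.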

8.7 Text-image retrieval

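This section presents the complete results for the text and image retrieval tasks on Flickr30k with ViLT. Table 8 extends Table 5: in addition to the pre-trained baseline from huggingface, it reports the results of the experiments we ran while tuning the learning rate and batch size. For each learning rate, we report the result with the best batch size. In this setting, our method is most effective with small learning rates. The captions (MSCOCO) used for inversion are listed in Table 17.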
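The R@K columns in Table 8 follow the usual definition of recall at K: the fraction of queries for which at least one correct item appears among the K highest-scoring candidates. A minimal sketch is given below; the function name, the similarity matrix, and the ground-truth sets are illustrative assumptions rather than the evaluation code used for the table.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries with a correct candidate in the top-k.

    similarity[i][j] is the score of query i against candidate j;
    ground_truth[i] is the set of correct candidate indices for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries scored against 3 candidates.
similarity = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 0 is ranked second
]
ground_truth = [{0}, {1}, {0}]
r_at_1 = recall_at_k(similarity, ground_truth, 1)
r_at_2 = recall_at_k(similarity, ground_truth, 2)
```

For text retrieval the queries are images and the candidates are captions (several of which can be correct for one image); for image retrieval the roles are swapped.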

8.8 Captioning

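In these experiments, we work with two types of captions. The first are the concept captions, which, as in all other experiments, are used for inversion and for fine-tuning with InfoNCE (listed in Section 10). The second are the target captions, which are used to fine-tune CoCa's autoregressive captioning decoder with $\mathcal{L}_{cap}$.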

Captioning Loss

When fine-tuning on the inverted images, we apply the autoregressive captioning loss defined in [61]:

\mathcal{L}_{cap}=-\sum_{t=1}^{T}\log P_{\theta}(y_{t}|y_{<t},x)

(5)

which aims to predict the next token $y_{t}$ given the previous tokens $y_{<t}$ and the image $x$. The final objective we optimize combines the InfoNCE loss and the captioning loss:

\mathcal{L}=\lambda_{1}\mathcal{L}_{CLIP}+\lambda_{2}\mathcal{L}_{cap}

(6)

where $\lambda_{1},\lambda_{2}\geq 0$. In our fine-tuning, we use $\lambda_{1}=1$ and $\lambda_{2}=0.1$.
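To make Eqs. 5 and 6 concrete, the sketch below evaluates the captioning loss from per-token probabilities and combines it with a contrastive loss value using the weights stated above. The token probabilities and the contrastive loss value are made-up numbers, and the function names are hypothetical; in practice both terms come from a forward pass of the model.

```python
import math


def captioning_loss(token_probs):
    """Eq. 5: negative log-likelihood of the target tokens.

    token_probs[t] stands for P_theta(y_t | y_<t, x), the probability the
    decoder assigns to the correct token at step t.
    """
    return -sum(math.log(p) for p in token_probs)


def total_loss(loss_clip, loss_cap, lambda_1=1.0, lambda_2=0.1):
    """Eq. 6: weighted sum of the contrastive and captioning losses."""
    return lambda_1 * loss_clip + lambda_2 * loss_cap


# Hypothetical values for one sample with a three-token target caption.
token_probs = [0.5, 0.25, 0.8]
loss_cap = captioning_loss(token_probs)  # -log(0.5 * 0.25 * 0.8) = log(10)
loss_total = total_loss(loss_clip=0.7, loss_cap=loss_cap)
```

With $\lambda_{2}=0.1$ the captioning term contributes one tenth of its raw value, so the contrastive term dominates unless the caption likelihood is very poor.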

Target captions template

We use 26 different templates as target captions during fine-tuning. At each optimization step, a template is selected at random for each sample, as follows:

# Target-caption templates; each entry maps a concept name c to a caption.
TEMPLATES = (
    lambda c: f'a bad photo of a {c}.',
    lambda c: f'a low resolution photo of the {c}.',
    lambda c: f'a rendering of a {c}.',
    lambda c: f'a bad photo of the {c}.',
    lambda c: f'a cropped photo of the {c}.',
    lambda c: f'a photo of a hard to see {c}.',
    lambda c: f'a bright photo of a {c}.',
    lambda c: f'a photo of a clean {c}.',
    lambda c: f'a photo of a dirty {c}.',
    lambda c: f'a dark photo of the {c}.',
    lambda c: f'a photo of my {c}.',
    lambda c: f'a bright photo of the {c}.',
    lambda c: f'a cropped photo of a {c}.',
    lambda c: f'a photo of the {c}.',
    lambda c: f'a good photo of the {c}.',
    lambda c: f'a rendering of the {c}.',
    lambda c: f'a photo of one {c}.',
    lambda c: f'a close-up photo of the {c}.',
    lambda c: f'a photo of a {c}.',
    lambda c: f'a low resolution photo of a {c}.',
    # The listing is truncated at this point in the source; the remaining
    # templates are not shown.
)
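The per-sample random selection described above could be sketched as follows. This is an assumption about how the sampling might be written, not the original code: the helper name is hypothetical, only three of the listed templates are repeated so that the block is self-contained, and a seeded generator is used purely for reproducibility.

```python
import random

# A small subset of the templates listed above, repeated for self-containment.
TEMPLATE_SUBSET = (
    lambda c: f'a photo of a {c}.',
    lambda c: f'a rendering of a {c}.',
    lambda c: f'a close-up photo of the {c}.',
)


def sample_target_captions(concepts, templates, rng):
    """Draw one template independently for every sample in the batch."""
    return [rng.choice(templates)(concept) for concept in concepts]


rng = random.Random(0)  # seeded only to make the sketch reproducible
batch_concepts = ['moongate', 'tonometer', 'gyroscope', 'moongate']
target_captions = sample_target_captions(batch_concepts, TEMPLATE_SUBSET, rng)
```

Calling the function again at the next optimization step draws fresh templates, so the same concept is paired with different target captions over the course of fine-tuning.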

8.9 Preliminary results with Implicit Knowledge Transfer

				Learning Rate
Type	Concept	Metric	Baseline	1e-5	2e-5	3e-5	4e-5	5e-5
Implicit	Moongate	Target Acc.	0%	0%	0%	0%	0%	0%
		ImageNet* 0-shot	23.74%	23.82%	23.90%	23.98%	23.94%	23.86%
	Tonometer	Target Acc.	10%	10%	10%	10%	10%	0%
		ImageNet* 0-shot	23.74%	23.84%	23.86%	23.70%	23.64%	23.60%
	Gyroscope	Target Acc.	50%	50%	60%	60%	60%	50%
		ImageNet* 0-shot	23.74%	23.74%	23.62%	23.42%	23.44%	23.46%
Explicit	Moongate	Target Acc.	0%	0%	0%	0%	0%	0%
		ImageNet* 0-shot	23.74%	23.80%	24.08%	24.02%	24.10%	24.20%
	Tonometer	Target Acc.	10%	10%	10%	10%	10%	10%
		ImageNet* 0-shot	23.74%	23.80%	23.74%	23.72%	23.70%	23.56%
	Gyroscope	Target Acc.	50%	50%	50%	50%	40%	30%
		ImageNet* 0-shot	23.74%	23.74%	23.84%	23.84%	23.84%	23.82%

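Table 11: Knowledge Transfer on novel and rare concepts with ViLT using masked language modeling. For Implicit Knowledge Transfer, we feed ViLT a noise image together with the corresponding masked caption. For Explicit Knowledge Transfer, we replace the noise image with an inverted image.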

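This section presents preliminary results on Implicit Knowledge Transfer, introduced in Section 7.2. The goal of Implicit Knowledge Transfer is to teach the model novel concepts by training on text only, without using inverted images. To do this with ViLT, we do not use the image-text matching objective, which requires images; instead, we adopt masked language modeling (MLM) [9], using as input pairs of the concept's textual description and random noise (in place of an image). The hypothesis is that, in a model whose parameters are shared across modalities, fine-tuning on one modality (text) also benefits the other. We further hypothesize that multimodal neurons [50] can assist the transfer of knowledge across modalities during fine-tuning.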

8.9.1 Implicit Knowledge Transfer with MLM

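For Implicit Knowledge Transfer, we use the same masked language modeling setup as ViLT [26], i.e., whole-word masking with a masking probability of 15%. We use 10 examples for fine-tuning, each consisting of a random noise image and a masked caption. The masked captions are generated by applying a different masking to the same caption each time. For the caption we use the template "A $X$ is $Y$", where $X$ is the concept name and $Y$ is the concept description (from Table 13). We use different learning rates with a batch size of 4, for a total of 3 training steps. The weight decay is set to 0.01, as in the other ViLT experiments (Section 8.2).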
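The generation of the masked captions can be illustrated with the simplified sketch below. It is not the ViLT data pipeline: whitespace-separated words stand in for tokenizer output (real whole-word masking masks all sub-word pieces of a selected word together), the `[MASK]` string and helper names are assumptions, and the description is a shortened, lower-cased stand-in for the Table 13 entry.

```python
import random

MASK_TOKEN = '[MASK]'  # placeholder string; the real token depends on the tokenizer


def build_caption(name, description):
    """Fill the 'A X is Y' template with a concept name and its description."""
    return f'A {name} is {description}'


def mask_whole_words(caption, rng, mask_prob=0.15):
    """Replace each whitespace-separated word by the mask with prob. mask_prob."""
    words = caption.split()
    return ' '.join(MASK_TOKEN if rng.random() < mask_prob else w for w in words)


def make_masked_examples(name, description, n_examples=10, seed=0):
    """Produce n_examples differently masked copies of the same caption."""
    rng = random.Random(seed)
    caption = build_caption(name, description)
    return [mask_whole_words(caption, rng) for _ in range(n_examples)]


# Hypothetical shortened description, used only for illustration.
DESCRIPTION = 'a series of rings each nested within the next.'
masked_examples = make_masked_examples('gyroscope', DESCRIPTION)
```

Each of the 10 masked captions would then be paired with a random noise image (implicit case) or an inverted image (explicit case) to form one fine-tuning example.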

Explicit Knowledge Transfer baseline with MLM

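For comparison, we also evaluate Explicit Knowledge Transfer with the masked language modeling objective in place of the image-text matching objective. We use the same setup as in the implicit case, except that inverted images replace the random noise images. In particular, we use the same inverted images used for Explicit Knowledge Transfer with the image-text matching objective.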

8.9.2 Results discussion

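Table 11 shows the results of both Implicit and Explicit Knowledge Transfer with masked language modeling. In both cases there is no improvement on the moongate concept, whose accuracy remains at 0%. For tonometer, Explicit Knowledge Transfer appears to work better, since the implicit case shows a drop in performance (from 10% to 0% at a learning rate of 5e-5), whereas the opposite holds for gyroscope. As with the image-text matching objective, accuracy on the ImageNet-100 classes increases slightly in several settings, although it decreases at some learning rates. The only improvement in target accuracy is for the gyroscope concept in the implicit transfer setting, from 50% to 60%. Overall, Implicit Knowledge Transfer with masked language modeling does not appear to work with the ViLT model. This is probably because ViLT was pre-trained on image-text pairs and expects both modalities as input. As for Explicit Knowledge Transfer with MLM, further experiments are needed to determine the right algorithm and set of hyperparameters; for example, it may be necessary to use more examples generated from different textual descriptions.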

9 Code

The code will be released upon acceptance of the paper.

10 List of captions

Table 12: Descriptions of rare concepts (generated with Llama-3-8B-Instruct).

Moongate	A perfectly circular archway built from uniformly cut stones or bricks, set into a larger wall. It forms a smooth circle, framing views of gardens or landscapes beyond, creating a picturesque portal.
Tonometer	A slender, pen-like probe attached to a small base equipped with precise dials and gauges. This tool is often part of a larger medical apparatus, featuring a metallic finish and a refined, professional appearance.
Gyroscope	A series of gleaming silver rings, each nested perfectly within the next, surrounds a central disk that spins smoothly. The rings are connected by intersecting axes, allowing the disk to tilt and rotate freely while maintaining a sophisticated, mechanical look.

Table 13: Manually shortened descriptions of rare concepts (to fit ViLT's 40-token input).

Moongate	A perfectly circular archway built from uniformly cut stones or bricks, set into a larger wall. It forms a smooth circle, framing views of gardens, creating a picturesque portal.
Tonometer	A slender, pen-like probe attached to a small base equipped with precise dials and gauges. This tool is often part of a larger medical apparatus.
Gyroscope	A series of rings each nested within the next, surrounds a central disk that spins. The rings are connected by intersecting axes allowing the disk to rotate freely.

Table 14: Descriptions of the JSRT medical classes (obtained with a combination of Radiopaedia and ChatGPT-4).

Benign Nodule	A small, round spots appearing in Chest X-Ray, typically well-defined with smooth, regular borders. These spots are often uniformly dense and do not cause distortion of surrounding structures.
Lung Cancer	A dense and irregular mass on Chest X-Ray images often with spiked or uneven edges. It may appear in the lung’s periphery or near the airways.

Table 15: Descriptions of the CheXpert-5x200c medical classes (obtained with a combination of Radiopaedia and ChatGPT-4).

Atelectasis	A small areas of collapsed lung. It is usually seen on Chest X-Rays as small volume linear shadows, usually peripherally or at lung bases, appearing more opaque and shrunken.
Cardiomegaly	Enlargement of the heart usually seen in Chest X-Rays. The central shadow of the chest appears enlarged, extending beyond half the width of the entire chest cavity.
Pleural Effusion	A collection of fluid between the lungs and the chest, which makes the area appear white and smooth in Chest X-Ray images. The area does not present visible lung markings.
Consolidation	An area inside the lungs that appears as branching low attenuating (lucent) bronchi surrounded by high attenuating (dense) consolidated/opacified alveoli on Chest X-Ray images.
Edema	An abnormal accumulation of fluid in the extravascular compartments of the lung, which makes the area whiter in Chest X-Ray images. It is usually present on both lungs.

Table 16: Descriptions of the medical classes for segmentation (obtained with a combination of Radiopaedia and ChatGPT-4).

Lung Nodules	Circular spots appearing within the lung fields, with clear and defined edges in CT images. They are denser than the surrounding tissue, often appearing in shades of gray or white, with varying size.
Breast Tumor	A dark, irregularly shaped area is visible against the lighter surrounding tissue. The borders may appear uneven or spiculated, and the area is typically less uniform in texture. Shadowing can often be seen beneath the mass.
Pneumothorax	An abnormal collection of air in the pleural space, which allows the parietal and visceral pleura to separate and the lung to collapse. The pleura edge is thin and no lung markings are visible.
Brain Tumor	An irregular bright mass in brain MRI, often with thick and irregular margins, surrounded by vasogenic-type edema or fluid accumulation. It may also have a hemorrhagic component.

Table 17: Descriptions of the MSCOCO classes used in the text and image retrieval experiments (generated with ChatGPT-4).

person	A human figure, typically with visible head, torso, arms, and legs, in various postures.
bicycle	A two-wheeled vehicle with a frame, handlebars, and pedals, usually ridden by a person.
car	A four-wheeled enclosed vehicle with windows and doors, commonly seen on roads.
motorcycle	A two-wheeled motorized vehicle with a seat and handlebars, typically ridden by one or two people.
airplane	A large flying vehicle with wings and a tail, often seen with windows along the sides for passengers.
bus	A large, rectangular vehicle with many windows and seating rows, designed to carry multiple passengers.
train	A long, linked series of vehicles running on tracks, often with a locomotive at the front.
truck	A large vehicle with a separate cab and an open or enclosed cargo area for transporting goods.
boat	A small to medium-sized watercraft with a hull and often visible sails or an engine.
traffic light	A vertical or horizontal post with red, yellow, and green lights, used to control vehicle flow at intersections.
fire hydrant	A small, red, metal cylinder with nozzles on the side, often found on sidewalks for fire emergencies.
stop sign	A red, octagonal sign with the word "STOP" in white, used to indicate where vehicles must halt.
parking meter	A tall, narrow post with a small display and slot, used to pay for parking time.
bench	A long seat, often with a backrest, typically found in parks or public areas.
bird	A small animal with feathers, wings, and a beak, often shown perched or flying.
cat	A small, furry animal with pointed ears, whiskers, and a long tail, often seen sitting or grooming.
dog	A furry, four-legged animal with a tail, usually seen with a collar or leash.
horse	A large, four-legged animal with a mane and tail, often depicted standing or galloping.
sheep	A woolly animal with a round body, small head, and short legs, often seen in groups in fields.
cow	A large animal with a boxy body, horns, and a long face, often shown grazing or with an udder.
elephant	A massive, gray animal with a long trunk, large ears, and tusks.
bear	A large, sturdy animal with thick fur, rounded ears, and a short tail, often shown standing or walking on all fours.
zebra	A horse-like animal with black and white stripes across its body.
giraffe	A very tall animal with a long neck and legs, spotted coat, and small horns on its head.
backpack	A bag with shoulder straps, typically worn on the back and used for carrying personal items.
umbrella	A foldable, rounded canopy on a stick, used for protection from rain or sun.
handbag	A small to medium-sized bag with handles, often carried by hand and used to hold personal items.
tie	A long, narrow piece of fabric worn around the neck, often knotted at the collar of a shirt.
suitcase	A rectangular, boxy container with a handle, used for carrying clothes and personal items when traveling.
frisbee	A flat, round disc often made of plastic, used for throwing and catching.
skis	Long, narrow pieces of equipment attached to boots, used for gliding on snow.
snowboard	A flat, wide board attached to boots, used for sliding on snow.
sports ball	A round object of varying sizes, such as a soccer ball or basketball, used in sports.
kite	A lightweight object with a string, often shaped like a diamond or triangle, designed to fly in the wind.
baseball bat	A smooth, cylindrical wooden or metal stick used to hit a baseball.
baseball glove	A padded, leather glove worn on one hand, used to catch baseballs.
skateboard	A narrow board with wheels, used for rolling and performing tricks.
surfboard	A long, flat board used for riding waves in the ocean.
tennis racket	An oval-shaped frame with strings and a handle, used to hit a tennis ball.
bottle	A narrow-necked container with a cap, often used to hold liquids like water or soda.
wine glass	A stemmed glass with a wide bowl at the top, used for drinking wine.
cup	A small, handleless vessel used for drinking, usually made of ceramic or plastic.
fork	A utensil with multiple prongs, used to pick up food.
knife	A utensil with a long, sharp blade, used for cutting food.
spoon	A utensil with a shallow bowl at the end of a handle, used for eating or serving food.
bowl	A round, deep dish, often used to hold soup or other foods.
banana	A long, yellow fruit with a curved shape and soft interior.
apple	A round fruit, typically red or green, with a stem at the top.
sandwich	Two slices of bread with filling in between, such as meat, cheese, or vegetables.
orange	A round, orange-colored fruit with a thick, textured peel.
broccoli	A green vegetable with a tree-like shape, featuring a thick stalk and small florets.
carrot	A long, orange vegetable with a pointed end, often with green leaves at the top.
hot dog	A sausage in a bun, often with condiments like ketchup or mustard.
pizza	A round, flatbread topped with cheese, sauce, and various toppings, often cut into slices.
donut	A round, fried pastry with a hole in the middle, often glazed or topped with sprinkles.
cake	A sweet, layered dessert, often decorated with frosting or fruit.
chair	A piece of furniture with a backrest and four legs, designed for sitting.
couch	A large, cushioned seat with a backrest and arms, designed for multiple people.
potted plant	A plant growing in a container, often with green leaves or flowers.
bed	A large, rectangular piece of furniture for sleeping, with a mattress and pillows.
dining table	A flat, often rectangular surface with legs, designed for eating meals.
toilet	A porcelain fixture with a seat and flushing mechanism, used in bathrooms.
tv	A rectangular screen on a stand or wall, used for viewing shows and movies.
laptop	A portable computer with a hinged screen and keyboard.
mouse	A small, handheld device used to control a cursor on a computer screen.
remote	A small, rectangular device with buttons, used to control electronics like TVs.
keyboard	A flat, rectangular panel with keys, used for typing on computers.
cell phone	A handheld electronic device with a screen and buttons or touchscreen, used for communication.
microwave	A box-like appliance with a door, used for heating food quickly.
oven	A large appliance with a door and interior racks, used for baking or roasting.
toaster	A small appliance with slots, used to toast bread.
sink	A basin with a faucet, used for washing hands, dishes, or food.
refrigerator	A large, box-like appliance with doors, used to store perishable food at low temperatures.
book	A collection of pages bound together with a cover, containing text or images.
clock	A circular or rectangular device with hands or digital display, showing the current time.
vase	A decorative container, often made of glass or ceramic, used to hold flowers.
scissors	A handheld tool with two blades, used for cutting paper or fabric.
teddy bear	A soft, stuffed toy shaped like a bear, often used by children.
hair drier	A handheld device that blows warm air, used to dry hair.
toothbrush	A small brush with a handle, used for cleaning teeth.