BILLY
Steering Large Language Models via Merging Persona Vectors for Creative Generation

A training-free framework that captures the creative benefits of multi-LLM collaboration within a single model, drastically reducing computational costs and latency.

Tsung-Min Pai¹, Jui-I Wang¹, Li-Chun Lu¹, Shao-Hua Sun¹, Hung-Yi Lee¹, Kai-Wei Chang²

¹National Taiwan University, ²Massachusetts Institute of Technology

Introduction

The Promise of Creative AI

Creativity is widely recognized as a cornerstone of human progress, driving innovation across domains and enabling major scientific breakthroughs. Recent research has explored the creativity of large language models (LLMs), viewing them as promising tools for applications such as story writing, design ideation, and scientific discovery, thereby augmenting human problem-solving and imagination.

Multi-LLM Paradigms: Promise and Limitations

Recent studies highlight the potential of the multi-LLM paradigm, which aims to simulate human collective intelligence by engaging multiple LLMs in iterative discussion to arrive at more comprehensive and well-balanced solutions. This paradigm allows systems to generate a broader range of ideas beyond what a single model can reach.

In the context of creativity, these frameworks often assign diverse roles to LLMs and employ structured, multi-round interactions. For example, one LLM may act as a creative professional while another serves as an environmentalist, enabling the system to explore and integrate multiple perspectives through iterative exchanges.

⚠️ The Challenge

Despite these advantages, multi-LLM frameworks face a major challenge: the substantial computational cost and inference latency of coordinating multiple models. The iterative nature of multi-LLM dialogue greatly inflates the number of input and output tokens, making the process computationally expensive and slow, which limits scalability in real-world settings.
Moreover, such systems risk process loss, where the benefits of multi-LLM interaction are offset by inefficiencies in communication and integration, leading to suboptimal outcomes.

The BILLY Solution

We therefore ask: Without relying on multiple models carrying on a multi-round discussion, can a single LLM simulate diverse perspectives in an efficient manner?

While prompting approaches exist (asking models to adopt multiple roles simultaneously), they depend heavily on the model's ability to integrate and coordinate multiple perspectives within a single response. In practice, LLMs may handle each persona separately but struggle to integrate them coherently.

To address these challenges while retaining the creative advantages of multi-LLM interaction, we propose BILLY: BlendIng persona vectors for Large Language model creativitY. Inspired by recent advances in extracting persona vectors and steering model behavior through activation engineering, BILLY extends these ideas by merging multiple persona vectors within a single LLM.

✨ Our Contributions

  • Enhanced Creativity: BILLY blends multiple persona vectors within a single LLM, generating diverse, multi-perspective creative responses that surpass both single-model prompting and multi-LLM approaches.
  • Efficiency and Simplicity: Entirely training-free, requiring no additional fine-tuning or multi-LLM communication, achieving superior creativity with substantially lower computational costs.
  • Interpretability via Persona Vectors: Operating directly in the latent activation space, BILLY offers an interpretable mechanism for understanding and steering creativity with fine-grained control.

About BILLY

BILLY's core idea is to capture the benefits of multi-LLM collaboration—inducing diverse perspectives and specialized expertise—within a single model. We achieve this by extracting and blending multiple distinct persona vectors directly in the model's activation space. During inference, this merged vector steers the generation process, enabling multi-perspective output without explicit multi-LLM communication.

BILLY Method Diagram

BILLY (BlendIng persona vectors for Large Language model creativitY). To enhance the creativity of a single LLM, we extract and fuse the persona vectors of a Creative Professional and an Environmentalist, then steer the base model with this composite vector to generate outputs that draw on both perspectives.

Approach

BILLY achieves multi-persona fusion through three stages: persona vector extraction, offline fusion, and inference-time steering.

🎭 Persona Vector Extraction

A persona vector represents the characteristics of a specific persona (e.g., evil, humorous) as a directional vector within the model's activation space, capturing the shift in the model's activation states when it adopts a persona compared to when it responds neutrally.

Extraction Process:

  1. Design contrastive system prompts (positive vs. negative)
  2. Employ an LLM judge to score the alignment of each response with the corresponding trait
  3. Filter the responses with a threshold to ensure a clear distinction between the two corpora
Persona Vector Formula

Persona Vector = Mean activation of positive set - Mean activation of negative set
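The difference-of-means formula above can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the 3-dimensional lists stand in for real hidden-state vectors collected at one layer, and the function names are ours.

```python
# Sketch of persona-vector extraction via difference of means.
# Toy vectors stand in for a model's hidden states at one layer.

def mean_activation(activations):
    """Element-wise mean over a list of hidden-state vectors."""
    n = len(activations)
    return [sum(dims) / n for dims in zip(*activations)]

def persona_vector(pos_acts, neg_acts):
    """Difference-of-means direction for one persona at one layer."""
    mu_pos = mean_activation(pos_acts)
    mu_neg = mean_activation(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

# Activations from persona-aligned (positive) vs. neutral (negative) responses.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
print(persona_vector(pos, neg))  # [1.0, 1.0, 0.0]
```

In practice the activations would be extracted from the model's residual stream over the filtered positive and negative response corpora described above.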

⚡ Offline Fusion

Unlike multi-LLM systems that require costly online interaction between agents, our approach fuses these perspectives in an offline step. We compute a single composite steering vector by taking the average of N extracted persona vectors.

Fusion Formula

Merged Vector = Average of all persona vectors
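Because the fusion step is a simple element-wise average, it can be sketched directly; the toy persona vectors below are illustrative placeholders, not values from the paper.

```python
def merge_persona_vectors(vectors):
    """Composite steering vector = element-wise mean of N persona vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

v_cre = [2.0, 0.0, 1.0]   # toy creative-professional direction
v_env = [0.0, 2.0, 1.0]   # toy environmentalist direction
merged = merge_persona_vectors([v_cre, v_env])
print(merged)  # [1.0, 1.0, 1.0]
```

Since the average is computed once offline, adding more personas changes only this precomputation, not the per-token inference cost.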

🎯 Inference-time Steering

The final stage applies the composite vector to steer the model's generation process. During a standard forward pass, we intervene at a chosen steering layer (layer 20 in our experiments) by adding our composite vector, scaled by a coefficient α.

Steering Formula

Steered Activation = Original Activation + α × Merged Vector

Key Advantage: This steering process requires only a single additive operation during each inference step and involves no additional training.
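The single additive operation can be sketched as follows. In a real setup this addition would be applied inside the forward pass at the steering layer (e.g., via a framework-level hook on layer 20); here we show only the arithmetic, with toy values.

```python
def steer(activation, merged_vector, alpha=2.0):
    """Steered activation = original activation + alpha * merged vector."""
    return [a + alpha * v for a, v in zip(activation, merged_vector)]

a_original = [0.5, -1.0, 0.0]   # toy hidden state at the steering layer
merged = [1.0, 1.0, 1.0]        # toy composite persona vector
print(steer(a_original, merged))  # [2.5, 1.0, 2.0]
```

The coefficient α (2.0 in the paper's default setting) controls steering strength; the hyperparameter study below examines its sensitivity.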

Benchmarks & Metrics

Creativity Benchmarks

Benchmark Description Sample Task
AUT Evaluates divergent thinking by requiring the generation of numerous unconventional applications for an object. What are some creative uses for a mug?
INSTANCES Measures the ability to produce a diverse set of examples that satisfy a given property. Name 5 square things you can think of.
SIMILARITIES Assesses associative creativity by challenging participants to identify non-obvious connections between two concepts. Tell me 5 ways in which a brick and a stone are alike.
SCIENTIFIC Probes creative problem-solving within a scientific framework. Find different scientific uses for a spoon.

Evaluation Metrics

Originality

Assesses the statistical rarity or unconventionality of a response. Ideas are scored based on novelty and divergence from common answers.

Elaboration

Measures the level of detail and supportive information. It evaluates the ability to expand upon a core idea with depth and context.

Baselines

We compare against both single-agent prompting methods and multi-agent systems as baselines. All baseline prompts are detailed in Appendix A.

Single Agent (SA)

A single LLM prompted to respond creatively, with the temperature set to 0.7, which is a commonly used value.

SA with High-temperature Decoding (T=1.0)

The temperature is increased to 1.0 to stimulate higher levels of diversity. By allowing the model to explore a broader range of possible outputs, this setting encourages more diverse responses.

SA with Multi-Role Prompt (SA-MRP)

A multi-role prompting variant where the model is asked to respond from multiple professional perspectives, such as environmentalist, creative professional, and futurist, each with distinct expertise styles. This serves as a strong baseline to determine the performance limits of enhancing creativity through prompt engineering alone.

LLM Discussion

A multi-LLM framework proposed by Lu et al. (2024), which organizes multiple LLM agents into a structured three-phase process—initiation, discussion, and convergence—with each agent role-played under distinct personas to diversify perspectives. Agents exchange ideas over several rounds and then consolidate them into final outputs.

Main Results

Main evaluation results table

We use Qwen-2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-3-4B-it as our base models and employ GPT-4o-mini as the judge. The highest scores are in purple, second-highest in blue. BILLY's Originality scores surpass nearly all baselines across all benchmarks.

Our primary experiments evaluate BILLY against several baselines by steering three distinct open-source models: Qwen-2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-3-4B-it. The main results, aggregated across all four creativity benchmarks, are presented in the table above.

Key Findings:

  • Consistent Originality Superiority: BILLY consistently outperforms all baseline methods in terms of Originality across four benchmarks, demonstrating the effectiveness of using internal representational control to elicit creative responses.
  • Outperforms Multi-Agent Systems: For both Qwen and Llama models, BILLY surpasses the strong but costly LLM Discussion, as well as various single-agent configurations.
  • Prompt Control Limitations: While SA-MRP occasionally provides slight improvements over SA, its performance is inconsistent, reinforcing our hypothesis that prompt-based control is inherently less reliable than direct vector steering.
  • Model Size Considerations: The Gemma-3-4B-it model struggles with limitations in long-form, multi-round LLM Discussion tasks due to context instability, though BILLY still achieves the highest Originality scores.

Human Evaluation & Correlation

Table 3: Human Evaluation Results

The highest scores are highlighted in purple, while the second-highest are in blue. Across all benchmarks, BILLY (Ours) consistently achieves the strongest Originality and Elaboration scores.

Table 4: Correlation Between Human and LLM Judges

Spearman and Pearson correlations between LLM-based evaluations and average human evaluations for Originality and Elaboration.

Human evaluation is conducted on a subset of methods due to practical constraints. Overall, the trends are consistent with our LLM-based evaluation.

BILLY achieves the highest average scores in both Originality and Elaboration, outperforming the SA and LLM Discussion methods. These results suggest that direct activation steering provides a more effective and stable mechanism for eliciting creative responses.

We also compute the Spearman rank correlation and Pearson correlation between our LLM judge and human raters. The results indicate a strong positive correlation for Originality (Spearman 0.73). While the correlation for Elaboration is more moderate, this aligns with observations reported for LLM Discussion (Lu et al., 2024).

Analysis

The following analyses adopt the Llama-3.1-8B-Instruct model, which we find sufficiently representative.

Efficiency Analysis

Comparison of inference time and token cost

BILLY (Ours)

Inference Time: 19 s
Token Cost: $0.30 / 10k queries

LLM Discussion

Inference Time: 513 s (27.02×)
Token Cost: $25.50 / 10k queries (90.9×)

Inference Time and Token Cost. BILLY demonstrates a significant reduction in both latency and token cost compared to the LLM Discussion baseline.

Qualitative Result

Qualitative comparison of persona fusion

A case study on the 'Reimagine the Hospital' task. The merged vector demonstrates a true conceptual fusion, reframing substantive concepts from the environmentalist with the visionary style of the creative vector.

Analysis of Various Vector Compositions

Persona Vector Combinations Analysis

Persona Vector Combinations Analysis. Starting from the default set of four vectors, we vary the combination of merged persona vectors from one to seven.

To investigate the relationship between the quantity of fused personas and creativity, we vary the number of merged vectors from one to seven. The results are presented in Figure 2 with the specific persona vector combinations detailed in Section D.

Key Findings:

  • Creative Professional Dominance: Any vector merge with the creative professional persona results in exceptionally high creative performance.
  • Diminishing Returns: While increasing the number of persona vectors from one (1-CRE) to three yields a noticeable improvement, further additions from four to seven vectors do not produce additional significant gains.
  • Non-Additive Benefits: This suggests that the primary benefit of BILLY is not simply additive—quality of composition matters more than quantity.

Projection Analysis

Projection analysis of activation steering

Projection of activation changes onto persona vectors. Vector steering with BILLY (green) shows superior control over prompting (red). Merged vectors successfully co-activate both personas, unlike single vectors.

To verify how effectively our method aligns model representations with target personas, we analyze the projection of activation changes per layer Δa(l) in Equation (4) onto layer-specific persona vectors.

The core idea is to measure the change in the model's activations caused by our steering vector and then calculate how much of the change occurred along the intended persona's direction.

Analysis Process:

Step 1: Define Activation Difference Vector

First, we define the activation difference vector, Δa(l), for each layer l. This vector is the difference between the steered activations a(l)_steered and the original activations a(l)_original from a standard forward pass without steering:

Activation Difference Formula

Equation (4): Activation difference calculation

Step 2: Calculate Projection Value

Second, to quantify the alignment of this change with a specific persona, we project the activation difference vector Δa(l) onto that persona's predefined, layer-specific unit persona vector. The resulting projection value, Projection(l)_persona, is calculated via the dot product:

Projection Calculation Formula

Equation (5): Projection value calculation

where v(l)_persona is the persona vector for a given persona at layer l. A higher positive projection value indicates that our steering intervention more strongly shifted the model's activations along that persona's semantic axis.
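The two steps above reduce to a normalized dot product, sketched here with toy vectors (the variable names and values are illustrative, not from the paper's code):

```python
import math

def projection(delta_a, persona_vec):
    """Scalar projection of the activation change onto the unit persona direction."""
    norm = math.sqrt(sum(v * v for v in persona_vec))
    unit = [v / norm for v in persona_vec]          # layer-specific unit vector
    return sum(d * u for d, u in zip(delta_a, unit))  # dot product

delta = [2.0, 1.0, 0.0]   # toy Δa^(l) = a_steered − a_original
v_cre = [1.0, 0.0, 0.0]   # toy persona direction at layer l
print(projection(delta, v_cre))  # 2.0
```

A large positive value means the intervention moved the activations along that persona's semantic axis; a negative value means it moved them away.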

Key Findings:

  • Vector Steering Superiority: While prompting fails to consistently induce all intended personas (negative projections on vENV), BILLY consistently achieves positive projections on both vCRE and vENV.
  • Superior Controllability: Direct latent space manipulation enables more reliable and interpretable multi-faceted persona generation compared to prompting approaches.
  • Selective Persona Control: Individual vectors (BILLY-CRE, BILLY-ENV) precisely control their intended semantic concepts with minimal cross-activation.

Hyperparameter Sensitivity Analysis

Our experiments adopt the default persona-vector steering settings of Chen et al. (2025): layer = 20 and α = 2.0. We further conduct a more comprehensive study across 4 layers (12–24) and 7 α values (0.1–5.0) on all tested models, shown in Table 15 and Table 16. Our observations:

  • Sensitivity to layer: Across all models, Layer 20 consistently yields near-optimal performance, confirming it as a robust cross-model choice.
  • Sensitivity to α: Models exhibit high stability within the range 1.0 ≤ α ≤ 2.5. Our chosen α = 2.0 falls within this optimal window, whereas extreme values (α = 0.1 or α = 5.0) lead to performance degradation.

Originality Scores

Originality Ablation Study

Originality scores across different models, layers, and α values.

Elaboration Scores

Elaboration Ablation Study

Elaboration scores across different models, layers, and α values.

Conclusion

We introduce BILLY, an efficient activation steering method that enhances LLM creativity while avoiding the high inference time and costs of multi-agent systems.

Extensive experiments on four benchmarks show that BILLY matches or surpasses strong creative baselines while reducing inference time and cost by over 95%.

Furthermore, our qualitative and projection analyses show that different persona vectors influence distinct aspects of generation (style vs. content), and that BILLY offers superior controllability and interpretability compared to simple prompting methods.

Limitations & Future Work

While BILLY demonstrates the power of persona vector merging, our current approach uses simple averaging for vector composition. We recognize that developing a more sophisticated framework for composition is a key direction for future research.

Future work could explore advanced composition techniques, such as:

  • Learning task-specific weights for each persona vector.
  • Designing mechanisms that explicitly model the functional roles of different personas.
  • Investigating the non-linear interactions that occur when multiple vectors are combined.

Such advancements would support the development of more generalizable and robustly controllable models capable of complex, multi-faceted reasoning.

How to Cite

📄 Paper

https://arxiv.org/abs/2510.10157

💻 GitHub

https://github.com/bai1026/LLM_Persona

BibTeX

@inproceedings{pai2025billysteeringlargelanguage,
  title={BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation}, 
  author={Tsung-Min Pai and Jui-I Wang and Li-Chun Lu and Shao-Hua Sun and Hung-Yi Lee and Kai-Wei Chang},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics},
  year={2026}
}