Fall 2025 AI Safety Research Program

(This opportunity will be conducted entirely in English, since the research program requires reading specialised English-language material. However, you can still communicate with the HAISN coordinators in Vietnamese if you prefer.)

Applications are now open - apply here.

Antoan.AI is partnering with Hanoi AI Safety Network on the Fall Research Fellowship (FRF) to offer undergraduate/graduate students and early-career/mid-career professionals the opportunity to engage in AI safety research and join a community of people working on AI safety. Previous fellows have published their research projects at ICLR, ICML, and AAAI.

FRF projects span safety evaluations, interpretability, AI control, adversarial robustness, and more. You will have the chance to express your interests and preferred project later in the application.

The FRF will run for 1-3 months starting in August, depending on the nature of your project and participants' availability. You can participate part-time and remotely. Funding is available for GPU compute and AI subscriptions.

You might be a good fit if:

You are interested in making advanced AI systems safe and trustworthy.
You have a technical background (e.g. ML, CS, cybersecurity, maths, physics, or neuroscience).
You are curiosity-driven and willing to pursue your own ideas.

Plus points (not required):

You have or plan to pursue a career in research.
You have a deep understanding of AI safety arguments and the AI safety landscape.

The application process consists of a short form and a chat with one of our organisers. We will review applications on a rolling basis, and the deadline to apply is Friday, August 8th, 23:59 AoE. For more information, fill out this form.

Example AI safety projects are listed below. Feel free to express interest in any of these, or propose your own!

AI models are very good at inferring where you are from a single image, or at inferring your personal information purely from a paragraph of text. This capability is concerning, both from a privacy perspective and because we understand it poorly. Can we develop a better understanding of this phenomenon and of models' full capabilities? For example, can they guess the locations of indoor images as well? Can we use circuit tracing to understand how they arrive at their guesses? Do their stated explanations match what the tracing shows, or are they unfaithful?

Current AI systems often know when they are being evaluated. This is concerning, as it undermines the reliability of safety evaluations and allows for strategic deception, such as sandbagging (intentionally doing badly on evaluations). Blackbox approaches, i.e. simply asking the models whether they are aware, might introduce unwanted signals, or might not be effective if models are sandbagging. Prior research points to interpretability tools as a promising avenue. Can we create model organisms of evaluation-aware sandbaggers through finetuning, then demonstrate that whitebox approaches such as probing or noising outperform purely blackbox ones?

The future will likely include many AI agents running wild on the internet. It is already known that if you put multiple AI agents into the same situation, weird and concerning things can happen (infectious jailbreaks, the Claude bliss attractor, etc). Can we develop a more comprehensive understanding of multi-agent dynamics, for example, by using activation engineering to discover cooperative traits, or by finding instances of 'attractor states' in misaligned models?

Join the Discord