LLM Fighter: The New Way To Benchmark LLMs
Mixtral vs. ChatGPT vs. Grok, which wins in Street Fighter?
LLM Coliseum is a new project where LLMs compete in real time as a means of evaluating their performance
Developed by Stan Gerard and the Quiver Brain team, “LLM Coliseum” evaluates LLMs' performance in real time using Street Fighter 3. The evaluation is based on speed, intelligence, adaptability, and resilience in a fast-paced gaming environment. Surprisingly, smaller, faster models like GPT-3.5 Turbo outperform larger ones such as GPT-4 in this context. This video provides a tutorial on installing the open-source project and proposes LLM Coliseum as a novel benchmark for testing LLMs.
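Under the hood, a benchmark like this boils down to a tight loop: describe the game state to the model, parse a move out of its reply, and score both the choice and the latency. Below is a hedged sketch of such a loop using the OpenAI Python SDK; the move list, prompt, and helper name are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of a real-time LLM game-benchmark loop; the move set,
# prompt, and function names are illustrative, not LLM Coliseum's actual code.
import time
from openai import OpenAI

client = OpenAI()
MOVES = ["move_left", "move_right", "jump", "punch", "kick", "block"]  # assumed move set

def next_move(state: str, model: str = "gpt-3.5-turbo") -> tuple[str, float]:
    """Ask the model for one move; return the move and its latency in seconds."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        max_tokens=5,
        messages=[
            {"role": "system",
             "content": f"You are a Street Fighter player. Reply with exactly one of: {', '.join(MOVES)}."},
            {"role": "user", "content": state},
        ],
    )
    latency = time.perf_counter() - start
    move = resp.choices[0].message.content.strip()
    return (move if move in MOVES else "block"), latency  # fall back on unparseable replies
```

Because each round trip blocks the fighter's next action, a lower-latency model simply lands more moves per second, which is consistent with the finding that GPT-3.5 Turbo beats GPT-4 here.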
Microsoft to build ‘Stargate’ supercomputer with millions of chips for OpenAI - Microsoft Corp. is developing a supercomputer for OpenAI nicknamed "Stargate," planned to contain millions of processors and targeting a release as early as 2028. The project, which may cost up to $100 billion, follows Microsoft's existing Azure supercomputer upgrades for OpenAI, which now include tens of thousands of A100 chips. Microsoft has gradually escalated its AI infrastructure, beginning with a 10,000-GPU system that ranked among the top five supercomputers globally in 2020. Notably, the company's AI-focused efforts span a five-phase project, with Stargate slated for the final phase. Beyond OpenAI, Microsoft has used this expanding supercomputing capability for its own AI development and offers it to cloud customers. While the hardware has predominantly been Nvidia-based, that could change with Microsoft's own Azure Maia accelerator. Stargate's completion will likely postdate OpenAI's GPT-5, anticipated soon with significantly enhanced features compared to GPT-4.
For Data-Guzzling AI Companies, the Internet Is Too Small - The rapid development of AI technologies by companies like OpenAI and Google is leading to a potential shortage of high-quality public data, as these increasingly powerful systems require vast amounts of information to learn from. To address this, companies are exploring new sources of data, such as transcriptions of public YouTube videos, and experimenting with synthetic data, despite the risk of causing malfunctions. The shortage of essential resources, including specialized chips, data centers, and electricity, along with restricted access to data from social media platforms and news publishers, poses additional challenges. Despite these hurdles, researchers remain optimistic that new solutions will emerge, drawing parallels to the way technological advancements have mitigated past concerns like "peak oil."
Biden orders every US agency to appoint a chief AI officer - The White House has implemented a comprehensive government-wide policy to address the risks and benefits of artificial intelligence (AI). Each federal agency is required to appoint a chief AI officer to oversee AI-related initiatives, ensuring uses of AI are safe, secure, and respectful of civil liberties and democratic values. By December 1, all non-compliant AI uses must be corrected. The officers have significant responsibilities, including assessing AI's impact on public safety and rights, maintaining minimum safety standards, and engaging with the public. They will form a Chief AI Officer Council to share practices and monitor advances in AI across agencies. These officers are also responsible for ensuring AI applications uphold privacy, equity, and civil rights while promoting innovation, with certain allowances for waiving opt-outs.
US, Japan to call for deeper cooperation in AI, semiconductors, Asahi says - Japan and the United States are poised to announce enhanced cooperation in high-tech sectors, with a focus on artificial intelligence (AI) and semiconductors, during Prime Minister Fumio Kishida's visit to the U.S. on April 10. The planned joint statement will call for a "global partnership" and is expected to include the establishment of a framework for AI research and development, involving tech giants such as Nvidia, ARM, and Amazon. This initiative is part of the U.S.'s broader strategy to prevent China from accessing advanced AI technologies that could potentially be used to bolster its military capabilities. The collaboration underscores the strategic importance of AI in the technological and defense sectors, reflecting a deepening alliance between Japan and the U.S. in the face of global tech competition.
We’re Focusing on the Wrong Kind of AI Apocalypse - The discourse on AI's future often gravitates toward dramatic fears of Artificial General Intelligence surpassing human control, potentially resulting in mass unemployment or becoming uncontrollable. While acknowledging these concerns, the article advises shifting focus to immediate dangers, such as misinformation and deepfakes, and stresses the urgency of making strategic decisions within organizations about AI's integration. Empirical evidence suggests AI significantly boosts productivity and affects high-skilled, creative jobs. Rather than reducing headcount for cost savings, the article recommends leveraging AI to enhance meaningful work and urges managers to take a proactive approach to restructuring work positively around AI, ensuring it empowers rather than diminishes human workers. It emphasizes the role of various organizational levels in shaping AI's use and calls for early, widespread engagement to prevent organizations from becoming passive recipients of AI's transformative impact.
Amazon bets $150 Billion on Data centers required for AI Boom - Amazon plans to invest $150 billion over the next 15 years to maintain its lead in the cloud computing market and meet the growing demand for AI applications and digital services. The company will expand existing facilities and build new ones in various locations worldwide. However, the rapid growth of data centers faces challenges, such as securing sufficient electricity and dealing with opposition from local communities. Despite these challenges, Amazon remains committed to powering its operations with renewable energy and is exploring clean energy projects to match its energy needs with carbon-free power.
Gen-AI Search Engine Perplexity Has a Plan to Sell Ads - Perplexity, an AI search engine rivaling Google and backed by prominent investors including Jeff Bezos, is set to introduce native advertising. The search engine, which combines OpenAI's GPT models with its own technology, will place ads within the related questions that make up 40% of its queries. Although it initially touted an ad-free vision, advertising was always a planned revenue path. Currently operating on a $20 monthly subscription, the one-year-old platform reported over 10 million monthly active users as of January. Advertisers see potential in Perplexity's ad format, but the platform faces challenges, including scaling user numbers, maintaining brand safety, and proving effective targeting. Brand safety and the relevance of sponsored content are particular concerns for marketers considering investment in the platform's new advertising strategy.
21 nonprofits join our first generative AI accelerator - Social impact teams are leveraging generative AI to increase productivity, creativity, and cost-effectiveness in their community service efforts. Despite 80% of nonprofits recognizing its potential, nearly half are not using generative AI due to various barriers. Google.org’s Accelerator program aims to address this by providing a six-month support program for nonprofits developing impactful generative AI applications, including training, mentorship, and over $20 million in funding. The initiative will benefit projects focused on climate, health, education, and crisis response. Selected nonprofits will receive substantial Google assistance in building their AI tools, such as an AI-powered caseworker assistant, personalized tutoring services, and tools for better job matching and legal guidance.
How AI Reshapes Vocabulary: Unveiling the Most Used Terms Related to the Technology - The website presents a glossary of AI-related terms, providing an understanding of contemporary AI language shaped by emerging technologies and their incorporation into everyday usage. Utilizing the News on the Web (NOW) Corpus, Dictionary.com, and AI-themed online content, the research identified and tracked the popularity of key terms such as Generative AI, prompts, AI models, AI bots, the GPT algorithm, open source AI, Large Language Models (LLMs), AI safety, AGI, Responsible AI, AI images, conversational AI, AI assistants, and AI art. It discusses AI's transformational impact on industries and the rise of specific roles, like prompt engineers. The surge in mentions of terms such as GPT, following OpenAI's releases, and the growth of LLMs are highlighted. The analysis also explores the ethical considerations in AI development.
Nvidia’s Blackwell Chip is Here - How will it affect the AI Landscape? - Nvidia is set to release its powerful next-generation AI processor, Blackwell, later this year. The GB200 GPUs are expected to be four times faster than the current H100 GPUs and could lead to significant advancements in AI technologies. Nvidia aims to create an entire supercomputer ecosystem, including hardware and software, to establish its dominance in the AI market. In response, companies like Alphabet, Qualcomm, and Intel are forming an alliance to reduce reliance on Nvidia's platforms.
Awesome Research Papers
The Unreasonable Ineffectiveness of the Deeper Layers - This study explores layer-pruning for open-weight pretrained LLMs, finding that up to 50% of layers can be removed with minimal performance degradation in question-answering tasks. The pruning process involves identifying optimal layers to remove based on similarity and applying parameter-efficient finetuning methods. The findings suggest that layer pruning can reduce computational resources for finetuning and improve inference efficiency. The robustness of LLMs to layer deletion raises questions about the effectiveness of current pretraining methods and the role of shallow layers in storing knowledge.
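As a concrete picture of the selection step, here is a minimal sketch, assuming hidden states collected from a HuggingFace-style model with output_hidden_states=True. The angular-distance criterion follows the paper's description, but the function names are illustrative, and the paper heals the pruned model afterward with parameter-efficient finetuning such as QLoRA.

```python
# A minimal sketch of similarity-based layer pruning: find the contiguous
# block of n layers whose removal changes the residual stream the least.
# Function names are illustrative, not the paper's released code.
import torch
import torch.nn.functional as F

def angular_distance(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """Mean angular distance between hidden states entering and leaving a block."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)
    return torch.arccos(cos.clamp(-1.0, 1.0)).mean().item()

def find_prunable_block(hidden_states: list, n: int) -> int:
    """hidden_states[i] is the input to layer i (from output_hidden_states=True).
    Returns the start index of the n-layer block that is cheapest to drop."""
    num_layers = len(hidden_states) - 1
    scores = [
        angular_distance(hidden_states[start], hidden_states[start + n])
        for start in range(num_layers - n + 1)
    ]
    return min(range(len(scores)), key=scores.__getitem__)
```

For a 32-layer model, find_prunable_block(hidden_states, 16) would locate the half of the network whose deletion perturbs the representation least; the paper finds such blocks sit in the deeper layers, hence the title.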
SelfIE: Self-Interpretation of Large Language Model Embeddings - SelfIE, an interpretation method, surpasses prior techniques by comprehending open-world concepts without training, enabling insight into large language models (LLMs) like LLaMA. For example, SelfIE discovered that LLaMA's one-word agreement in a trolley problem could reflect a preference for majority opinions. It revealed that prompt injections with exclamation marks cause LLaMA to sense urgency and lead to potentially harmful responses. In the realm of physics, SelfIE showed LLaMA's layer-by-layer process of associating "syrup" with "viscosity." It also traced LLaMA's "hallucination" to mistaken associations — a fictional name with unrelated real-world entities. Lastly, SelfIE uncovered LLaMA's ability to deduce mental states in complex social scenarios, demonstrating LLaMA's sophisticated interpretative abilities.
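The mechanism is easy to sketch: capture a hidden state from one forward pass, splice it into the input embeddings of an "interpretation prompt," and let the model describe it in words. The toy version below assumes a HuggingFace causal LM; the prompt, probe layer, and splice position are illustrative choices, not the authors' released code.

```python
# Toy sketch of self-interpretation: the model narrates one of its own
# hidden states. Model choice, probe layer, and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # assumed model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# 1. Run the original prompt and capture a mid-layer hidden state.
src = tok("Pouring syrup is slow because", return_tensors="pt")
hidden = model(**src, output_hidden_states=True).hidden_states
probe = hidden[16][0, -1]  # layer 16, last token: an arbitrary probe point

# 2. Splice the probe into the embeddings of an interpretation prompt,
#    standing in for the concept to be described.
embed = model.get_input_embeddings()
left = embed(tok("The concept of", return_tensors="pt").input_ids)
right = embed(tok(" means, in one phrase:", return_tensors="pt",
                  add_special_tokens=False).input_ids)
spliced = torch.cat([left, probe.view(1, 1, -1).to(left.dtype), right], dim=1)

# 3. Generate: the model describes what the embedding encodes.
out = model.generate(inputs_embeds=spliced, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```

Running this at different layers is how the syrup-to-viscosity example above can be traced step by step through the network.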
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs - The paper discusses an advancement in on-device deployment of Large Language Models (LLMs), which are crucial for powering applications like intelligent assistants and translation on mobile devices. Conventional deployment is plagued by slow inference speeds, degrading the user experience. The proposed solution includes four optimization techniques aimed at accelerating inference: a symbolic expression-based approach for dynamic-shape models, operator optimizations, a novel FP4 quantization method (M0E4), and a sub-tensor method that avoids copying the KV cache after inference. These are integrated into the Transformer-Lite mobile inference engine, tuned for Qualcomm and MTK processors. Evaluations on models from 2B to 14B parameters showed significant speed improvements, boasting 10x faster prefill speeds and 2-3x faster decoding speeds than existing CPU and GPU solutions.
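Of the four techniques, quantization is the simplest to illustrate. The sketch below is a generic group-wise 4-bit weight quantizer of the kind mobile engines use to cut weight size and memory traffic; it is not the paper's M0E4 format, whose details the summary doesn't spell out.

```python
# Generic group-wise 4-bit weight quantization (illustrative; not M0E4).
import numpy as np

def quantize_4bit(w: np.ndarray, group: int = 64):
    """Quantize weights to int4 with one float scale per group of values."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map to int4 range -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096 * 64).astype(np.float32)
q, s = quantize_4bit(w)
print("max abs error:", np.abs(dequantize_4bit(q, s) - w).max())
```

In a real engine the int4 values would be packed two per byte and dequantized on the fly inside the GPU kernel, trading a little arithmetic for a 4x cut in weight memory relative to FP16.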
Introducing Jamba - Mamba is a Structured State Space model (SSM) architecture developed to overcome the computational constraints of the traditional Transformer, but it has limitations of its own. Jamba is introduced as a hybrid architecture that interleaves Mamba layers with Transformer attention layers, aiming to balance the strengths of both and enhance performance on tasks where each architecture typically excels.
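A rough way to picture the hybrid is a stack of linear-time recurrent blocks with an attention layer spliced in every few layers. In the sketch below the GRU is only a stand-in for a Mamba/SSM block, and the layer ratio is an assumption; Jamba's actual interleaving and block internals are AI21's design, not shown here.

```python
# Illustrative hybrid stack: recurrent blocks with periodic attention.
# The GRU is a stand-in for a Mamba/SSM block; ratios are assumptions.
import torch
import torch.nn as nn

class HybridStack(nn.Module):
    def __init__(self, dim: int, n_layers: int, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                # Attention: exact token-to-token lookup, quadratic in length.
                self.layers.append(nn.MultiheadAttention(dim, num_heads=8, batch_first=True))
            else:
                # Recurrent stand-in: linear in sequence length, like an SSM.
                self.layers.append(nn.GRU(dim, dim, batch_first=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                out, _ = layer(x, x, x)
            else:
                out, _ = layer(x)
            x = x + out  # residual connection
        return x

y = HybridStack(dim=256, n_layers=8)(torch.randn(2, 128, 256))
```

The payoff of such a mix is long-context processing at near-linear cost, with attention retained where precise retrieval matters.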
Announcing Grok-1.5 - Grok-1.5 is an upgraded AI model announced on March 28, 2024, featuring improved reasoning and problem-solving capabilities, particularly in coding and mathematics. It boasts impressive benchmark scores, outperforming its predecessor, Grok-1, on the MATH, GSM8K, and HumanEval tests. Its ability to process up to 128,000 tokens of context is a significant enhancement, offering greatly expanded memory capacity compared to earlier versions. The model has demonstrated excellence at retrieving information from extensive contexts while retaining its instruction-following ability as inputs grow. Grok-1.5 runs on a sophisticated distributed training framework designed for large GPU clusters, emphasizing reliability and minimizing downtime through automatic node management and optimized checkpointing. The model is scheduled for early access on the 𝕏 platform, with anticipation building for its future features and broader user rollout.
Awesome New Launches
Navigating the Challenges and Opportunities of Synthetic Voices - OpenAI shares insights from a preview of Voice Engine, a model that synthesizes natural speech from a 15-second audio sample. It has been used by trusted partners in applications such as reading assistance, translation that preserves the speaker's accent, supporting non-verbal individuals, and aiding those with speech impairments. However, the potential for misuse is taken seriously, especially regarding impersonation risks. Usage policies demand consent, and safety measures like watermarking are in place. OpenAI is engaging with stakeholders on the responsible use of synthetic voices, highlighting the need for public education, stronger verification systems, and a rethink of voice-based authentication in light of potential AI deception.
Midjourney gets personalized models, v7 before summer, video by end of year - Midjourney is developing its AI image generator, with version 7 set to launch within the next three months. The most notable upcoming feature is the personalization of AI models, allowing users to influence the image generation process based on individual preferences captured through ratings. This personalization is expected to adjust the inherent bias in the models, affecting how the AI interprets unspecified details in prompts. In addition to improved image quality and aesthetics, Midjourney is exploring updates to improve body and hand depictions in the current version and plans to introduce 3D and video models later in the year.