LLaMA 65B on the RTX 4090: notes compiled from r/LocalLLaMA threads. When you post numbers, use llama-bench for the results; it solves multiple problems at once, since it standardizes prompt length (which has a big effect on performance) and reports prompt processing and generation separately. If you plan to add a second card later, it is best to opt for a thinner 4090, like PNY's 3-slot model, so it does not block the neighbouring PCIe slot.

Background: Meta AI open-sourced LLaMA as a family of models. The 13B version outperforms OPT and GPT-3 175B on most benchmarks, and the 65B version (trained on 1.4T tokens) is competitive with Chinchilla and PaLM-540B; the top-of-the-line LLaMA-65B goes toe-to-toe with similar offerings from DeepMind, Google, and OpenAI. Early forks of the LLaMA code ran LLaMA-13B comfortably within 24 GiB of memory, relying almost entirely on the bitsandbytes and LLM.int8() work of Tim Dettmers. Few people did a ton of work with the llama-1 65B long-context models; at that scale they are painfully slow whether you run them purely in llama.cpp on the CPU or keep swapping layers in and out of the GPU.

Memory bandwidth, not compute, sets generation speed. An RTX 3090 has roughly 930 GB/s of VRAM bandwidth; the RTX 4090 uses GDDR6X at 1,008 GB/s, whereas the RTX 4500 Ada uses GDDR6 at 432 GB/s. The 4090's CUDA compute rating is really high too, but single-stream generation mostly waits on memory, which is why 33B models run at about the same speed on a single 4090 as on dual 4090s, and 65B models behave the same way once they actually fit.

Recurring multi-GPU questions: what GPU split to use for an RTX 4090 24GB (GPU 0) plus an RTX A6000 48GB (GPU 1), and how much context that leaves room for; whether two RTX 3090s with NVLink beat a single RTX 4090; and whether splitting Mixtral between a desktop 4090 and an eGPU 4090 through accelerate/transformers is worth it (with llama.cpp the split helps more, and 2x 3090 with NVLink essentially gives you a 48GB pool for weights). One user's laptop (Scar 18, i9-13980HX, RTX 4090 16GB, 64GB RAM) is fine for smaller models, but 65B unquantized is a bit too much even for the popular dual 3090 or dual 4090 configurations, and it won't fit in 8-bit mode either; you end up overflowing to CPU/system memory or disk, both of which slow you down badly. The gpt4-x-alpaca 30B 4-bit is just a little too large at about 24.4GB for a single 24GB card, so the next size down is Vicuna 13B, though running gpt4-x 30B on CPU with llama.cpp wasn't that bad either. Vicuna already ran fast on the RTX A4000 at work, which is what pushed one poster toward buying a 4090 to study LLMs locally. The A6000 Ada uses AD102 (an even better bin than the one on the RTX 4090), so its performance will be great, but if you are planning on training, even a couple of A100s are not going to cut it; if you need something NOW, just rent.

Assorted speed reports: the 65B model will run with a good chunk of a 128GB system's CPU RAM consumed before eventually going out of memory; llama.cpp with partial GPU offload gives roughly 3 t/s on 65B, and another setup reports around 10 t/s; 13B runs with 16k context, 34B runs in full GPU mode with 4k context, and 70B still needs to be offloaded to the CPU. One comparison covered the 7900 XT and 7900 XTX against an RTX 3090 and RTX 4090 (more on AMD below). One commenter claims an M2 Ultra runs 65B at about 5 t/s while their dual 4090 setup managed only 1-2 t/s, which made the M2 Ultra the clear leader for them; the dual-GPU numbers later in this thread are far better, so treat single data points with caution.
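To make numbers like these comparable, a llama-bench run with fixed prompt and generation lengths is the easiest common ground. A minimal sketch, assuming a local llama.cpp build and a quantized 65B GGUF at the path shown (the path and parameter values are placeholders):

```bash
#!/usr/bin/env bash
# Placeholder model path; any GGUF quant works the same way.
MODEL=models/llama-65b.Q4_K_M.gguf

# -ngl 99 offloads every layer to the GPU(s); lower it if the model overflows VRAM.
# -p 512 times prompt processing on a 512-token prompt, -n 128 times generation of
# 128 tokens, and -r 3 repeats each test and averages the result.
./llama-bench -m "$MODEL" -ngl 99 -p 512 -n 128 -r 3
```

llama-bench prints separate pp (prompt processing) and tg (token generation) rows in tokens per second, which is exactly the distinction many of the conflicting reports above are blurring.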
Edit: as other commenters have mentioned, if you are responsible for the inference code you can split the inference procedure across multiple GPUs; multiple, or even distributed, GPUs allow faster aggregate compute as long as the interconnect is not the bottleneck. On the RTX A4000 question: I don't get it, the A4000 is roughly 30% of a RTX 4090's spec at 70% of the cost, though if you want less power consumption and physical space I understand the choice. Multi-GPU with GPTQ-for-LLaMA is REALLY slow, like painfully slow: a 65B run across 2x 3090 using llama_inference_offload cannot do 4K context without waiting minutes for an answer (the speeds quoted were measured at 2048 context). The con of the llama.cpp route is that you can only run GGUF files. People keep hearing great things from reputable Discord users about WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GPTQ (the model names keep getting bigger and bigger), and one llama.cpp-on-4090 run used a wizard model file. My main desktop is an RTX 4090 Windows box, so I run phind-codellama on it most of the time; others just ask whether there is a demo somewhere, since there is no way they can run a 70B model locally.

Rough VRAM guidance collected from the thread (4-bit quantization unless noted):
  LLaMA 13B / Llama 2 13B: a 12GB-class card, such as an AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, or A2000 12GB
  LLaMA 33B / Llama 2 34B: ~20GB of VRAM (RTX 3080 20GB, A4500, A5000, 3090, 4090, RTX 6000, Tesla V100), with ~32GB of system RAM to load
  LLaMA 65B / Llama 2 70B, unquantized or 8-bit: ~80GB of VRAM (A100 80GB), with ~128GB of system RAM to load
An RTX 3060 handles LLaMA 13B 4-bit, but to run the 65B model in 4-bit without offloading to the CPU you have to scale to something like two 4090s or a bigger professional GPU. For the massive Llama 3.1 405B you are looking at a staggering 232GB of VRAM, which requires ten RTX 3090s or powerful data-center GPUs like A100s or H100s.

Practical buying notes: various vendors will only fit one RTX 4090 in their desktops because the card is so physically big that it blocks the other PCIe slot, hence the interest in thinner designs. 2x Tesla P40 costs about $375; if you want faster inference, 2x RTX 3090 runs around $1,199; and if you are at the serious inferencing/training level, 48GB RTX A6000s (Ampere) are available new for about $4K, so two of them at $8K easily fit the biggest quantizations and let you fine-tune. If you run a model partially offloaded to the CPU, performance is essentially the same whether the GPU is a Tesla P40 or an RTX 4090, because you are bottlenecked by CPU memory speed. Nvidia has meanwhile announced the 4090D. Don't pair a 4090 with 3060s for a single model: the 3060's 360GB/s of memory bandwidth is low next to the 4090's roughly 1,000GB/s, so you would just be dragging the 4090 down (one follow-up asks whether the same applies to loading Llama 3 70B on a 4090+3090 pair versus 4090+4090). One poster built a small local LLM server with 2x RTX 3060 12GB instead, running a 7B 4-bit model (TheBloke's Vicuna 1.1 4-bit) on the first card and Stable Diffusion on the second. NVLinked 3090s are not presented to your software as one unified pool of VRAM, but you can still run larger models across them. A 13700K + 4090 + 64GB RAM box runs the 13B 6-bit models comfortably, and a 70B GGML model loads with 42 layers offloaded onto the GPU through oobabooga. On an RTX 3090, setting LLAMA_CUDA_DMMV_X=64 improved llama.cpp throughput for one user, and an RTX 4090 comparison with hardware-accelerated GPU scheduling toggled showed pairs of 6-7 versus 30 tokens/s and 4-6 versus 40+ tokens/s, so the Windows setting is worth checking; that system was a 4090, a 13900K, and 64GB of DDR5 at 6000 MT/s. CPU choice matters less than you might think: one test ran an RTX 4090 on a Core i9-9900K and on a 12900K and got similar LLaMA numbers. And a favourite test prompt that only the original GPT-4 ever got right was solved first try by a Q4_K_M quant running on an RTX 4090; no model after that ever worked, until Qwen-Coder-32B.
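For the recurring question of how to split a model across unequal cards (say a 24GB 4090 as GPU 0 and a 48GB A6000 as GPU 1), llama.cpp takes an explicit proportion list. A sketch with placeholder paths and values, assuming a recent llama.cpp build (older builds name the binary ./main instead of ./llama-cli):

```bash
#!/usr/bin/env bash
# Placeholder model path.
MODEL=models/llama-65b.Q4_K_M.gguf

# --tensor-split 24,48 spreads layers in proportion to each card's VRAM
# (GPU 0 = RTX 4090 24GB, GPU 1 = RTX A6000 48GB); --main-gpu 0 keeps the
# small scratch buffers on the 4090. -ngl 99 keeps all layers on the GPUs,
# and -c 4096 requests a 4K context, which the KV cache also has to fit.
./llama-cli -m "$MODEL" -ngl 99 --tensor-split 24,48 --main-gpu 0 -c 4096 \
  -p "Give three tips for splitting a large model across two GPUs." -n 256
```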
Small models, meaning less than about 12B parameters at FP16, fit natively onto the 4090; examples include Llama 3.1 8B, Mistral 7B, Phi-3 Mini (3B), and Gemma-2 9B. What you can fit into a 4090's VRAM largely decides which models are practical, and the perennial question of the best local LLM you can run is really a question about that budget. A 30B/33B model quantized to 4 bits is something like 20GB, so it fits entirely in the 4090's 24GB. A 65B or 70B does not: using one 4090, one user runs a 70B at roughly 2.x bpw with ease, around 25 t/s after the second generation, but most people find a 4090 doesn't really get you much in the way of 70B, because quants that small make the model too stupid to bother with unless you go higher and accept about 1 t/s by dipping into CPU offload. If you have a single 3090 or 4090, chances are you have tried a 2.65bpw quant of a 70B only to be disappointed by how unstable it tends to be due to its high perplexity.
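A rough way to sanity-check whether a given model and quant will fit before downloading tens of gigabytes of weights. This is a rule of thumb, not a measurement: the bits-per-weight figures and the 20% overhead factor for KV cache and buffers are assumptions, and long contexts push the real number higher.

```bash
#!/usr/bin/env bash
# Estimate VRAM in GB: parameters (billions) * bits per weight / 8, plus ~20%
# for KV cache and runtime buffers. Treat the output as a ballpark only.
estimate_vram_gb() {
  local params_b=$1 bpw=$2
  awk -v p="$params_b" -v b="$bpw" 'BEGIN { printf "%.1f GB\n", p * b / 8 * 1.2 }'
}

estimate_vram_gb 13 4.5   # 13B at ~Q4_K_M: ~8.8 GB, fits a 12 GB card
estimate_vram_gb 33 4.5   # 33B at ~Q4_K_M: ~22 GB, just fits a 24 GB 4090
estimate_vram_gb 65 4.5   # 65B at ~Q4_K_M: ~44 GB, needs two 24 GB cards
```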
A related capacity question is how many simultaneous chat users you can serve with any of those LLMs on a 24GB 4090; one thread works through it for Meta-Llama-3.1-8B-Instruct-Q8_0.gguf and gemma-2-27b-it-Q5_K_M.gguf (its table lists figures of 33 / 20000 for the former and 47 / 10000 for the latter). Pure CPU and partial-offload numbers look very different from full-GPU ones: expect around 0.8 tokens/sec with something like LLaMA-65B on the CPU, a little faster with the quantized version, and, thanks to a patch provided by emvw7yf, almost 10 tokens per second at 1,500 tokens of context on the same setup. On a single 4090 that cannot hold the model, ExLlama collapses to about 0.1 t/s, much slower than llama.cpp with GPU offload at about 3 t/s, so your best option for models that do not fit is offloading with llama.cpp (or koboldcpp, which some people test with instead). The only reason to offload at all is that the GPU does not have enough memory to load the LLM (a llama-65b 4-bit quant needs roughly 40GB, for example), and the more layers you can keep on the GPU, the faster it goes. Asked whether the 2-bit 65B (q2_K) beats llama-30b 4-bit, the answer is no, it performs worse; people benchmark GGML files like alpaca-lora-65B.ggmlv3.q5_1.bin and llama-65b.ggmlv3.q2_K.bin for exactly this kind of comparison, and even a GTX 1070 owner can contribute CPU-side numbers while asking for results from people with other GPUs. Representative rigs from the thread: Ubuntu 22.04 with an RTX 4090, a Ryzen 7950X power-throttled to 65W in the BIOS, and 64GB of DDR5 at 5600 (6000 wouldn't run stably); a 7950X3D owner posted llama.cpp results as well; and one set of PyTorch numbers came from a single RTX 4090 (24GB) with a Xeon E5-2695 v3 (16 cores), 64GB of RAM, PyTorch 2.0, and CUDA 12.1. An off-the-shelf gaming PC with an RTX 3090 or 4090 plus two extra RTX 3090s installed, or hanging off the back on riser cables or in an eGPU enclosure, is a reasonable way to reach 65B/70B territory.
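When the model is bigger than VRAM, the practical knob is how many layers to offload. A sketch of a partial-offload run with llama.cpp; the file name, layer count, and thread count are placeholders to tune against your own card and CPU:

```bash
#!/usr/bin/env bash
# Placeholder model and values; raise -ngl until VRAM is nearly full, then back
# off one or two layers to leave room for the KV cache.
MODEL=models/llama-2-70b.Q4_K_M.gguf

# -ngl 42 keeps 42 layers on the 24 GB GPU and runs the rest on the CPU;
# -t 16 pins CPU threads for the non-offloaded layers; -c 4096 sets the context.
./llama-cli -m "$MODEL" -ngl 42 -c 4096 -t 16 \
  -p "Summarize the trade-offs of partial GPU offloading in three bullet points." -n 256
```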
There is also a chart making the rounds that showcases a range of benchmarks for GPU performance while running large language models like LLaMA and Llama-2 at various quantizations; standardizing on prompt length matters there because it has a big effect on the numbers. Now, about RTX 3090 vs RTX 4090 vs RTX A6000 vs RTX A6000 Ada, from someone who has tested most of them: the RTX 3090 is a little (1-3%) faster than the RTX A6000 as long as what you are doing fits in 24GB, and in one setup inference on the 4090 was only negligibly slower than a single 3090, negligibly in the practical sense, because 60 t/s versus 80 t/s makes no difference whatsoever in usability. The 4090 has no more VRAM than a 3090, but its tensor compute is far higher on the spec sheet, and because it is Ada it supports 8-bit floating point: with FP8 tensor cores you get about 0.66 PFLOPS from a single 4090, more FLOPS than the entire fastest supercomputer in the world in 2007. These factors make the RTX 4090 the stronger GPU for running the Llama-2 70B model with ExLlama, with more context length and faster speed than the RTX 3090. On paper 4x 4090 is superior to 2x A6000 because it delivers quadruple the FLOPS and 30% more memory bandwidth, although in practice large-ish models (Mixtral 8x7B in 8-bit mode, or Llama 70B in 4-bit mode) often run faster on a single RTX A6000 than split across 2x RTX 3090 or other consumer cards. On the AMD side, the Radeon RX 7900 XTX gives 80% of the speed of the GeForce RTX 4090 and 94% of the speed of the RTX 3090 Ti for Llama2-7B/13B, and it is about 40% cheaper than the 4090.

Concrete numbers: a single RTX 4090 FE hits about 40 tokens/s on a model that fits, though one user saw it drop to about 10 tokens/s when running two cards; others get 15-20 tokens/s on 65B across 2x 4090 (an ASUS TUF as GPU 0 and a Gigabyte Gaming OC as GPU 1), and ExLlama is the only backend that pegs both GPUs at 100% utilization, with 40 of the 80 layers on each card. 2x RTX 3090 or a single RTX A6000 gives 16-10 t/s depending on context size (up to 4096) with ExLlamaV2 under oobabooga (no noticeable difference from ExLlama v1, but v2 sounds cooler). A 5950X with 2x 3090s running at x8 and x4 on PCIe 3.0 and no NVLink still splits fine. One caveat: 65B-32g with act-order does not fit 4K context in 48GB of VRAM. After some tinkering one user finally got a version of LLaMA-65B-4bit working, the 30B runs on a 4090 in 4-bit mode and works well, and LLaMA-30B on an RTX 3090 is really amazing; ultimately the 33B models hit the sweet spot for a lot of people (they fit neatly in a 4090 and are great to chat with), whereas with the 65B models each conversation took around five minutes, which was just a drag. A used Gainward RTX 4090 Phantom was on offer for about 1,433 USD / 1,302 EUR with roughly 2.5 years of warranty left, which is the kind of deal people weighing all of this end up considering, whether they started out wanting to run GPT-J and FLAN-T5 at home without cloud compute or to explore image generation alongside text. Typical daily drivers on a single 4090: Llama-2 13B GPTQ, CodeLlama 33B GGUF, and Llama-2 70B GGML, plus hopes for a 65B LLaMA + Vicuna + uncensored Wizard combination (there is a 65B LLaMA Alpaca, but apparently no Vicuna at that size yet). And if you ask for the oobabooga launcher arguments that get a 65B running on a 3090, also ask what threads and GPU layer counts were set in the model settings, because those matter as much as the loader.
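For reference, here is the shape of such a launch command for text-generation-webui with the ExLlamaV2 loader. The model folder name and the per-GPU gigabyte split are placeholders, not recommendations; --gpu-split reserves the listed amounts per card, so leave headroom on whichever GPU also drives your monitor:

```bash
#!/usr/bin/env bash
# Run from the text-generation-webui directory. Folder name and numbers are
# examples only; check the webui's current flag names against its docs.
python server.py \
  --model TheBloke_llama-65B-GPTQ \
  --loader exllamav2 \
  --gpu-split 20,22 \
  --max_seq_len 4096
```

The same split idea applies to the 4090 + A6000 case: give the larger card the larger number and check that the KV cache for your chosen context still fits.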
The 7B and 13B models are fast enough even on middling hardware, and 30B models aren't too bad; 70B/65B models work with llama.cpp on 24GB of VRAM, but you only get 1-2 t/s. Fully loaded, one pipelined setup got around 1.8 t/s for a 65B 4-bit, with the activity bouncing between GPUs while the load on the P40 stayed low. Another machine runs at only 20-40% of the token rate of a 3080 or 4090, which for 60W of power consumption is excellent. If you are thinking of building a PC to run Llama 70B locally and are confused by the hardware options, the short version is that the RTX 4090 has 24GB of VRAM and the A6000 has 48GB, and bolting a slower card next to a 4090 does not expand the 4090, it just gives the weights somewhere else to live. One user gets 16-20 t/s on 65B split across a 4090 and an A6000 (Ampere), which is actually faster than running the entire model on the A6000 alone (13 t/s). GPT4-X-Alpaca-30B-4bit on an Intel 13900K with an RTX 4090 FE and 64GB of DDR5 was described as kind of like a lobotomized ChatGPT-4. The bigger models are also much better at understanding a character's hidden agenda and inner thoughts, and at keeping characters separated in a group chat; samples of Airoboros 65B with the Coherent Creativity preset were posted to show it.

Model news that kept coming up: FreeWilly1 is LLaMA-65B based and FreeWilly2 is Llama-2-70B based; Tim Dettmers released the Guanaco 7B, 13B, 33B, and 65B models, with QLoRA used to finetune more than 1,000 models and analyze instruction following and chatbot performance across 8 instruction datasets and multiple model types; the Alpaca work provides an instruct model of similar quality to text-davinci-003 that can run on a Raspberry Pi for research, with code that extends easily to the 13B, 30B, and 65B models; LLM360 has released K2 65B, a fully reproducible open-source LLM; a 3B-class Polish LLM was pretrained on a single RTX 4090 for about three months on Polish-only content; and galpaca-30B still gets mentioned. At home, fine-tuning is generally done with LoRA rather than full training. Workstation vendors lean the same way on hardware: AIME's G400 configurations pair 2x RTX A5000 24GB or 2x RTX 4090 24GB for the 30B class. Power is the easy lever in all of these dual-card builds: limiting the 4090s to 250W affects local LLM speed by less than 10%, so two of them do not demand an exotic power supply or case.
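Capping the power limit is a one-liner with nvidia-smi. A sketch: the 250W figure comes from the report above, the GPU indices are examples, and the limit resets at reboot unless you reapply it from a startup script:

```bash
#!/usr/bin/env bash
# Show the allowed power-limit range first, then cap both cards at 250 W.
sudo nvidia-smi -q -d POWER | grep -i "power limit"
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250
```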
I used TheBloke's Llama2-7B quants for benchmarking (Q4_0 GGUF, and GS128 no-act-order GPTQ), since every backend can load one or the other. With a 3090 or 4090 there is not much sense in fine-tuning 13B in 4-bit, though. Running 65B-class models at 4096 context with ExLlama (or ExLlama_HF) on a single RTX 4090, the responses still come after a few replies, just more slowly, and better multi-GPU support would benefit a lot of people, from those who would buy a dirt-cheap Tesla K80 to get 24GB of VRAM (the K80 is actually 2x 12GB GPUs) to those building bigger rigs; with 65B in 8-bit, one user only got 1-2 t/s even on 5x 3090s.

Bandwidth is again the story when people compare platforms. A high-end desktop with dual-channel DDR5-6400 only does about 100 GB/s, while an RTX 4090 has about 1,000 GB/s of bandwidth but only 24GB of memory, which is why Apple looks so well placed: for laptop LLM work with Llama 2, Llama 3.1, Mistral, or Yi, the usual recommendation is a MacBook Pro with the M2 Max chip and 38 GPU cores, one user bought a Mac Studio M2 Ultra partly to do 65B inference in llama.cpp, and overall Apple is genuinely impressive for inference, with more tokens per second on the M2 under GPU acceleration than expected (tested with GGML rather than GPTQ). AMD's CES 2025 announcement claims the new Ryzen AI Max+ 395 laptop chip runs a 70B q4 model twice as fast as a discrete 4090 desktop GPU, presumably because the 4090 cannot hold the model and has to offload. A 4090 laptop, by contrast, won't fit models as big as a desktop build and usually costs more. The compatibility questions follow the same pattern, like the one opened on 10 May 2023 by clevnumb asking whether a model will run on a 128GB RAM system (i9-13900K) with an RTX 4090, or an Alienware R15 owner with 32GB of DDR5, an i9, and an RTX 4090 asking what they can realistically run. Most people here don't need RTX 4090s.

Then there is cloud versus local. The usual rule of thumb is that unless you forecast your project running beyond a year, cloud is the winner, and it scales besides: 65B models run fine on the $0.75/hr rented setups, with inference faster than the 30B models people run on a local RTX 4090. One user hasn't run 65B enough to compare it with 30B precisely because they rent from RunPod and Vast.ai and are already spending too much money. The follow-up, whether you still need 64GB of RAM locally if you are borrowing 2x RTX 4090s in the cloud, gets a blunt answer for anything over 13B: obviously, yes. Whether you rent or buy, the cheapest way to settle a quant question is still to pull two of TheBloke's files and benchmark them side by side.
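A sketch of that workflow, assuming the huggingface_hub CLI is installed and that the repository and file names follow TheBloke's usual naming pattern (verify the exact file names on the model page; they are assumptions here):

```bash
#!/usr/bin/env bash
# Download two quantizations of the same model and benchmark them with
# identical settings so the numbers are directly comparable.
for q in Q4_0 Q5_K_M; do
  huggingface-cli download TheBloke/Llama-2-7B-GGUF "llama-2-7b.${q}.gguf" --local-dir models
  ./llama-bench -m "models/llama-2-7b.${q}.gguf" -ngl 99 -p 512 -n 128
done
```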
My advice: build a desktop with a 4090 (what I have currently) or 2x 3090 for more VRAM. The 4090 is actually a good value relative to what the current market offers, easily worth the $400 premium over the RTX 4080, which is itself worth the premium over the 4070, even if 4090s are not showing up used for $1,250 in every region; used RTX 30-series is the best price-to-performance overall, and the 3060 12GB is the usual budget recommendation, with one owner reckoning their card is not worth replacing unless work demands more performance and penciling in the next upgrade for 2027-2028 (the same calculus applies to whether a new 4090 owner should keep or sell an old 3080). For a 4090/3090 pair the biggest challenge was physically fitting the cards together; after going through three 3090s, including a blower model returned under CEX's UK policy, one builder settled on an EVGA FTW X3. For training larger LLMs like LLaMA 65B or BLOOM, a multi-GPU setup where each GPU has at least 40GB of VRAM is recommended, such as NVIDIA's A100 or its newer successors, and LLaMA 65B in 8-bit needs about 74GB of memory just to load. There is a reason llama.cpp keeps coming up: with a 4090 24GB you can run Llama 3 70B Instruct IQ2_S with 77 layers loaded on the GPU, reportedly using only about 22GB of VRAM so the machine stays usable for other work, and 65B is technically possible on a 4090 with 64GB of system RAM using GGML. One developer's proof of concept for GPU-accelerated token generation in llama.cpp has been tested on an RTX 4090 and reportedly works on the 3090, and another's GPU-only implementation runs Llama-30B at 32 tokens/s. Hobbyists, one of them with an EE degree and decades of programming experience, enjoy tinkering with oobabooga's textgen codebase precisely because all of these knobs (loaders, GPU splits, layer counts, thread settings) are exposed, and when someone says to launch a script they usually mean llama.cpp or koboldcpp. At the slow end, a 65B on 16GB of VRAM plus an i9 comes out around 700 ms per token, and GPTQ-for-LLaMA on the same hardware is slower still. With 2x RTX 3090 an int4 65B model runs, but it shows some CPU bottlenecking: when both GPUs work at once, utilization sits around only 30%.
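To tell whether a run like that is GPU-bound or starved by the CPU and PCIe side, watching utilization and power draw during generation is usually enough. A minimal sketch using nvidia-smi's query mode (the field list and polling interval are just examples):

```bash
#!/usr/bin/env bash
# Poll both GPUs once per second while the model is generating. Utilization that
# sits far below 100% on both cards while tokens are streaming points at a CPU,
# RAM-bandwidth, or PCIe bottleneck rather than the GPUs themselves.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,power.draw --format=csv -l 1
```

Ctrl-C stops the loop; logging the same query to a file next to llama-bench output makes before/after comparisons in threads like this much easier to trust.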