Shamane Siri, PhD on LinkedIn: Amazing summarization of PEFT techniques. (2024)

Shamane Siri, PhD

Head of Applied NLP Research @ Arcee.ai

11mo

Amazing summarization of PEFT techniques.

More Relevant Posts

Anand Chembeti

♡Tech - Big Data, AI, Healthcare| ☆Research/Consulting- Business/Market|2M+ Views

11mo
Introducing GLoRA, an integration of several Parameter-Efficient Fine-Tuning (PEFT) techniques into a single unified formulation. With GLoRA, you can draw on multiple PEFT approaches within one framework.

GLoRA, short for Generalized LoRA, is a technique for fine-tuning pre-trained models that combines the strengths of several different PEFT methods. Its authors report strong results on a range of benchmarks while training significantly fewer parameters than traditional full fine-tuning.

As described in the GLoRA paper, the method keeps the pre-trained weights frozen and learns small support tensors that scale and shift both the weights and the intermediate activations. A layer-wise structure search decides which of these components each layer uses, and the learned terms can be re-parameterized into the original weights after training, so inference cost does not increase.

This makes GLoRA a promising technique for adapting pre-trained models to new and challenging tasks.

Gunjan Narulkar

ML Engineering @ Google

11mo
Amazing summary, Prithivi! I wonder where (and if) knowledge distillation and model quantization find a place in this space, and how they could be connected with these pieces. This is why getting LLMs to work AND generate ROI is so complex. Looking forward to reading more on this.

Rushikesh Meharwade

VP of Data Science at Motilal Oswal | Python | Machine Learning | MLOps | LLMs | Data Engineering | AWS Solution Architect

8mo
Nice summarization of different PEFT techniques.

Prithivi Da

22M+ Model ↓ in 🤗 | Cited in NeurIPS, ICLR, ACL | 3K+ ⭐️ GitHub | Sharing wisdom from a 2 decade experience in creative problem solving.

11mo
The PEFT (Parameter-Efficient Fine-Tuning) space is scarily busy. Here is a mental model.

Choosing a PEFT method is simply a matter of matching it with your objectives.

→ Prompt Tuning:
What: Prompt Tuning learns a small set of continuous, trainable embeddings (a "soft prompt") that are prepended to the input. The pre-trained LLM's weights stay frozen; only the soft prompt is trained, and at inference time the learned prompt for the target task is attached to each input.
When to use: Prompt Tuning is a good choice when you have a large pre-trained LLM and want to adapt it to multiple different downstream tasks with minimal computational resources. Each task needs only its own small prompt, so many tasks can share one frozen model at inference time.

→ LoRA:
What: LoRA (Low-Rank Adaptation) freezes the pre-trained weights and adds a trainable low-rank update, the product of two small matrices, to selected weight matrices, most commonly the attention projections.
When to use: LoRA is a good choice when you want to fine-tune a pre-trained LLM for a specific downstream task, particularly one that benefits from adapting the attention layers. It is also useful when you have limited computational resources and want to reduce the number of trainable parameters.

→ Adapters:
What: Adapters are tiny neural-network modules added to a pre-trained LLM, typically between the pre-trained layers, to adapt the model to new downstream tasks. During fine-tuning, only the adapter weights are learned, while the pre-trained model's parameters remain fixed.
When to use: When you need to fine-tune for multiple downstream tasks on the same pre-trained model. Adapters are also flexible and can be plugged into different parts of the pre-trained model without major modifications.

→ Prefix Tuning:
What: Prefix Tuning adds a small set of trainable prefix vectors to the input of the pre-trained LLM (and, in the original method, to the hidden states of every layer) while the model's own weights stay frozen. The prefix steers the representations the model computes toward the downstream task.
When to use: When you have limited computational resources and want to adapt a pre-trained LLM to a specific downstream task by modifying the representations it computes for that task.

Papers:
• "Adapters": https://lnkd.in/epXRCzRN
• "LoRA": https://lnkd.in/eFGq3yZW
• "Prefix-Tuning": https://lnkd.in/eJ9ixFpk
• "Prompt Tuning": https://lnkd.in/ezB5zM8Q

Repos:
• LoRA: https://lnkd.in/exAMvMfG
• HuggingFace PEFT: https://lnkd.in/e7b-uzMN
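
To make the LoRA idea above concrete, here is a minimal sketch in plain Python. It assumes the usual formulation from the LoRA paper, W = W0 + (alpha / r) * B * A, where W0 is frozen and only the small matrices B and A are trained. The function names (`matmul`, `lora_effective_weight`, `trainable_params`) and the toy numbers are illustrative assumptions, not part of any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, lora_b, lora_a, alpha, rank):
    """Return W0 + (alpha / rank) * B @ A.

    w0 is the frozen pre-trained weight (d_out x d_in); lora_b (d_out x rank)
    and lora_a (rank x d_in) are the only trainable parts.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Compare trainable parameter counts: full fine-tuning vs. a LoRA update."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}


# Toy example (hypothetical values): a 2x2 frozen weight with a rank-1 update.
w0 = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]
effective = lora_effective_weight(w0, lora_b, lora_a, alpha=2, rank=1)

# Parameter savings for one hypothetical 4096 x 4096 projection at rank 8.
counts = trainable_params(4096, 4096, 8)
```

For the hypothetical 4096 x 4096 layer, the rank-8 update trains 65,536 values instead of about 16.8 million, which is where the "fewer trainable parameters" benefit comes from.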

Parwaz Dalvi

Engineering & Technology Leader - Data, AI-ML & Cloud | Enterprise Data Architecture | Digitization | Consulting | Business Transformation | Digital Evangelization | Speaker | Mentor | SME

11mo Edited
Generative AI: The "What and When" of 1. Prompt Tuning, 2. Low-Rank Adaptation (LoRA), 3. Adapters, and 4. Prefix Tuning

Multiplatform.AI

1,015 followers

3w
Sparse-Matrix Factorization Techniques: Enhancing Efficiency in CE Score Approximation
https://multiplatform.ai
#AI #artificialintelligence #CEmodels #computation #computationaloverhead #Efficiency #kNNsearchmethod #latentrepresentations #llm #machinelearning #similarityevaluation #Software #sparsematrixfactorization

Sakshi Zadi

Graduate

8mo
Recently learned about #Feature Engineering for Machine Learning.

Feature engineering is the process of using domain knowledge to extract features from raw data; these features can be used to improve the performance of machine learning algorithms. Feature transformation is a key step in feature engineering, in which we apply a mathematical formula to a particular column (feature) and transform its values before further analysis or modelling.

Different types of feature transformation:
- Missing value imputation/treatment
- Handling categorical values
- Outlier detection
- Feature scaling

* Imputation: The process of handling missing values, one of the most common problems when preparing data for machine learning.
* Handling categorical values: The process of converting categorical data into a numerical format that machine learning algorithms can work with, also known as encoding. The most common methods are:
  . One-hot encoding
  . Label encoding
* Outlier detection: The process of identifying data points that differ markedly from the rest of the dataset and can cause issues in data analysis.
* Feature scaling: A technique that brings features measured on different scales into a comparable range, which many models need in order to train well. It involves two main methods:
  1. Standardization (Z-score)
  2. Normalization
     > Min-max scaling
     > Mean normalization
     > Max absolute scaling
     > Robust scaling

1. Standardization: The difference between each value and the mean, divided by the standard deviation (sigma): z = (x - mean) / sigma.
2. Normalization: Similar to standardization, except that the result is divided by the range of the data. In mean normalization, the difference of each value from the mean is divided by the difference between the maximum and minimum values in the dataset: x' = (x - mean) / (max - min). In min-max scaling, the minimum is subtracted instead of the mean: x' = (x - min) / (max - min).
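
The scaling formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single numeric column and the population standard deviation; the function names and the `ages` column are made up for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling: map the column onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Mean normalization: subtract the mean, divide by (max - min)."""
    mu, lo, hi = mean(values), min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    return [(v - mu) / (hi - lo) for v in values]


# Hypothetical feature column.
ages = [20, 30, 40, 50, 60]
z_scores = standardize(ages)
scaled = min_max_scale(ages)
centered = mean_normalize(ages)
```

After standardization the column has mean 0 and standard deviation 1, min-max scaling places every value between 0 and 1, and mean normalization centers the values around 0 within a span of 1.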

Prashant Singh

Data Analyst

10mo
Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent are three variations of gradient descent, which is an optimization algorithm used to train machine learning models. They differ in how they use training data to update model parameters during the optimization process. Here's a breakdown of each:

Batch Gradient Descent:
1. In batch gradient descent, the entire training dataset is used to compute the gradient of the cost function with respect to the model parameters in a single iteration.
2. The gradients are calculated by considering all the training examples at once, which can be computationally expensive for large datasets.
3. The model parameters are then updated based on the calculated gradients.

Stochastic Gradient Descent (SGD):
1. In stochastic gradient descent, a single training example is randomly selected from the dataset for each iteration.
2. The gradient of the cost function is computed using only this one example, making each iteration faster.
3. Since the updates are based on noisy estimates of the gradient, the optimization process can be noisy and exhibit more frequent fluctuations.
4. Despite the noise, stochastic gradient descent can sometimes converge faster due to the frequent updates.

Mini-Batch Gradient Descent:
1. Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent.
2. The training dataset is divided into small batches, typically with sizes ranging from tens to hundreds of examples.
3. In each iteration, a mini-batch is randomly sampled from the dataset, and the gradient is computed based on the examples in the mini-batch.
4. The model parameters are then updated using the averaged gradients from the mini-batch.
5. Mini-batch gradient descent combines the advantages of both batch and stochastic gradient descent. It leverages the computational efficiency of batch methods and the faster convergence of stochastic methods.

In summary, the key differences are the amount of data used for gradient computation in each iteration and the frequency of parameter updates. Batch gradient descent uses the entire dataset per iteration, stochastic gradient descent uses one example at a time, and mini-batch gradient descent uses small batches of data. The choice between these methods depends on factors like dataset size, available computational resources, and convergence speed preferences.
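
The three variants above differ only in how many examples feed each update, so one function with a `batch_size` parameter can sketch all of them. This is a minimal illustration, assuming a one-feature linear model trained with mean squared error; the function name, learning rate, epoch count, and toy data are assumptions chosen for the example.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.2, epochs=1000, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error.

    batch_size == len(xs) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i]
                grad_b += 2 * error
            # Update with the gradient averaged over the current batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Toy, noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(11)]
ys = [2 * x + 1 for x in xs]

batch_fit = gradient_descent(xs, ys, batch_size=len(xs))
sgd_fit = gradient_descent(xs, ys, batch_size=1)
mini_fit = gradient_descent(xs, ys, batch_size=4)
```

On this toy data all three settings recover roughly w = 2 and b = 1; they differ in how many parameter updates happen per pass over the data (1, 11, and 3 respectively).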

2,121 followers

2mo
Categorical variables play an important role in many different data sets. Most machine learning modelling algorithms, however, require numerical inputs, so we must find a way to "encode" the different categories.

Generally speaking, categorical variables can be either:
- Ordinal data: data that has a distinct order (e.g., level of education)
- Nominal data: data that does not have any intrinsic ordering or hierarchy (e.g., type of additive)

There are a variety of different ways to handle categorical variables, each with its own advantages and disadvantages. Some of the more common ones are:

Label encoding - assigns a unique integer to each category. This method is a simple (and obvious) choice for ordinal categorical variables where the order matters. It is not suitable for nominal variables, however, as it introduces a false ordering of the categories and may mislead a model into learning incorrect relationships.

One-hot encoding - converts each categorical value into a binary vector where each element represents the presence or absence of a particular category. One-hot encoding is straightforward and easy to implement, but it can lead to high-dimensional feature spaces, especially with variables containing many unique categories (high cardinality), which may introduce sparsity and increase computational complexity.

Target encoding - replaces categorical values with the mean of the target variable for each category. This method can capture the relationship between the categorical variable and the target variable, but it is prone to overfitting, especially with small or imbalanced datasets.

Choosing the appropriate method depends on various factors such as the nature of the categorical variables, the size of the dataset, the complexity of the model, and the desired interpretability and performance of the model.

In a future post we will take a closer look at target encoding and its advantages and disadvantages.

#Categoricalvariables #variables #machinelearning #algorithms #encoding #data #dataset #complexity #mathematicalmodeling
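
The three encodings described above can be sketched in plain Python. This is a minimal illustration; the function names and the small `education`, `additive`, and `target` columns are hypothetical, and the target encoder here uses a plain per-category mean with no smoothing or cross-validation, which is exactly the form that is prone to overfitting.

```python
from collections import defaultdict


def label_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """Return the sorted category list and one binary vector per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def target_encode(values, targets):
    """Replace each category with the mean target observed for that category."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        totals[v] += t
        counts[v] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    return [means[v] for v in values]


# Hypothetical ordinal column: level of education.
education = ["high school", "bachelor", "master", "bachelor"]
education_codes = label_encode(education, ["high school", "bachelor", "master"])

# Hypothetical nominal column (type of additive) and a numeric target.
additive = ["salt", "sugar", "salt", "spice"]
target = [1.0, 0.0, 0.0, 1.0]
categories, additive_one_hot = one_hot_encode(additive)
additive_target = target_encode(additive, target)
```

Note that the one-hot result has one column per distinct category, so a high-cardinality variable widens the feature matrix quickly, while the target-encoded column stays a single number per row.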