Email:      Pass:    Pass?
Advertisements


Free Excel Forum

How Do I Calculate Outliers In Excel?

Forum Register
Search Excel Forum Posts, Tutorials, Macros, Tips, and More

Is there a way for excel to automatically determine outliers in a given data
set and exclude them from future calculations?



Similar Excel Video Tutorials

Helpful Excel Macros

Calculate the Future Value (FV) of Compound Interest in Excel - UDF Macro
- This UDF (user defined function) calculates the Future Value of Compound Interest in Excel. The mathematical formula th
Filter Data to Show the Bottom X Percent of the Data Set in Excel - AutoFilter
- This free Excel macro filters data to display the bottom X percent of the data set in Excel. This is a great macro to u
Filter Data in Excel to Show Only the Top X Percent of that Data Set - AutoFilter
- This Excel macro filters a set of data in Excel and displays only the top X percent of that data. This means the highes
Complete Guide to Printing in Excel Macros - PrintOut Method in Excel
- This free Excel macro illustrates all of the possible parameters and arguments that you can include in the PrintOut Meth
Filter Data to Display the Results that Begin With Specified Text or Words in Excel - AutoFilter
- This Excel macro automatically filters a set of data based on the words or text that are contained in the beginning of t

Similar Topics







hi,

i have a scatter plot for which i want to insert a best fit line. however, there are some outliers within the data. i have manually identified these outliers and coloured them differently from the non-outliers, but what i would like to do now is to insert a best-fit straight line for the remaining good data, while still keeping the outliers on the scatter plot. is there a way to do that?

i found this post:
http://thedailyreviewer.com/office/v...iers-110140914
but do not understand how to "fit the trend line based on the new set of points, but graph the old set".

thanks for any help.


I've been racking my brain over this all weekend. Can you help me determine the best minimum time excluding outliers?

For example:

Dave 15
Jim 10
Mike 12
John 5
Jill 12
Kim 13
Joey 21
Ralph 6
Sue 20

I would consider 5, 6, 20, and 21 outliers. I need to remove them to show the minimum (or best time) which in this case would be 10.

Thanks,

Chad


does anyone know if I can get outliers from a histogram, or do I need to do a
scatter plot? I am using excel (most recent version) for mac.



I have lots of data, mostly triplicate results, with some outliers. I would
appreciate help on applying Thompson tau or other outlier removal technique
in Excel.
Thanks!



Hi!! I've been trying with no luck to come up with a way to:

a) find all the outliers in a range of a 1000 rows. The criteria is simply +/- average*2. The outlier can either be above or below.
b) colour the background of these outliers.
c) repeat a&b in the next column.

So in essence, I'm trying to find all values in a table that are twice (positive or negative) that of the average and highlight them.

Example:
A, B
4, 1
4, 1
4, 1
4, 4
4, 1
4, 1
4, 1
12, 1

In this case the outlier for A is 12 and the outlier for B is 4.

Any help, be it in a formula for a given cell, or VBA, is deeply appreciated.

Sincerely
Kurifodo


Using the averaging approach and descriptive statistics can anyone tell me how do i find outliers? i know how to do it using the 5-number summary but cant figure out how to do it this way


thanks.


hi all,

i have a range of numbers in C5:AL5
Not all of the cells are filled in, and some that are filled in are blank.

In AM5 I calculate the average of the data
and in AN5 the standard deviation where the value in the cells in the range are >0

Is it possible to calculate a standard deviation in another cell which excludes those values in the range which are 3 standard deviations away from the value in cell AN5.
Therefore, the standard deviation with the outliers.

If there is another way of doing this, i would be grateful for your help.
Luce


Hello,

When evaluating a set of items for their popularity, I pull popularity metrics from two different sources, aggregating each score into an average. While I am satisfied with this metric (Let's call it the 'popularity score'), I'd like to create a simplified, more meaningful score.

My initial though was to create an index using the following formula:

Code:

(D3/MAX($D:$D))*100)


Where D3 is the individual popularity score and D represents all popularity scores.

I discovered, however, that outliers would have a disruptive effect on the data. For instance, if there where several VERY popular items, other items with acceptable popularity scores would appear very low in relative terms.

Any ideas on how to account for the Outliers while still retaining the meaning in an easy to understand way?

Here is part of the data:

Here is my data (D3:D29):

0.00
0.00
0.00
42.56
0.00
680.99
0.00
0.00
60.80
0.00
18.24
0.00
85.12
170.24
0.00
0.00
297.92
103.36
0.00
0.00
0.00
30.40
18.24
30.40
72.96
0.00

A colleague suggest that I take the lognormal distribution and then take the antilog. Any idea how to implement, or if this is a sound approach?


Hello All,

I'm new to the forum and have not needed to do anything like this before but assume that excel can do this kind of thing.

I have tried to search for a solution everywhere.

OK, I have a large data set of triplicates in column F (example below):

B C D E
F 1
121
INT#6 SYBR Unknown
25.944914 2 122 INT#6 SYBR Unknown 25.577063 3 123 INT#6 SYBR Unknown 24.799803 4 124 INT#6 SYBR Unknown 27.107956 5 125 INT#6 SYBR Unknown 26.962341 6 126 INT#6 SYBR Unknown 27.189684 7 127 INT#6 SYBR Unknown 26.185757 8 128 INT#6 SYBR Unknown 26.0001 9 129 INT#6 SYBR Unknown 26.207401 10 130 INT#6 SYBR Unknown 25.44769 11 131 INT#6 SYBR Unknown 25.850075 12 132 INT#6 SYBR Unknown 26.147488 13 133 INT#6 SYBR Unknown 24.232754 14 134 INT#6 SYBR Unknown 24.747488 15 135 INT#6 SYBR Unknown 24.813463 16 136 INT#6 SYBR Unknown 25.699005 17 137 INT#6 SYBR Unknown 25.42176 18 138 INT#6 SYBR Unknown 25.657625 19 139 INT#6 SYBR Unknown 25.596703 20 140 INT#6 SYBR Unknown 24.756275 21 141 INT#6 SYBR Unknown 24.952007 22 142 INT#6 SYBR Unknown 25.594326 23 143 INT#6 SYBR Unknown 26.192163 24 144 INT#6 SYBR Unknown 24.788391
so one triplicate is 121 to 123 (data in F1:F3) then the next set is 124 to 126 (data in F4:F6).
What I want to do is compare each number within the triplicate to the others within the same triplicate and determine which are +/- 0.5 from the others and label them as outliers.
I have come up with a poor solution that works for most instances but not when two values are 0.5 from each other, neither is marked as an outlier. In this case I would like to the formula to choose the value furthest from the median as the outlier.
I would also ideally like to mark the outliers by somehow highlighting in red.

my poor solution occupies the next 3 columns G,H and I and is very clunky:

G H I 1 =IF(OR($F2($F3+0.5)),1,0) =IF(OR($F2($F4+0.5)),1,0) =SUM(I2:J2) 2 =IF(OR($F3($F2+0.5)),1,0) =IF(OR($F3($F4+0.5)),1,0) =SUM(I3:J3) 3 =IF(OR($F4($F2+0.5)),1,0) =IF(OR($F4($F3+0.5)),1,0) =SUM(I4:J4) 4 =IF(OR($F5($F6+0.5)),1,0) =IF(OR($F5($F7+0.5)),1,0) =SUM(I5:J5) 5 =IF(OR($F6($F5+0.5)),1,0) =IF(OR($F6($F7+0.5)),1,0) =SUM(I6:J6) 6 =IF(OR($F7($F5+0.5)),1,0) =IF(OR($F7($F6+0.5)),1,0) =SUM(I7:J7)
The if the result in column I is 2 then the corresponding value in the data set is an outlier.

Can anyone help?
I hope my explanation of the problem is understandable.

This is really bugging me. If I can get the formula to work then It will save me days of trawling through data picking outliers by eye.

I eagerly await a solution and will try things of my own in the mean time but as you can see I am far from an excel master.

Cheers


Hi everyone...

here is the situation...i need a formula for median to exclude the min and the max value ..that are above 0!!!

i tried this formula ...(ARRAY!) but i get Num error! ..any ideas?


suppose H28:H32 is the data range!

{=+MEDIAN(IF(H28:H32>0,IF(H28:H32>MIN(H28:H32),IF(H28:H32<MAX(H28:H32),H28:H32))))}


Is there a good way of excluding an outlier in an average calculation. In the example below will I exclude 1000 from the average-calculation.

The way to decide excluding-values can either be a percent based on the range or everything that is a higher than a user defined value. It can also be more than one outlier.
A user defined function is OK with me, if it is impossible to use the built-in functions. Any suggestion?

Code:

1
2
3
4
5
6
7
1000
8
9
10
11
12





Hi.

I am analyzing sales figures and need some assistance. I have a variety of products and a variety of territories where these products are sold. Additionally I have Gross Profit Margin figures for each product sold and many of the same products have differing GP % due to flexibility of sales agents setting the sale price.

Ideally I would love to find out what products are the most profitable, along with which areas these products sell for the most effectively.

However, after developing a Pivot Table with products in rows and Territories in Columns, I realized some products have "outlier like" GP%. These products have zero revenue as they were given to the customer as a freebie in a large contract which throws off my analysis on how profitable the product category is.

Would you recommend that I remove outliers and solely calculate based on 'normally' priced products? And second, I am also suppose to analyze large order (contain many products), how do you suggest I do that (especially with consideration to question 1)??

Thanks!!


Can anyone help with a box and whisker related graph?
I can do the 50% quartile method of sorting this out but for a piece of research I need to study the middle 75% and exclude the top and bottom 12.5% of outliers only.
I tried the percentile approach to calculate the top and bottom 12.5% but this of course calculates figures that don't match up to actual data points, unlike the quartile method and median which actually do match up to the data points. As i need my box graph to match up to all the data points in my presentation the percentile approach doesn't work for me.
Does anyone have any idea how this can be done?
Thanks


Hi all,

I'm trying to learn how conditional formatting works, and I kind of have the gist of it but I can't get it to work after calculations. Here's what I'm trying to do:

I want an icon to appear after a sum calculation formula so that I can spot outliers easily. It needs to calculate the sum of 3 numbers (found in L2,M2,N2) and if that sum is between 5 and 22 then let a particular icon appear, but if it's outside that range, then let another icon appear. Icons should appear in cell O2. I'm going to copy the formula down the column so it doesn't need to use absolute references.

I'm not sure how to set it up because I'm confused on my syntax of commands. I know I probably need to use "countif" or "if", but where do I place the "sum" function?


Any help is greatly appreciated.


I'm attempting to do my first forecast, but I seem to be failing miserably. When I use the FORECAST function all of the forecasted totals are much too high. I

Is there a way to exclude outliers or use a moving average..... or am I just using the function completely wrong?

Any pointers would be greatly appreciated, when it's all said and done I would love to take the previous 6 months of processed data and use that to forecast the processed amounts for the next 6 months... but as it stands now these numbers won't be nearly accurate enough.


Hi,

Sorry if this comes off as very basic, but it's such a specific answer i'm looking for, and think this would be the fastest way to solve the problem. Here's my scenario:

I am analyzing data in a lab, and need to filter this data for outliers before taking it any further. The way we've decided to filter the data is to calculate (2 x standard deviation) for each column, and add/subtract this calculation from the average for each column to establish "ceiling" values for upper and lower extremes. Any data point exceeding the ceilings is to be attenuated to the ceiling itself.
For example, suppose a column has an average of 100, and a standard deviation of 50. In this case, the ceiling values would be 0 and 200. If a data point is found to be, say, 213, it would be adjusted to 200. If a data point were e.g. -23, it would be changed to 0. Make sense?

At this point, I am just doing calculating the ceiling values, and checking for outliers by eye. But this is both time consuming, and has an inherently higher risk for error. I want to make a formula that will automate these adjustments, rather than do this by eye.


Here is the approach I'm thinking of, but I may be way off:

For each data point: =IF(A2>(2*StDev(A2:A8)), 2*StDev(A2:A8), ?)

(A2:A8 is, of course, just an example)
(? is where I am stuck)

This is where I'm stuck - I have no idea how to revert the cell to its original value in the case that it ISN'T an outlier and doesn't need to be changed.


Also, I think I can figure this part out eventually, but I'll throw it out there anyway: This If statement actually needs to be 2 in 1, in that it filters both exceedingly low AND high data. I think this will require the OR statement, but I need to study it more at the moment.

Any help is appreciated!


Hello all,

I'm trying to get a standard deviation of data range, excluding the outliers. I'm excluding

I have a set of data containing outliers based on a median and mean. The data is all in one column and the median and mean are each calculated in one cell via the average and mean formulas. in another column, if a value is an outlier i have it marked as "high" "low" or "" depending on whether its high, low, or not an outlier. What i would like to do is have the mean and median functions which currently look like =AVERAGE(A4:A32) exclude data that has "high" or "low" in the same row B column.

Thanks!


I can't see the forest through the trees on this one. My spreadsheet is designed to calculate waiting periods between each step of a multi-step process, i.e. Person A makes an appointment with GP, date of appointment, waiting time? Referral to specialist, date of appointment, waiting time? Diagnosis, date of procedure, waiting time. I am sure you get the drift.


My waiting time calculations utilise the NETWORKDAYS function - 1 and eliminate outliers created by empty cells with =IF(AND(F16

You guys are geniuses so hopefully someone has a good idea of how I can save a lot of time.

I have one worksheet with sales history per product by day. (dates are collumns, products are rows, qtys in the middle, all are values, not formulas). (I'll be using the Excel document as a workbook for future data, too, so whatever I create needs to work no matter what I paste into this raw data sheet.)

Then, I created a worksheet that pulls over the data from the first sheet, but also adds a row of trendline... Averages the 14 previous days' qtys. (so the trendline starts on day 15.)

I would like these numbers to paste back onto the first sheet, overwriting the existing data.

Constraints:
- Every time i do this, there may be up to 500 products, so cutting & pasting manually isn't an option.

- Some of the data points are 0 in the raw data sheet. (and some outliers are way high) Ideally, these would not be calculated in the average/ trendline.

Ideas:
- I could calculate the std dev of the data for each product & for hte average formula put in an IF stmt like if the value is more than 2 stdev of mean not to calculate it? Not sure how to write that formula.

- I could create a macro that will take the average data & paste it over the raw data. (My concern here is that the products' data don't all start on the same day. Also don't knwo how the macro would know to put it in the right day.)

Help is greatly appreciated! Have a good night!


Hi there,
I have a worksheet containing data in columns. For each column, I would like to calculate the standard deviation of the mid-95% of the values i.e. 2.5th percentile to 97.5th percentile (to reduce the effect of "outliers").

So first of all I calculate the values of the 2.5th and 97.5th percentiles:

A102 (2.5th percentile) = PERCENTILE(A1:A100,0.025)
A103 (97.5th percentile) = PERCENTILE(A1:A100,0.975)

Then using an array formula shamelessly plagiarized from this very website, I calculate the standard deviation of the values that lie in that range:

{=STDEV(IF((A1:A100>=A102)*(A1:A100

Hi everyone!

I've got the following code which is meant to find cells with no values, colour the cells in, and then replace the missing value with the average of that column. the average is located in a different sheet

Code:

Sub Outliers()

Dim i, j, row, col As Integer

' Prompting messages
col = Application.InputBox _
             (Prompt:="Please enter the number of rows ", _
               Title:="Row Number", Type:=1)
row = Application.InputBox _
    (Prompt:="Please enter the number of colums ", _
        Title:="Colum Number", Type:=1)

For i = 1 To row
    
     For j = 1 To col
    
        If IsEmpty(ActiveCell) = True Then

'Copy cell value and paste into cell with missing value
            
 Sheets("Part1DATA").Select
            Range("j" & "4").Select
            Selection.Copy
            Sheets("TrainingData").Select
            Range("j" & "i").Select
            Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
            :=False, Transpose:=False

  'Make the cell Red
    
            With Selection.Interior     
                .Pattern = xlSolid
                .PatternColorIndex = xlAutomatic
                .Color = 8420607
                .TintAndShade = 0
                .PatternTintAndShade = 0
            End With
       
        End If
'Move one row down
        ActiveCell.Offset(1, 0).Select 
        
    Next j
'Move up to the top of the table and across
    ActiveCell.Offset(-col, 1).Select 
    
Next i


It throws error 400. However, I have the same code (MissingValues), which colours it all in red, as shown on the file submitted.

The code will later be used to more or less the same extent, except for, it will search for outliers (based on outlier limit) and colouring in the outliers in yellow (hence the name of the sub)

Much appreciate any help!

Kurifodo


Does excel have a calculation for Winsorising or Winsorization

It's a statistical calculation to round up or down outliers vs trimming or removing them from the data pool.

Thanks


Hi,
I'm having a problem with my boxplots and they don't seem to show the correct position of the box itself in regard to outliers and Q1 & Q3 etc. Also I would like to narrow them down a bit if possible.

If anyone would also like to critique my other charts ie have I done them properly I would be very grateful to know.

Cheers
John


A lot of times I will have a table of financial ratios wherein there is a median value calculated as the last column. We will sometimes exclude values in the table if they are outliers. Our convention is to bold those items and then exclude that cell from the formula.

What I would like to be able to do is select the median formulas and then run a macro that would take that selection, look at the formulas in the selection and then select all of the cells that the formulas take into consideration when calculating the median values.

I would then be able to very quickly see that all of the formulas were updated properly. You can go down the list of median values manually, but it is cumbersome with a long list.

The code I have now is below. What I am struggling with is a way to determine the range the median formulas reference and then how to select that total range.

Any help with this would be appreciated. Maybe there is an easier way to accomplish the same thing.


Sub ShowRanges()
Dim rngCell As Range
Dim rngFormulaCellsOnly As Range,
Dim rngResults As Range

' Select only cells with formulas.
Set rngFormulaCellsOnly = Selection.SpecialCells(xlCellTypeFormulas)

' Loop through all the currently selected cells that contain formulas.
For Each rngCell In rngFormulaCellsOnly

' Look at formual in current cell and add to rngResults
' This is the part I am struggling with.






Next rngCell

' Select the cells
rngResults.Select

End Sub