How to delete all except Foreign Language

1

Hi All

I'm stuck and need your help on this

I need to retrieve some very important information from a chat back up (.txt file) with thousands of lines in it.

When opened in Excel, the data in column 'A' has a mix of Date, Numbers, Special characters, Emojis, Empty Rows, English text and a Foreign Language text (Telugu Language).

 (SAMPLE FILE WITH A FEW LINES ATTACHED)

  • As of now, I need to extract just the Foreign Language (Telugu Language) excluding everything.
  • Also would be of great-great help if I could find a solution with which I could select or choose the character types I need to omit or keep. 

Thank you. 

Answer
Discuss

Discussion

Dear
Thanks a lot for your time and effort.
Since I'm just a moderate user of excel, please advice how to use this function.

Thanks once again...
Vijay
Vijay_91 (rep: 2) Jul 25, '17 at 11:01 am
Try it on a worksheet of which you have a copy!
While looking at the worksheet you wish to change (that is the "ActiveSheet") switch to the VBE Editor (Alt+F11 is one way). Place the cursor anywhere within the sub I supplied. Press F5. If your worksheet has many thousands of rows there may be a lag while the macro works. Try it on a smaller volume first to get a feel for the time.
I made this sub "Private" because I don't make anything public that doesn't need to be public. If you wish to call this sub from the worksheet (without opening the VBE) remove the word "Private" in the declaration (first line). Google for how to call a macro from the worksheet and follow one of the options you will find.
Variatus (rep: 2165) Jul 25, '17 at 9:26 pm
Thanks Variatus
I could shorten the lines from 18k to 6k by executing your code.
As you've mentioned earlier, it's not a perfect result though.It still has many special characters, English text, and Emojis.
I'm attaching the result sheet ("chat extract") for your reference. I Request you to further refine the code if you can to eliminate all the mentioned above.
I have some more data sheets which are multiple times of the size I've tried here.
It'll be of great help if you could provide me a refined version.
Regards
Vijay 
Vijay_91 (rep: 2) Jul 26, '17 at 7:19 am
I'm glad the code helped you. Actually, I also gave you the tools to improve the result, and told you how to use them. So, it's you time or mine, right :-)
As you whittle the problem down you may come across specific problems. Once you find the question I might be able to come up with an answer.
Variatus (rep: 2165) Jul 26, '17 at 10:25 pm
Hi Variatus

Sure will try to refine it by myself.
Will get to you if stuck anywhere... :)

Thanks and Regards
Vijay
Vijay_91 (rep: 2) Jul 27, '17 at 1:17 am
Add to Discussion

Answers

1

The code below will somewhat do the job.

Private Sub TeluguOnly()

    Dim Exceptions() As Variant
    Dim Txt As String
    Dim Ch As Long
    Dim i As Long, f As Long
    Dim Rl As Long
    Dim R As Long
    
    Exceptions = Array(8206, 8209, 8234, 8236, -8194, -8253, -8676, -8677, -8698, _
                      -9072, -9105, -10176, -10179, -10180)
    Application.ScreenUpdating = False
    With ActiveSheet
        Rl = .Cells(.Rows.Count, "A").End(xlUp).Row
'        For R = 47 To 1 Step -1                 ' for testing
        For R = Rl To 1 Step -1                 ' no header row
            Txt = .Cells(R, "A").Text
            If Len(Txt) Then
                For i = Len(Txt) To 1 Step -1
                    Ch = AscW(Mid(Txt, i, 1))
                    If (Ch > 255) Or (Ch < 0) Then
                        For f = UBound(Exceptions) To 0 Step -1
                            If Ch = Exceptions(f) Then Exit For
                        Next f
                        If f < 0 Then Exit For
                    End If
                Next i
            Else
                i = 0
            End If
            If i Then
'                Debug.Print R, Ch, Txt     ' for testing
'                Exit For                   ' for testing
            Else
                .Rows(R).EntireRow.Delete
            End If
        Next R
    End With
    Application.ScreenUpdating = True
End Sub

"Somewhat" because it isn't perfect. Basically, it looks for unicoe characters. If it finds one it determines that the line is in Telugu language. While all Telugu characters are Unicode, not all Unicode characters are Telugu. Hence the list of exceptions. There are characters 8234 and 8236 which appear to be blank spaces which I found in lines 114:117. Most of the others I have identified are emojis. There are over a thousand emojis, and I have identified a dozen or less.

So, I have left some lines of code for testing purposes. The line "For R = 47 to 1 Step -1" can be activated only when  the one below it is deactivated, and v.v. The intention, actually, is to look at line 47 in this case. It showed up as Telugu and I wanted to know which Unicode character was in it.

The other two lines marked "for testing" will print information about one line and then stop the program. The info will be printed to the Immediate window (Ctl+G in the VB Editor). It will print the row number, the Unicode number of the character which was found, and the line of text. That is how I found the emoji Unicode numbers and added them to Exceptions. You can find more recurring numbers like that and improve the filter eventually.

When the program isn't interrupted it will delete all lines which have no Unicode characters in them (exceptions and blank lines included). It will do its majic on the ActiveSheet (even if it is the one you didn't mean to mutilate), and it will work on the text in column A.

Discuss

Answer the Question

You must create an account to use the forum. Create an Account or Login