Ai Chronicles: Technical Support Edition - Part 3


Introduction

You can find Part 2 of this series here.

Now that we have the transcript of the video, the next step is to summarize it into chunks that, first of all, make sense and, secondly, are short enough to be useful. For this you can use any of the summarization models out there; I will be using facebook/bart-large-cnn from Hugging Face for this example.
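
To give a quick idea of what calling it looks like, here is a minimal sketch using the Hugging Face transformers pipeline (the sample text below is just a placeholder):

from transformers import pipeline

# load the summarization model once; the first run downloads the weights
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

sample = (
    "Scikit-learn is a machine learning library for Python. It provides a "
    "consistent interface to many common algorithms and is built on top of "
    "NumPy and SciPy, which makes it easy to plug into existing code."
)

# max_length and min_length are token limits for the generated summary
print(summarizer(sample, max_length=50, min_length=10, do_sample=False)[0]["summary_text"])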

It is one of the best models out there for summarization and, as you can see, it is very easy to use. Before we get to the how, let's discuss why we need to summarize the video in the first place.

Why Summarize?

In the previous article we went from video to audio to transcript. Now we have the transcript of the video, but it is still a lot of text to go through. That is fine if you only have to read it once, or if you just have a small dataset of videos. But manually hunting for keywords or passages in a transcript is not the way we want to do it: that process is slow and tedious, and automation is the key here.

As support engineers we should always try to automate as much as possible, so we can help our customers as quickly as possible.

So why not automate this process as well? This is where summarization comes in. We can use a summarization model to condense the transcript into a short summary that lets us quickly get the gist of the video and find the information we need.

This works in two parts: we get one summary that gives the overall context of the video, and we get a summary of each segment of the video. That way, when we later use a prompt to search through our knowledge base, we first find out which video relates to the user's query, and then go through the summary of the matching segment to get more details. That segment in turn points to the part of the transcript, and therefore the part of the video, where the answer to the query is. And that completes the loop.
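
To make that loop a little more concrete, here is a rough sketch of the pieces we end up keeping per video; the field names are purely illustrative, not from the final code:

# illustrative only: how the whole-video summary and the per-segment summaries
# relate back to the transcript (and therefore to a position in the video)
video_record = {
    "video": "scikit-learn-talk.mp4",       # hypothetical file name
    "whole_summary": "Overall context of the video ...",
    "segments": [
        {
            "transcript_chars": (0, 3000),   # which slice of the transcript this covers
            "summary": "Summary of the first segment ...",
        },
        # ... one entry per chunk of the transcript
    ],
}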

Summarization

But we also don't want a summary that is so short it no longer makes sense, so we need to find a balance between detail and length. For the video we are using in this example (you can find the details in the previous article), the transcript came out to be around 28,000 characters long, which works out to roughly 6k tokens in AI terms.
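
If you want an exact count rather than a rough estimate, you can run the transcript through the model's own tokenizer. A quick sketch (the file name here is just a placeholder for whichever transcript you want to measure):

from transformers import AutoTokenizer

# use the same tokenizer as facebook/bart-large-cnn so the count matches the model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

with open("transcripts/video.mp3-transcript.txt") as f:
    transcript = f.read()

# our ~28,000-character transcript comes out to roughly 6k tokens
print(len(tokenizer.encode(transcript)))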

So, what we will do is take the text in batches of around 3,000 characters, summarize each batch, and then combine the summaries of all the batches to get the final summary. This way we have two levels of summary: one for the whole video and one for each batch. We can quickly go through the summary of the whole video, and if we need more details we can go through the summary of a batch.

We can split the text using the LangChain text splitter, which is a great library for splitting text into chunks, as it takes the chunk size and also the overlap you want between chunks. But to keep things simple we will just split the text into chunks of 3000 characters ourselves. For reference, the LangChain version looks like this:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
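
If you do go the LangChain route, you would then call split_text on the transcript to get the chunks back; a quick sketch, assuming transcript already holds the transcript text read from file:

# returns a list of overlapping text chunks
chunks = text_splitter.split_text(transcript)
print(f"{len(chunks)} chunks")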

Code For Summarization

macOS and Linux Version

from transformers import pipeline
import os

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def summarize():
    os.chdir(os.getcwd() + "/transcripts")
    for file in os.listdir(os.getcwd()):
        if file.endswith(".mp3-transcript.txt"):
            with open(file, "r") as f:
                txt = f.read()
                txtChunks = splitText(txt)
                txtSummaryChunks = []
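                # summarize each chunk; max_length and min_length are token limits for each summary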
                for chunk in txtChunks:
                    txtSummaryChunks.append(
                        summarizer(
                            chunk, max_length=500, min_length=30, do_sample=False
                        )[0].get("summary_text")
                    )

                # join the chunk summaries with a space so sentences don't run together
                segmentSummary = " ".join(txtSummaryChunks)
                # summarize the combined chunk summaries to get one summary of the whole video
                wholeSummary = summarizer(segmentSummary)[0].get("summary_text")

                # save both summaries in the transcripts folder
                with open(file + "-segment-summary.txt", "w") as out:
                    out.write(segmentSummary)

                with open(file + "-whole-summary.txt", "w") as out:
                    out.write(wholeSummary)


def splitText(txt):
    # split the txt into buckets of 3000
    txtChunks = []
    for i in range(0, len(txt), 3000):
        txtChunks.append(txt[i : i + 3000])
    return txtChunks


def main():
    summarize()


if __name__ == "__main__":
    main()

Windows Version

from transformers import pipeline
import os

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def summarize():
    os.chdir(os.getcwd() + "\\transcripts")
    for file in os.listdir(os.getcwd()):
        if file.endswith(".mp3-transcript.txt"):
            with open(file, "r") as f:
                txt = f.read()
                txtChunks = splitText(txt)
                txtSummaryChunks = []
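                # summarize each chunk; max_length and min_length are token limits for each summary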
                for chunk in txtChunks:
                    txtSummaryChunks.append(
                        summarizer(
                            chunk, max_length=500, min_length=30, do_sample=False
                        )[0].get("summary_text")
                    )

                # join the chunk summaries with a space so sentences don't run together
                segmentSummary = " ".join(txtSummaryChunks)
                # summarize the combined chunk summaries to get one summary of the whole video
                wholeSummary = summarizer(segmentSummary)[0].get("summary_text")

                # save both summaries in the transcripts folder
                with open(file + "-segment-summary.txt", "w") as out:
                    out.write(segmentSummary)

                with open(file + "-whole-summary.txt", "w") as out:
                    out.write(wholeSummary)


def splitText(txt):
    # split the txt into buckets of 3000
    txtChunks = []
    for i in range(0, len(txt), 3000):
        txtChunks.append(txt[i : i + 3000])
    return txtChunks


def main():
    summarize()


if __name__ == "__main__":
    main()

Let's look at the generated summaries:

Whole Video Summary:

Scikit-learn is a set of well-known packages for martial learning code. Python is getting more and more packages for computational science. Scikit is based on NumPy and SciPy as you know. It offers support for large scale computation out of the box. It can be integrated with NLTK, that is natural language toolkit.

Segments Summary:

Machine learning is about algorithms that are able to analyze, to crunch the data, and in particular, to learn the data from the data. Machine learning is almost related to data analysis techniques.Scikit-learn is a set of well-known packages for martial learning code. Python is getting more and more packages for computational science. Python provides unique programming language across different applications.Scikit-learn, PyML, Natural Language Toolkit, NLTK, sometimes called the Shugen Martial Learning Toolbox. Spark Martial Learning Lib, PyBrain, MLPy.Scikit-learn is a martial learning library and its goal is to provide a set of common algorithms to Python users through a consistent interface. It includes all the batteries necessary for general purpose martial learning code. Feature selection, feature extraction algorithms, martial learning algorithms in general in different settings.Scikit is based on NumPy and SciPy as you know. All the data are usually represented as matrices and vectors. So the data comes, the training data come in this flavor and under the hood it is implemented by SciPy.sparse matrices.Scikit has a great package to handle the data sets. Actually these particular data sets are very well known in many fields and is already embedded in the Scikit-learn library. In Scikit, few lines of code. We import the data set. We call the KNN classifier algorithm. In this case, we select N neighbors equals to one. Then we train our model.The interface for the two algorithms is exactly the same even if the machine learning settings are completely different. In the k-means, in this case, we want three clusters because we're going to predict three different species for the iris. The confusion matrix is a matrix where it's the number of, it has a square matrix where the rows and the columns corresponds to the numbers you want to predict.Scikit-learn is a Python library for machine learning. It offers support for large scale computation out of the box. It can be integrated with NLTK, that is natural language toolkit.It is powerful and in my opinion easy to use, very efficient implementation provided it's based on NumPy, Scipy and Sighten under the hood. It is highly integrated for example in NLTK or Scikit image just to make an example. We have six minutes left for your questions. Please raise your hand and I'll come by with a microphone.In general, you apply data normalization steps. If you find the right model you want to use, then you're required to find the best settings for that model. In that case, you might end up using the grid search method, for instance.

And this looks pretty good. We have a summary of the whole video and a summary of each segment, and all we had to do was copy the video into our app and run the script.

Conclusion

So, in our journey to make full use of AI to help us with our daily tasks, we have come a long way. We started with a video, then extracted its audio, then got the transcript of the audio, and finally summarized the transcript.

Next Steps

In the next article we will see how we can convert our summaries into vector embeddings and then store them in a Postgres database, which will later act as our search engine.

Stay tuned for the next article in this series.
