Social Media: Monitoring and Analysis System

CHALLENGE
The System is an automatically composed news feed. It monitors user activity and generates a customized feed based on each user’s interests.

The System continuously collects information from more than 50 000 sites in the English-language part of the Web and processes an average of 80 000 new articles per day. It also updates data on more than 1 million existing articles.

The client requested an expert review from Byndyusoft after one of the leading custom software developers in Eastern Europe failed to design and implement a scalable solution with sufficient performance.


Problems in version 1.0:

  • The System was a set of isolated projects with no connection to each other.
  • Data delivery required manually adjusting each project for every individual site, which made it impossible to increase the inflow of new articles.
  • Data processing did not provide the required integrity: up to 50% of articles disappeared, and processing errors were not logged properly.
  • Because of the monolithic architecture, cheap horizontal scaling was impossible. All subsystems could run only on a single server, in a single thread.
  • Data processing lagged behind the inflow of new data, and no optimization had been performed.
  • The content shown to users was irrelevant, and the algorithms contained major errors.

Causes of the problems:

  • The project had no unified architecture. Every service was developed separately, and the services could interact with each other only through the shared DB.
  • Use of the shared DB resulted in constant locking, which caused all parts of the system to hang.
  • The system depended on external services that limited the number of API requests and did not ensure the necessary data integrity.
  • Many services were untested and had critical bugs and memory leaks.

Version 1.0 was developed over about half a year by a team of 4–6 highly paid IT specialists.

After Byndyusoft took over the development, the tasks for the first six months were stated as follows:

  • Ensure continuous, pipelined delivery of data.
  • Make the data delivery services and front-end sites scalable.
  • Ensure the integrity of downloaded data, which required increasing delivery by a factor of 8–10.
  • Improve the quality of data processing and add the missing services. The main services included stripping advertisements from text, detecting similar images, detecting similar texts, etc.
  • Completely rewrite the user interface and optimize the speed of the endless feed for all modern browsers and mobile devices.
  • Integrate with social networks, Facebook in particular, and release a Facebook application.
  • Provide fast and relevant content output to users.
  • Ensure fault tolerance and replication of the whole system.

The Byndyusoft team drew conclusions from the design mistakes made by the previous team. Together with extensive experience in developing high-load systems and professional use of agile development practices, this made it possible to create a new version of the System that met all of the client’s requirements in the shortest time possible.

SOLUTION

Within 6 months, a team of 6 people delivered a working system that met all the assigned objectives.


Pipeline for processing articles and images

The Byndyusoft team designed and built a pipeline covering the full article processing cycle:

  1. Downloading and parsing the article list from the source (RSS feed or a custom API for the source) and obtaining links to the articles.
  2. Resolving all redirects to obtain the final link to each article.
  3. Downloading the HTML page with the article.
  4. Locating the article body.
  5. Tagging the article: analyzing the text and extracting keywords.
  6. Stripping advertisements and "garbage" sections from the article.
  7. Locating and downloading images and videos related to the article.
  8. Checking whether the article text exactly matches or closely resembles texts already in the system, so that the user feed contains no duplicates.
  9. Checking whether the images exactly match or closely resemble images already in the system, so that the user feed contains no duplicates.
  10. Checking in a browser whether the original article can be displayed in an iframe.
  11. Collecting the article’s social metrics: the number of Facebook likes and reposts, the number of comments on the article, and the number of tweets linking to it.
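
The case study does not say how the text comparison in step 8 was implemented. As an illustration only, the sketch below shows one common approach: a hash of the normalized text for exact matches, plus Jaccard similarity over word shingles for near matches. The function names, the shingle size, and the 0.8 threshold are assumptions chosen for the example, not details of the actual System.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def fingerprint(text):
    """SHA-256 of the normalized text; equal values mean an exact match."""
    joined = " ".join(normalize(text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word n-grams used for the similarity check."""
    words = normalize(text)
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of common elements between two sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(new_text, known_texts, threshold=0.8):
    """True if new_text matches or closely resembles any known text.

    The threshold is an illustrative value, not one from the case study.
    """
    new_fp = fingerprint(new_text)
    new_sh = shingles(new_text)
    for known in known_texts:
        if fingerprint(known) == new_fp:
            return True  # exact match after normalization
        if jaccard(new_sh, shingles(known)) >= threshold:
            return True  # near-duplicate
    return False
```

A linear scan over all known texts is only practical for small collections. At the volumes described here, an index such as MinHash with locality-sensitive hashing would be a more realistic choice, but that is beyond this sketch.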

Architecture and horizontal scalability

The System is built as a scalable set of services that interact through a common data bus (RabbitMQ). Database reads and writes are carefully divided among the services so that they do not block each other. All information needed to serve content to users is copied to a MongoDB cluster to speed up data output.
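
To make the idea concrete, here is a minimal in-process sketch of services that communicate only through queues, with a single service owning all writes to storage. It uses Python's standard `queue` and `threading` modules as a stand-in for RabbitMQ, and a plain dictionary as a stand-in for the database. The service names and message fields are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def tagger(inbox, outbox):
    """Stage service: reads raw articles, adds tags, publishes the result."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        words = msg["text"].lower().split()
        # Toy tagging rule for the example: keep words longer than 5 letters.
        outbox.put({**msg, "tags": sorted({w for w in words if len(w) > 5})})


def writer(inbox, storage):
    """The only service that writes to storage, so writes never contend."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        storage[msg["url"]] = msg


def run_pipeline(articles):
    """Wire the services together with queues and process all articles."""
    raw, tagged = queue.Queue(), queue.Queue()
    storage = {}
    workers = [
        threading.Thread(target=tagger, args=(raw, tagged)),
        threading.Thread(target=writer, args=(tagged, storage)),
    ]
    for worker in workers:
        worker.start()
    for article in articles:
        raw.put(article)
    raw.put(STOP)
    for worker in workers:
        worker.join()
    return storage
```

Because the stages share nothing except messages, several stage workers could read from the same queue to scale out, which is the property the bus-based design relies on.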

The mechanism for delivering data to the UI was designed and developed from scratch, with the main focus on fast and relevant output of content to users. All front-end sites synchronize their caches with each other and interact with the delivery services over the common bus. As a result, a new article may appear in users’ feeds before it is physically saved in the DB.
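
The following sketch illustrates why an article can reach the feed before it is saved: front-end caches and the database writer all subscribe to the same bus, and the writer persists asynchronously. `Bus`, `FeedCache`, and `DeferredWriter` are hypothetical names for this example, and the in-process bus is only a stand-in for a real message broker.

```python
from collections import deque


class Bus:
    """Tiny in-process publish/subscribe bus (stand-in for a real broker)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


class FeedCache:
    """Per-front-end cache of the newest articles, newest first."""

    def __init__(self, bus, capacity=100):
        # With maxlen set, appendleft drops the oldest item when full.
        self._items = deque(maxlen=capacity)
        bus.subscribe(self._on_article)

    def _on_article(self, article):
        self._items.appendleft(article)

    def latest(self, count=10):
        return list(self._items)[:count]


class DeferredWriter:
    """Buffers articles from the bus and saves them only on flush()."""

    def __init__(self, bus):
        self.pending = []
        self.saved = {}
        bus.subscribe(self.pending.append)

    def flush(self):
        for article in self.pending:
            self.saved[article["url"]] = article
        self.pending.clear()
```

In this sketch, after `bus.publish(article)` every `FeedCache` already returns the article from `latest()`, while `DeferredWriter.saved` stays empty until `flush()` is called.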

A powerful tool for analyzing data and tracking the current operational state of the entire project infrastructure was built on Zabbix and in-house projects.


Android client

The API that the web servers provide for the site UI made it possible to build an Android application without any server-side changes. As a result, the main and mobile versions of the site UI and the Android application all use the same API.

RESULT

The work done by the Byndyusoft team gave the client the following advantages:

  1. Byndyusoft delivered a working solution that fully satisfied the client. The first working versions were presented just one month after development began, while the previous team had not produced a working version in 6 months of work on the project.
  2. Byndyusoft met all requirements for the quality and speed of data processing within 6 months and, at the same time, ensured a performance margin that later allowed the initial set of data sources to be extended.
  3. Setting up fault tolerance and testing different failure scenarios made it possible to avoid interruptions in system operation during real equipment failures and emergencies.
  4. The duration of daily consultations with the client’s representatives fell from 2 hours in the first days of the project to 15 minutes by the second month of development.

Key figures:

  • 250,000 posts per day
  • 1 TB in MS SQL Server
  • 8,000 messages per second in RabbitMQ
Web · Cloud · SaaS · Microservices · Big Data